DaanVanVugt / h5pickle

Wrapper for h5py with pickle capabilities
MIT License

Can't open more than one file #7

Closed zachjweiner closed 5 years ago

zachjweiner commented 5 years ago

Currently, opening a second (third, etc.) file only returns the original (first) file again.

Reproducing code:

from h5py import File

names = ['a', 'b', 'c']
for name in names:
    f = File(name+'.h5', 'w')
    f.close()

from h5pickle import File

files = [File(name+'.h5', 'r') for name in names]

results in the list

[<HDF5 file "a.h5" (mode r)>,
 <HDF5 file "a.h5" (mode r)>,
 <HDF5 file "a.h5" (mode r)>]

DaanVanVugt commented 5 years ago

Hi Zach, thanks for reporting this, I'll look into it.

DaanVanVugt commented 5 years ago

Okay, I found the issue. It's a very silly bug (the return keyword was missing on line 88 of h5pickle/__init__.py). I'll release a new version and might look into creating a test suite next month.
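
For readers wondering how a missing return can produce exactly this symptom, here is a minimal sketch. It is not the actual h5pickle code: `make_key_buggy`, `make_key_fixed` and `cached_open` are hypothetical names, and plain strings stand in for file objects. The assumption is only that opened files are cached under a key derived from the call arguments.

```python
def make_key_buggy(*args, **kwargs):
    # Hypothetical helper: the key is computed but never returned,
    # so every call implicitly returns None.
    key = (args, tuple(sorted(kwargs.items())))


def make_key_fixed(*args, **kwargs):
    # Same helper with the missing return added.
    return (args, tuple(sorted(kwargs.items())))


def cached_open(make_key, cache, name, mode="r"):
    """Return a cached stand-in 'file' for (name, mode), creating it if needed."""
    key = make_key(name, mode)
    if key not in cache:
        # A string stands in for a real file object in this sketch.
        cache[key] = f'<file "{name}" (mode {mode})>'
    return cache[key]


names = ["a.h5", "b.h5", "c.h5"]

# With the buggy helper every key is None, so the first entry is reused.
buggy = [cached_open(make_key_buggy, {}, n) for n in names[:1]]
shared = {}
buggy = [cached_open(make_key_buggy, shared, n) for n in names]

# With the fixed helper each (name, mode) pair gets its own entry.
fixed_cache = {}
fixed = [cached_open(make_key_fixed, fixed_cache, n) for n in names]
```

With the buggy helper all three lookups hit the single `None` entry, which matches the list of three "a.h5" objects reported above.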

DaanVanVugt commented 5 years ago

I have pushed and tagged version 0.4 on master and released a new version to PyPI: https://pypi.org/manage/project/h5pickle/release/0.4/

Thank you for reporting this :)

zachjweiner commented 5 years ago

Thanks for the (very quick!) fix!

Small note - it seems some debugging printing made it into the release (line 110). (I often open several hundred files at once, so I'd be happy to fork and modify if you don't want to push to PyPI again.)

DaanVanVugt commented 5 years ago

That's what I get for trying to be quick. Releasing another minor version is no issue; I can do it soon. If you want to use more than 100 files, it might pay off to play with the LRUcache parameters; I think 100 is the max now.

PRs are also welcome, if you'd like to see other features.

DaanVanVugt commented 5 years ago

There is now a 0.4.1 on PyPI without the print statement.

zachjweiner commented 5 years ago

Thanks again, and for the tip about the LRUcache - it was indeed a problem, and a simple hack is to just import and redefine cache:

import h5pickle as h5py
h5py.cache = h5py.LRUFileCache(250)
# load files
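
To illustrate why the cache size matters when several hundred files are open at once, here is a minimal sketch of a least-recently-used cache of open handles. It is not h5pickle's `LRUFileCache`: the `TinyLRUHandleCache` and `FakeHandle` classes are hypothetical, and the sketch assumes that a handle is closed when it is evicted.

```python
from collections import OrderedDict


class FakeHandle:
    """Stand-in for an open file; only tracks whether it has been closed."""

    def __init__(self, name):
        self.name = name
        self.is_open = True

    def close(self):
        self.is_open = False


class TinyLRUHandleCache:
    """Hypothetical LRU cache that closes a handle when it is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._handles = OrderedDict()
        self.closed = []  # names of evicted handles, kept for illustration

    def get(self, name, opener):
        if name in self._handles:
            # Cache hit: mark the handle as most recently used.
            self._handles.move_to_end(name)
            return self._handles[name]
        handle = opener(name)
        self._handles[name] = handle
        if len(self._handles) > self.capacity:
            # Evict the least recently used handle and close it.
            old_name, old_handle = self._handles.popitem(last=False)
            old_handle.close()
            self.closed.append(old_name)
        return handle


cache = TinyLRUHandleCache(capacity=2)
handles = [cache.get(n, FakeHandle) for n in ["a.h5", "b.h5", "c.h5"]]
print([h.is_open for h in handles])  # [False, True, True]
```

Under that assumption, holding references to more files than the cache capacity leaves the oldest ones closed, which is why raising the capacity above the number of files helps.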

However, I'm getting the same "Can't read data (wrong B-tree signature)" errors as with regular h5py when reading files within a multiprocessing.Pool. It's quite possible I'm misunderstanding the purpose of h5pickle - should I expect to be able to read files from multiple worker processes in a pool?

DaanVanVugt commented 5 years ago

I mostly used h5pickle to open a single file multiple times (the Single Writer Multiple Reader model), and I don't have much experience with opening hundreds of files in multiple processes.

Are you writing to the files? If not, it might help to open them explicitly in 'r' mode, if you are not already doing so. Are multiple processes opening the same file? If not, you are better off sending the filenames to the workers and opening each file with h5py locally in each process.
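
The "send filenames, open locally" suggestion could look roughly like the sketch below. It assumes each file holds a dataset called "data"; that name, and the helper names `read_shape` and `read_all`, are made up for illustration. Only paths and plain Python values cross the process boundary, so no open h5py handle is ever pickled.

```python
import multiprocessing as mp
import os
import tempfile

import h5py


def read_shape(path, dataset="data"):
    """Open one file inside the worker and return plain Python data."""
    # The handle is created and closed within this process only.
    with h5py.File(path, "r") as f:
        return path, f[dataset].shape


def read_all(paths, processes=4):
    """Map read_shape over the paths using a pool of worker processes."""
    with mp.Pool(processes) as pool:
        return pool.map(read_shape, paths)


if __name__ == "__main__":
    # Build a few small example files in a temporary directory.
    tmpdir = tempfile.mkdtemp()
    paths = []
    for i, name in enumerate(["a", "b", "c"]):
        path = os.path.join(tmpdir, name + ".h5")
        with h5py.File(path, "w") as f:
            f.create_dataset("data", data=list(range(i + 1)))
        paths.append(path)

    for path, shape in read_all(paths, processes=2):
        print(os.path.basename(path), shape)
```

Each worker returns a small tuple rather than an h5py object, which keeps the result cheap to send back to the parent process.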

https://github.com/pandas-dev/pandas/issues/12236 might be relevant for you as well