Closed zachjweiner closed 5 years ago
Hi Zach, thanks for reporting this, I'll look into it.
Okay, I found the issue: it's a very silly bug (the return keyword was missing on line 88 of h5pickle/__init__.py). I'll release a new version and might look into creating a test suite next month.
I have pushed and tagged version 0.4 on master and released a new version to PyPI: https://pypi.org/manage/project/h5pickle/release/0.4/
Thank you for reporting this :)
Thanks for the (very quick!) fix!
Small note - it seems some debugging printing made it into the release (line 110). (I often open several hundred files at once, so I'd be happy to fork and modify if you don't want to push to PyPI again.)
That's what I get for trying to be quick. Upgrading a minor version is no issue; I can do it soon. If you wish to use >100 files it might pay off to play with the LRUcache parameters; I think 100 is the max now.
PRs are also welcome, if you'd like to see other features.
There is now a 0.4.1 on PyPI without the print statement.
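To illustrate the LRU cache tip above, here is a minimal standard-library sketch of a bounded cache of open file handles. It is not h5pickle's actual implementation: the class name `LRUHandleCache`, its `opener` parameter, and the close-on-eviction behaviour are assumptions made for illustration. It only shows why a limit smaller than the number of files in use could cause trouble.

```python
from collections import OrderedDict


class LRUHandleCache:
    """Hypothetical bounded cache of open file handles (illustration only)."""

    def __init__(self, maxsize, opener=open):
        self.maxsize = maxsize
        self.opener = opener            # callable that turns a path into a handle
        self._handles = OrderedDict()   # least recently used entry comes first

    def get(self, path):
        if path in self._handles:
            # Cache hit: mark this handle as the most recently used one.
            self._handles.move_to_end(path)
            return self._handles[path]
        handle = self.opener(path)
        self._handles[path] = handle
        if len(self._handles) > self.maxsize:
            # Over capacity: evict and close the least recently used handle.
            _, oldest = self._handles.popitem(last=False)
            oldest.close()
        return handle
```

Under these assumptions, any code still holding a reference to an evicted handle would find it closed, which is why raising the limit above the number of simultaneously open files avoids the problem.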
Thanks again, and for the tip about the LRUcache - it was indeed a problem, and a simple hack is to just import and redefine cache:
import h5pickle as h5py
h5py.cache = h5py.LRUFileCache(250)
# load files
However, I'm getting the same "Can't read data (wrong B-tree signature)" errors as with regular h5py when reading files within a multiprocessing.Pool. It's quite possible I'm misunderstanding the purpose of h5pickle - should I expect to be able to read files from multiple threads in a pool?
I mostly used h5pickle to open a single file multiple times (the Single Writer Multiple Reader model), and I don't have much experience with opening hundreds of files in multiple processes.
Are you writing to the files? If not, it might help to explicitly open them with the 'r' mode if you are not already. Are multiple processes opening the same file? If not, you are better off sending the filenames and opening with h5py locally per process.
https://github.com/pandas-dev/pandas/issues/12236 might be relevant for you as well
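The suggestion to send filenames and open the files locally in each process could look roughly like the sketch below. The function names `read_shape` and `read_all` and the `dataset` argument are hypothetical, and whether this avoids the B-tree error in the setup described above is untested.

```python
from functools import partial
from multiprocessing import Pool


def read_shape(path, dataset):
    """Open one file inside the worker, read what is needed, then close it."""
    import h5py  # third-party; imported in the worker that actually uses it

    # Read-only mode, so no worker ever holds a writable handle.
    with h5py.File(path, "r") as f:
        return path, f[dataset].shape


def read_all(paths, dataset, processes=4):
    """Send only filenames (plain strings) to the pool, never open file objects."""
    with Pool(processes=processes) as pool:
        return pool.map(partial(read_shape, dataset=dataset), paths)
```

Calling `read_all(filenames, "some/dataset")` from under an `if __name__ == "__main__":` guard returns one `(path, shape)` tuple per file. Only strings and tuples cross the process boundary, so no HDF5 handle needs to be pickled.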
Currently, importing a second (third, etc.) file only loads the original (first) file again.
Reproducing code:
results in the list