HDFGroup / h5serv

Reference service implementation of the HDF5 REST API

dealing with 65G hdf5 file #103

Closed deter3 closed 7 years ago

deter3 commented 7 years ago

When I moved a 65GB HDF5 file into the data folder and ran

"curl -X GET -H "host:sift_foursquare_shinjuku.rich.hdfgroup.org" http://127.0.0.1:5000/datasets"

the server log kept looping over all the data like below for hours, and I could not get any response. h5pyd had the same problem when using f = h5pyd.File("sift_foursquare_shinjuku.rich.hdfgroup.org",endpoint="http://xx.xx.xx.xx:xx",mode="r")

INFO:hdf5db.py:570::visit: 4b058799f964a5209c9b22e3/3204 name: Dataset
INFO:hdf5db.py:570::visit: 4b058799f964a5209c9b22e3/3205 name: Dataset
INFO:h5watchdog.py:45::H5EventHandler -- Modified file: ../data/rich/sift_foursquare_shinjuku.h5
INFO:hdf5db.py:570::visit: 4b058799f964a5209c9b22e3/3206 name: Dataset
INFO:hdf5db.py:570::visit: 4b058799f964a5209c9b22e3/3207 name: Dataset
INFO:hdf5db.py:570::visit: 4b058799f964a5209c9b22e3/3208 name: Dataset
INFO:h5watchdog.py:45::H5EventHandler -- Modified file: ../data/rich/sift_foursquare_shinjuku.h5
INFO:hdf5db.py:570::visit: 4b058799f964a5209c9b22e3/3209 name: Dataset
INFO:hdf5db.py:570::visit: 4b058799f964a5209c9b22e3/321 name: Dataset
INFO:h5watchdog.py:45::H5EventHandler -- Modified file: ../data/rich/sift_foursquare_shinjuku.h5
INFO:h5watchdog.py:45::H5EventHandler -- Modified file: ../data/rich/sift_foursquare_shinjuku.h5

Do you know how I can fix it?

jreadey commented 7 years ago

That's strange. I need to see if there's some bad interaction between the watchdog timer and large files. In the meantime, try this:

  1. Shut down h5serv
  2. Delete the .toc.h5 file in the data directory
  3. In config.py, under server, set background_timeout to 0 (see the sketch below)
  4. Start up h5serv ($ python app.py)
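
For reference, the change in step 3 would look roughly like this; only the background_timeout key comes from this thread, and the enclosing dict name and other entries are assumptions that may differ in your h5serv version:

    # config.py -- sketch only; the dict name and surrounding keys are assumptions
    cfg = {
        # ... other server settings ...
        'background_timeout': 0,   # 0 disables the background file-scanning task
    }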

Let me know if that works.

deter3 commented 7 years ago

I did everything according to the instructions above; here is the error message, using

f = h5pyd.File("sift_foursquare_shinjuku.rich.hdfgroup.org",endpoint="http://xx.xx.xx.xx:5000",mode="r")

&

curl -X GET -H "host:sift_foursquare_shinjuku.rich.hdfgroup.org" http://127.0.0.1:5000/datasets

h5json (1.0.2), h5py (2.6.0), h5pyd (0.1.0), h5serv (0.2)

ERROR:tornado.application:Uncaught exception GET / (xx.xx.xx.xx) HTTPServerRequest(protocol='http', host='sift_foursquare_shinjuku.rich.hdfgroup.org', method='GET', uri='/', version='HTTP/1.1', remote_ip='xx.xx.xx.xx', headers={'Host': 'sift_foursquare_shinjuku.rich.hdfgroup.org', 'User-Agent': 'python-requests/2.9.1', 'Connection': 'keep-alive', 'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate'})
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/tornado/web.py", line 1413, in _execute
    result = method(*self.path_args, **self.path_kwargs)
  File "app.py", line 2809, in get
    response = self.getRootResponse(self.filePath)
  File "app.py", line 2768, in getRootResponse
    rootUUID = db.getUUIDByPath('/')
  File "build/bdist.linux-x86_64/egg/h5json/hdf5db.py", line 712, in getUUIDByPath
    self.initFile()
  File "build/bdist.linux-x86_64/egg/h5json/hdf5db.py", line 546, in initFile
    self.dbGrp = self.f.create_group("__db__")
  File "/usr/local/lib/python2.7/dist-packages/h5py/_hl/group.py", line 49, in create_group
    gid = h5g.create(self.id, name, lcpl=lcpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2684)
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper (/tmp/pip-4rPeHA-build/h5py/_objects.c:2642)
  File "h5py/h5g.pyx", line 151, in h5py.h5g.create (/tmp/pip-4rPeHA-build/h5py/h5g.c:2756)
ValueError: Unable to create group (Incorrect cache entry type)
ERROR:tornado.access:500 GET / (xx.xx.xx.xx) 2.53ms

I tried other, smaller h5 files with no problem at all. I also used h5py to access the same 65GB h5 file locally with no problem.

jreadey commented 7 years ago

I tried h5serv out with an 85GB file. The first request to the server took a few seconds, but after that it was fine.

I see you are running Python 2.6, but I've only tested h5serv with 2.7 and above.

If you have docker installed (or are willing to install it), an easy way to deal with possible dependency issues is to run the h5serv docker image from Docker Hub: https://hub.docker.com/r/hdfgroup/h5serv/.

Just run:

  $ docker run -d -v <h5serv_install_location>/h5serv/data:/data -p 5000:5000 --name h5serv hdfgroup/h5serv

This command will pull down the docker image and start running the container as a daemon.

Once it's running, you can access it through the local port just as before.
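
For example, a quick smoke test against the containerized server, using the same host header as earlier in this thread (this assumes the file is present under the mounted data directory):

    $ curl -X GET -H "host:sift_foursquare_shinjuku.rich.hdfgroup.org" http://127.0.0.1:5000/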

Use:

 $ docker logs h5serv

to inspect the logs, and:

$ docker stop h5serv

to shut down the container.

deter3 commented 7 years ago

I tried the docker image and it did not work out with the 65GB file; other, smaller hdf5 files worked fine. I found 2 issues:

  1. I need to name the file with the extension .h5 instead of .hdf5, otherwise it won't recognize the file.
  2. I got the error "KeyError: 'Unable to open object (Wrong object header chunk signature)'". I did not specify the chunk size when I saved the arrays to the hdf5 file.

Even when I use the docker image, I still get the same issue: h5pyd.File waits for a response for a long time, just like the issue described in my first post. Meanwhile, I have no problem using h5py.File to access the 65GB hdf5 file locally.

Do you think I should save the arrays to the hdf5 file again with the chunk size specified to get h5serv working?
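
(For context, explicitly specifying a chunk shape when writing with h5py looks roughly like the sketch below; the file, group, and array names, and the chunk shape itself, are hypothetical.)

    import numpy as np
    import h5py

    arr = np.random.rand(100000, 128).astype("float32")   # stand-in for one feature array
    with h5py.File("features.h5", "w") as f:
        # explicit chunk shape; h5py can also pick one automatically with chunks=True
        f.create_dataset("group1/descriptors", data=arr, chunks=(4096, 128))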

jreadey commented 7 years ago

I wouldn't think it would matter if h5py is able to read the file.

Can you make the file available (say ftp or S3) for me to download? That would make it easier to investigate.

deter3 commented 7 years ago

I sent an email to jreadey@hdfgroup.org with the link already. Let me know if you have any problems.

jreadey commented 7 years ago

Downloading now. It will take a few hours.

jreadey commented 7 years ago

Now that I have the file, I can see what the problem is!

In the file there are 3469 groups and 10,881,629 datasets. Most of the HDF5 files I have seen have a much smaller number of relatively larger datasets. The reason this causes an issue with h5serv is that the first time it gets a request to access a file, it creates a special db group in the file that contains UUIDs for each group and dataset, so it needs to iterate through every object in the file.

I created a little utility script that lets you create the db index offline, so to speak: https://github.com/HDFGroup/h5serv/blob/develop/util/rebuildIndex.py. You can run this on the file, and then move it into the h5serv data directory once it has built the db group. The server will then see that the db group is already present and you'll get an immediate response for a request like: http://127.0.0.1:5000/?host=sift_foursquare_shinjuku.hdfgroup.org.
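
The workflow might look like the following. The exact command-line arguments rebuildIndex.py takes are an assumption here (check the script itself for its usage), and the __db__ sanity check simply reuses the group name that appears in the traceback earlier in this thread:

    # sketch only -- verify rebuildIndex.py's actual arguments before running
    $ python rebuildIndex.py sift_foursquare_shinjuku.h5

    # quick check that the index ("__db__" group) is now present in the file
    $ python -c "import h5py; f = h5py.File('sift_foursquare_shinjuku.h5', 'r'); print('__db__' in f)"

    # then move the indexed file into the h5serv data directory
    $ mv sift_foursquare_shinjuku.h5 <h5serv_install_location>/h5serv/data/rich/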

When I ran rebuildIndex.py on my PC (an i7 w/ 32GB RAM) it took 133 minutes to update the file. The file size increased by 12%.

In your intended application, will you be fetching all the links for a particular group? Since you have groups with 1000's of links, this will be a bit slow as well. In the REST API, the GET Links operation lets you paginate through a link collection: http://h5serv.readthedocs.io/en/latest/GroupOps/GET_Links.html. That should work a bit better than trying to grab all the links in one go.
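
A paginated request might look like the following; Limit and Marker are the pagination query parameters described in the GET_Links docs linked above (check that page for exact names), and the group UUID and last link name are placeholders you would fill in from a previous response:

    $ curl -X GET -H "host:sift_foursquare_shinjuku.rich.hdfgroup.org" \
        "http://127.0.0.1:5000/groups/<group_uuid>/links?Limit=100&Marker=<last_link_name>"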

deter3 commented 7 years ago

Thanks for the reply.

My application will need to get one link from a group at a time, or multiple links when multiprocessing. There's no requirement to fetch all the links in a particular group. But here are some problems I am having.

  1. I will need to update the hdf5 file on an irregular basis in the future, which means deleting all datasets in a group and rebuilding them. That means the db index will need to be updated manually every time.

My question is: is there any plan to implement automatic db index updates in the near future?

  2. If I load all my arrays into the hdf5 file, which would make it 100 times bigger than the current file, the number of datasets will be around 1 billion. Will there be any risk or performance issues with one hdf5 file at that size? Or I could divide the data into a couple of smaller files, which is possible, but again, manually updating the db index would be a big pain.
  3. Is it okay to do multiprocessing reads and writes on one hdf5 file using h5pyd and h5serv? I did not enable MPI when I compiled the hdf5 source.
  4. Or, if there is no plan to update the db index automatically, is there any other data store with a similar hierarchical data format and multiprocessing support? I only know that Zarr might have a similar hierarchical data format.

jreadey commented 7 years ago

If the file is being updated outside h5serv (i.e., an application using the HDF5 lib to create datasets, groups, etc.), you would need to rebuild the db index. The db index does update automatically if a new object is added through the REST API, but obviously it won't update if the file is modified directly.
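
For instance, creating objects through the server with h5pyd keeps the index current, since h5serv itself performs the update. A minimal sketch (the group and dataset names here are hypothetical, and the endpoint is the one used earlier in this thread):

    import h5pyd

    # open the existing domain through the server rather than touching the file directly
    f = h5pyd.File("sift_foursquare_shinjuku.rich.hdfgroup.org", mode="r+",
                   endpoint="http://127.0.0.1:5000")
    grp = f.create_group("new_venue")                       # hypothetical group name
    grp.create_dataset("descriptors", shape=(1000, 128), dtype="float32")
    f.close()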

For the other questions:

  1. I've seen HDF5 files in the 300GB range, but dealing with really large files can be problematic. For one, you need a filesystem large enough to hold the file! Also, in the unlikely event that the file becomes corrupted, you'd be forced to rebuild it from scratch. If there's a sensible way to divide your data into multiple files, that is probably best. Also, is it possible to restructure your data so that you can use fewer datasets? That would make the rebuild faster.
  2. Yes, you can use multiprocessing reads/writes with h5pyd and h5serv (a minimal sketch follows this list). Changes to the file are serialized on the backend, so this won't help much from a performance perspective. BTW, we're working now on a scalable REST server that will support true parallelism. I should have something to share in a few months.
  3. There seems to be an almost infinite number of different database systems. Neo4j works well for graph databases. And you could use something like MongoDB and just manage the hierarchy yourself.
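
Here is a sketch of multiprocess reads through h5serv via h5pyd; the endpoint and dataset paths are taken from earlier in this thread and are just placeholders, so substitute your own:

    import multiprocessing as mp
    import h5pyd

    ENDPOINT = "http://127.0.0.1:5000"
    DOMAIN = "sift_foursquare_shinjuku.rich.hdfgroup.org"

    def read_one(dataset_path):
        # each worker opens its own connection; nothing is shared between processes
        f = h5pyd.File(DOMAIN, mode="r", endpoint=ENDPOINT)
        try:
            return f[dataset_path][:]
        finally:
            f.close()

    if __name__ == "__main__":
        paths = ["/4b058799f964a5209c9b22e3/3204", "/4b058799f964a5209c9b22e3/3205"]
        pool = mp.Pool(processes=2)
        arrays = pool.map(read_one, paths)
        pool.close()
        pool.join()
        print([a.shape for a in arrays])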

Your project looks very interesting. Please keep us updated on how it goes.

jreadey commented 7 years ago

Closing the issue - re-open if you have more questions.