HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Enable hsload to work with Posix files #78

Closed jreadey closed 4 years ago

jreadey commented 4 years ago

hsload has an s3link option that enables ingest to just copy the metadata of an HDF5 file and use S3 range gets to access the chunks from the original HDF5 file (stored as an S3 object) as needed (see https://github.com/HDFGroup/hsds/blob/master/docs/design/single_object/SingleObject.md).
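
For context, a "range get" is just an HTTP byte-range read against the stored HDF5 object, e.g. something like the request below (bucket, key, and byte offsets are placeholders, and a private object would need credentials or a presigned URL):

$ curl -H "Range: bytes=2048-4095" https://mybucket.s3.amazonaws.com/path/to/myfile.h5 -o chunk.bin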

Now that HSDS supports POSIX files, it would be nice if the same approach could be used in the POSIX case (HSDS metadata referring to chunks stored in traditional HDF5 files).

jreadey commented 4 years ago

@loichuder - I'm working on this now. Thought it would be nice to have an issue to track this so you can provide feedback on how it works with your file collection.

loichuder commented 4 years ago

Thanks! I will be glad to try it out once you have something working.

jreadey commented 4 years ago

I have the changes checked in now. You'll need to get the latest HSDS from master as well as the latest h5pyd code.

The syntax for linking to a traditional HDF5 file (just the metadata is stored in the HSDS format) is:

$ hsload --link Path_to_HDF5_file /home/myuser/mylinkedfile.h5

where Path_to_HDF5_file is a relative path to the traditional HDF5 file and /home/myuser/mylinkedfile.h5 is the target domain that will be created. (If you just give a folder, a domain will be created in that folder with the basename of your link file.)
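
For example (file and folder names here are made up; I'm writing the folder form with a trailing slash):

$ hsload --link data/myfile.h5 /home/myuser/mylinkedfile.h5
$ hsload --link data/myfile.h5 /home/myuser/myfolder/     # creates /home/myuser/myfolder/myfile.h5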

Two tricky points -

@loichuder - if you can try this out with your files, let me know how it goes.

loichuder commented 4 years ago

@jreadey I managed to make it work on my local instance. Thanks for the hard work!

One question: I am unsure of how h5pyd interacts with h5py. h5py is not in the requirements of the package, yet it still seems to be needed?

Now, is implementing a link to the values the next step? Would this be difficult to do?

jreadey commented 4 years ago

h5pyd only needs h5py when loading HDF5 files, so I've not made h5py a hard dependency.
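
In practice that means h5pyd by itself is enough for clients that only talk to HSDS, and you add h5py when you want hsload to read local HDF5 files; a sketch of the installs (package names as published on PyPI):

$ pip install h5pyd     # REST client only
$ pip install h5py      # additionally needed by hsload for local HDF5 files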

When you do a hsload with the --link option, you should be able to see the data values. Is this not working?
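
One quick check is to request a dataset's values straight from the REST API; a rough sketch with placeholder endpoint, credentials, and dataset id (this assumes the domain can be passed as a query parameter):

$ curl -u myuser:mypasswd "http://hsds.example.com/datasets/<dataset_id>/value?domain=/home/myuser/mylinkedfile.h5"

A 200 response with data means the chunk references are being resolved.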

loichuder commented 4 years ago

Ok I see.

No, my requests to /datasets/{id}/value give 404 errors, but I did not dig into it since you said in your previous comment:

just the metadata is stored in the HSDS format

jreadey commented 4 years ago

@loichuder - if you look at the .dataset.json file, what do you see? The layout class should be some type of H5D_XXX_REF. See: https://github.com/HDFGroup/hsds/blob/master/docs/design/single_object/SingleObject.md.
Do the other values in the .json file seem reasonable?
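
If it helps, on a POSIX setup you can locate and pretty-print those objects straight from the storage directory; a rough sketch, assuming root_dir is /mnt/data (substitute the path that find prints for your dataset):

$ find /mnt/data -name .dataset.json
$ python -m json.tool "<one_of_the_paths_printed_above>"

The layout section of that JSON should show the H5D_*_REF class and the file_uri it points at.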

Also, it might be helpful to review the DN logs to see what happens during a GET value request.
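
A quick way to watch that is to tail the data node container's logs while issuing the request; a sketch assuming a docker deployment where the DN container name contains "dn" (check docker ps for the actual name):

$ docker ps --format '{{.Names}}' | grep dn
$ docker logs -f <dn_container_name>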

loichuder commented 4 years ago

About the layout:

In the DN logs, I get WARN> chunk c-724e7440-66d102a7-1544-3bc806-5b3793_<dim>_0_0 not found 50 times (the number of elements along the first axis?), while in the SN logs I get what are probably the corresponding warnings: WARN> s3path:<the correct file_uri> for S3 range get not found, also 50 times.

I also set the file_uri using parent directory traversal with '..'. Could this be a problem?

loichuder commented 4 years ago

@jreadey: any thoughts given my previous comment?

jreadey commented 4 years ago

@loichuder - sorry I missed your comment!

The dims in the layout refer to the chunk dimensions, not the dataset dimensions. To enable parallelism, hsload creates multiple chunks even when the target dataset is contiguous. So in this case there are 50 chunks, each chunk covering the last two dimensions (e.g. a dataset of shape (50, m, n) is exposed as 50 chunks of shape (1, m, n)). The chunks happen to be adjacent in the file, but that shouldn't matter to the service.

The '..' is likely causing problems. Try using the expanded path.

loichuder commented 4 years ago

No problem.

What do you mean by "expanded path"? I have to supply a path relative to the root_dir, right?

jreadey commented 4 years ago

Right. E.g. if root_dir is /mnt/data/ and your HDF5 file is at /mnt/data/hdf5/myfile.h5, use hdf5/myfile.h5 as the path (and you need to run this from the /mnt/data/ directory).
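
So, continuing that example, the load would be run from the root_dir with the relative path (the target domain name here is the made-up one from earlier):

$ cd /mnt/data
$ hsload --link hdf5/myfile.h5 /home/myuser/mylinkedfile.h5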

loichuder commented 4 years ago

Ok, this is what I thought. I tried this, but it's the same as before: I can only access metadata; requests for values give a 500 response code...

Really odd :thinking:

jreadey commented 4 years ago

Does it fail this way with all files or just certain ones? If the latter, and you can put the file someplace I can access, I can try it myself.

Also, if you run with --loglevel debug, you may get some useful information.

loichuder commented 4 years ago

I get the same behaviour with all files: I can access links and metadata but not the values themselves.

Here is the log of the hsload for a file named checkboard containing a single (100 x 100) dataset named dataset: log_load_checkboard.txt

Apparently, create_dataset is never called, so the links to the dataset values are never created.

EDIT: I was mistaken: the creation is indeed done, which is why I can access the metadata/links. It is the resolution of the links to the dataset values that fails.

jreadey commented 4 years ago

That's strange... I've added some additional logging to utillib.py and fixed some spurious warnings that were showing up. If you try running with the latest changes, that may be helpful.

For reference, here's what I'm doing to verify (I'm using the docker image hdfgroup/hdf5lib:1.10.6 to get the updated version of hdf5lib and h5py):

$ docker run --rm -v ~/.hscfg:/root/.hscfg  -v /mnt/data:/data -it hdfgroup/hdf5lib:1.10.6 bash
# hsinfo
server name: hsdstest
server state: READY
endpoint: http://192.168.1.100
username: test_user1
password: ****
home: /home/test_user1/
server version: 0.6_beta
node count: 1
up: 4 hours 14 min 12 sec
h5pyd version: 0.7.1
# cd /data
#  hsload --loglevel debug --logfile hsload.log --link hdf5/tall.h5 /home/test_user1/test/tall.h5
# exit
$ hsls -r  /home/test_user1/test/tall.h5
/ Group
/g1 Group
/g1/g1.1 Group
/g1/g1.1/dset1.1.1 Dataset {10, 10}
/g1/g1.1/dset1.1.2 Dataset {20}
/g1/g1.2 Group
/g1/g1.2/g1.2.1 Group
/g1/g1.2/g1.2.1/slink    SoftLink {somevalue}
/g1/g1.2/extlink         ExternalLink {somepath//somefile}
/g2 Group
/g2/dset2.1 Dataset {10}
/g2/dset2.2 Dataset {3, 5}
$ hsinfo  /home/test_user1/test/tall.h5
domain: /home/test_user1/test/tall.h5
owner:           test_user1
id:              g-fdd99e42-b4804654-7551-16ca3c-08cedc
last modified:   2020-04-21 23:54:40
total_size:      5809
allocated_bytes: 0
num objects:     10
num chunks:      0

For context, on the host I've set ROOT_DIR to /mnt/data/.
My HSDS BUCKET_NAME is hsds.test (so it maps to /mnt/data/hsds.test/). There's a directory /mnt/data/hdf5 with a regular HDF5 file, tall.h5. For docker run, I'm mounting /mnt/data to /data and my credentials (~/.hscfg) to /root/.hscfg. You'll notice that hsinfo reports 0 chunks for the domain, but if you run h5pyd/examples/read_example.py, you can see that the dataset values show up correctly.
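
Laid out on disk, that setup looks roughly like this (output trimmed to the relevant entries; whatever HSDS has written under hsds.test/ is excluded):

$ find /mnt/data -maxdepth 2 -not -path '*/hsds.test/*'
/mnt/data
/mnt/data/hsds.test
/mnt/data/hdf5
/mnt/data/hdf5/tall.h5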

I pushed an updated hdfgroup/hdf5lib:1.10.6 to docker hub. Give that a try and see if you get the same results.

loichuder commented 4 years ago

Still no luck on my local instance so I will try with the docker image.

Do you have a reference on the format of .hscfg?

loichuder commented 4 years ago

Nevermind, I used hsconfigure to generate it.
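
For reference, the file hsconfigure writes is a small key = value text file, roughly like this (endpoint and credentials below are placeholders):

$ cat ~/.hscfg
hs_endpoint = http://hsds.example.com
hs_username = myuser
hs_password = mypasswd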

I tried with the docker image and the loading runs fine (log). hsinfo and hsls give the same results as yours. But I still cannot access the data: I still get Error 500, which makes read_example.py and my requests fail.

The DN logs show: ERROR> s3path is invalid: tall.h5

loichuder commented 4 years ago

Update: it works! :tada:

For some reason, I cannot access the values when the file is in the root_dir itself (root_dir/tall.h5 does not work)! I had to put the file in a folder as you did (root_dir/hdf5/tall.h5) for the access to values to work. And it also works when I do the load without the docker image!

jreadey commented 4 years ago

Great!

The reason root_dir/tall.h5 doesn't work is that HSDS divides the location into a bucket and a path. For AWS S3, buckets are how Amazon organizes its storage system (you can think of them as file volumes), and the path is just the location within the bucket.

For the HSDS POSIX setup, root_dir defines the parent directory of all accessible buckets, i.e. any directory in root_dir can be accessed from HSDS via its bucket name. root_dir/tall.h5 doesn't work in your case since there's no bucket name to assign, whereas root_dir/hdf5/tall.h5 maps to "hdf5" as the bucket name and /tall.h5 as the path.
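
To make the mapping concrete, using the directories from earlier in the thread:

root_dir         /mnt/data/
HDF5 file        /mnt/data/hdf5/tall.h5
bucket           hdf5
path in bucket   /tall.h5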

jreadey commented 4 years ago

Closing this issue since it looks like things are working.