HDFGroup / hsds

Cloud-native, service-based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

GET_Value returns 0 for numeric scalar datasets #100

Closed: loichuder closed this issue 2 years ago

loichuder commented 3 years ago

I am trying to serve a simple HDF5 file containing scalars of different types with HSDS (POSIX storage):

with h5py.File("scalars.h5", "w") as h5file:
    h5file["int"] = 100
    h5file["float"] = 0.5
    h5file["string"] = "This"
    h5file["bool"] = True

Requesting the values of the string and bool datasets yields the correct values. However, when requesting the values of int or float, the request succeeds but I get 0 instead of the actual value.
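For reference, the request that returns 0 is roughly of this form (the host, port, credentials and domain path are placeholders for my setup; the dataset id is taken from the logs below):

$ curl -u myuser:mypass \
    "http://localhost:5101/datasets/d-184fe88e-e05d36be-6839-3d1b68-8fe306/value?domain=/home/myuser/scalars.h5"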

The service node logs indicate that the chunk is not found:

WARN> Object: http://.../chunks/c-184fe88e-e05d36be-6839-3d1b68-8fe306_0 not found

But the request itself goes through just fine (see the following logs):

INFO> chunk_arr shape: (1,)
INFO> data_sel: (slice(0, 1, 1),)
INFO> read_chunk_hyperslab gather complete, arr shape: (1,)
DEBUG> GET Value - returning JSON data
INFO RSP> <200> (OK): /datasets/d-184fe88e-e05d36be-6839-3d1b68-8fe306/value
Additional chunk debug logs:
DEBUG> isExtensible - dims: [1] maxdims: [1]
DEBUG> getSliceQueryParam: 0, 1
DEBUG> start: 0, stop: 1, step: 1
DEBUG> dim query[0] returning: start:0 stop:1 step:1
DEBUG> GET Value selection: (slice(0, 1, 1),)
DEBUG> selection shape:[1]
DEBUG> selection num elements: 1
DEBUG> num_chunks: 1
DEBUG> not using serverless for read on 1 chunks - nonstrict not specified
DEBUG> getChunkIds - selection: (slice(0, 1, 1),)
DEBUG> getChunkIds - layout: [1]
DEBUG> skip getChunkInfoMap for layout class: H5D_CHUNKED
DEBUG> chunkinfo_map: None
INFO> doHyperSlabRead - number of chunk_ids: 1
DEBUG> doHyperSlabRead - chunk_ids: ['c-184fe88e-e05d36be-6839-3d1b68-8fe306_0']
jreadey commented 2 years ago

Hey, sorry for the delay in responding to this.

It looks like something wonky is going on with implicit dataset creation - I'll look into it.

If I use the create_dataset method, I get the expected result. I ran this code:

with h5pyd.File(filepath, "w") as h5file:
    print(f"root id: {h5file.id.id}")
    h5file.create_dataset("int", data=100)
    h5file.create_dataset("float", data=0.5)
    h5file.create_dataset("string", data=b"This"). # using bytes value for fixed-length type
    h5file.create_dataset("bool", data=True)

Then I bounced the server (to verify everything got written to storage) and ran this:

with h5pyd.File(filepath, "r") as h5file:
    print(f"root id: {h5file.id.id}")
    dset_int = h5file["int"]
    x = dset_int[()]
    print(f"int: {x}")
    dset_float = h5file["float"]
    x = dset_float[()]
    print(f"float: {x}")
    dset_str = h5file["string"]
    x = dset_str[...]
    print(f"string: {x}")
    dset_bool = h5file["bool"]
    x = dset_bool[()]
    print(f"bool: {x}")

and got the following output:

 int: 100
 float: 0.5
 string: b'This' 
 bool: True

So that looks correct.

Strange that you got the chunk not found error. In general the HDF REST API doesn't return a 400 error when a chunk doesn't exist. Just like with h5py, if you read a chunk that hasn't been written to, you'll get zeros (or the fill value if that is set).
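For example, with plain h5py an allocated-but-unwritten dataset reads back as zeros, or as the fill value if one was set (a quick sketch; the file name is arbitrary):

import h5py

with h5py.File("fill_demo.h5", "w") as f:
    # allocated but never written: reads back as zeros
    empty = f.create_dataset("empty", shape=(4,), dtype="i4")
    print(empty[...])    # [0 0 0 0]
    # with an explicit fill value, that value is returned instead
    filled = f.create_dataset("filled", shape=(4,), dtype="i4", fillvalue=-1)
    print(filled[...])   # [-1 -1 -1 -1]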

But when writing to a dataset, you should see the DN node write out each chunk. If you grep the dn logs for "s3sync", you'll see the flow as objects get written.
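Something along these lines (the log file names and locations depend on how you run HSDS, so adjust as needed):

$ grep s3sync dn1.log    # or: docker logs <dn-container> 2>&1 | grep s3sync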

Also, since you are using POSIX storage, you can verify the contents just by listing the dataset directory for that file. E.g., I can dump the contents using the first part of the file's root id (/Users/john/mybuckets is the ROOT_DIR and hsdstest is the BUCKET_NAME for my setup):

$ find /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4 -type f -exec ls -l {} \;
-rw-r--r--  1 john  staff  655 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/.group.json
-rw-r--r--  1 john  staff  1082 Sep  2 12:32 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/.info.json
-rw-r--r--  1 john  staff  380 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/c00f-6add24-255138/.dataset.json
-rw-r--r--  1 john  staff  4 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/c00f-6add24-255138/0
-rw-r--r--  1 john  staff  401 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/9858-b25f0b-527c2d/.dataset.json
-rw-r--r--  1 john  staff  1 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/9858-b25f0b-527c2d/0
-rw-r--r--  1 john  staff  334 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/376c-adba71-a6cce1/.dataset.json
-rw-r--r--  1 john  staff  8 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/376c-adba71-a6cce1/0
-rw-r--r--  1 john  staff  335 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/9aec-6bbc25-a89027/.dataset.json
-rw-r--r--  1 john  staff  8 Sep  2 12:31 /Users/john/mybuckets/hsdstest/db/01157d5c-dffe82f4/d/9aec-6bbc25-a89027/0

Let me know if you get the expected results with the modified file. I'll update h5pyd to do the right thing for your original code.

jreadey commented 2 years ago

I've got a fix for implicit dataset creation: https://github.com/HDFGroup/h5pyd/commit/eaeb9d6b8bd83c7490cea6fc6ad9af3d03c56f14. It was just one of those h5py features that got marked with "todo" but never got implemented.

You'll need to build h5pyd from the jreadey-work branch for now to get the fix. No HSDS changes were needed.
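In case it helps, installing directly from that branch with pip should work (standard pip git syntax):

$ pip install git+https://github.com/HDFGroup/h5pyd.git@jreadey-work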

@loichuder, if you get a chance to try it, please let me know if that resolved the issue.

loichuder commented 2 years ago

First, a quick note: to get the jreadey-work branch working, I had to install/update some dependencies by hand (more on that below).

I then reloaded my scalars.h5 file with the updated hsload --link. This worked but did not fix the issue: GET_Value still gives me 0 for int and float.

The SN still gives me the same logs.
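For completeness, the reload was of this form (the target domain path here is just a placeholder for my setup):

$ hsload --link scalars.h5 /home/myuser/scalars.h5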

loichuder commented 2 years ago

Just realized I skimmed your previous comment, sorry!

Indeed, using this piece of code

with h5pyd.File(filepath, "w") as h5file:
    print(f"root id: {h5file.id.id}")
    h5file.create_dataset("int", data=100)
    h5file.create_dataset("float", data=0.5)
    h5file.create_dataset("string", data=b"This"). # using bytes value for fixed-length type
    h5file.create_dataset("bool", data=True)

works fine: I get the expected values for int and float.

> Also, since you are using POSIX storage, you can verify the contents just by listing the dataset directory for that file.

Good call! I had a look: the 0 chunk files are missing for the int and float datasets, while they are present for the bool and string datasets. When creating the file with h5pyd, the 0 files are present for all datasets.

This explains why the chunk is not found when the file/domain was created using hsload --link and why I get the placeholder value 0 instead.
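(For anyone checking the same thing: listing only the chunk objects makes the missing ones obvious. Adjust the ROOT_DIR, BUCKET_NAME and root-id prefix for your setup.)

$ find /path/to/root_dir/bucket_name/db/<root-id-prefix> -type f -name 0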

jreadey commented 2 years ago

Hey @loichuder - Are you sure you are running with the h5pyd changes from the commit? In looking at this issue I realized that implicit dataset creation (i.e. assigning an ndarray or native type to a group key without a create_dataset call) wasn't working, and I addressed that in the commit. Now when I run:

with h5pyd.File(filepath, "w") as h5file:
    print(f"root id: {h5file.id.id}")
    h5file["int"] = 100
    h5file["float"] = 0.5
    h5file["string"] = b"This"
    h5file["bool"] = True

the read program above works as expected:

$ python read.py
root id: g-e465276d-89a1deb0-6ced-afe501-295638
int: 100
float: 0.5
string: b'This'
bool: True

Thanks for pointing out the requests_unixsocket import error - I've added that to setup.py and pushed changes to jreadey-work (along with some pyflakes fixes).

jreadey commented 2 years ago

Also, about the requests_unixsocket/requests/urllib3 imports...

Installing requests_unixsocket should pull in requests, and requests should grab urllib3. So I thought it better to let these packages manage the required versions rather than doing it in the h5pyd setup.

If I pip install requests_unixsocket in a clean environment, I get:

requests-unixsocket==0.2.0
requests==2.26.0
urllib3==1.26.6

Which looks right.

But maybe this breaks if you already have requests or urllib3 installed?

The boto3 dependency should have been in the setup previously.

loichuder commented 2 years ago

> Now when I run:
>
>     with h5pyd.File(filepath, "w") as h5file:
>         print(f"root id: {h5file.id.id}")
>         h5file["int"] = 100
>         h5file["float"] = 0.5
>         h5file["string"] = b"This"
>         h5file["bool"] = True
>
> the read program above works as expected:

Yes, using h5pyd directly works fine for me as well.

The issue is in fact present when I use h5py to create the file and then use h5pyd through hsload --link to load the file in HSDS.

> Installing requests_unixsocket should pull in requests, and requests should grab urllib3. So I thought it better to let these packages manage the required versions rather than doing it in the h5pyd setup.

That would work if you were bound to the same requirements as these packages. But h5pyd is using features only available in urllib3>=1.26 while requests_unixsocket has a more relaxed urllib3>=1.8 requirement.

In my opinion, the requirement should be made explicit by adding urllib3>=1.26 to h5pyd and using urllib3 directly rather than requests.packages.urllib3.
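Concretely, I mean something along these lines (just a sketch, not an actual patch):

# in setup.py: declare the version h5pyd actually relies on
install_requires = [
    "requests_unixsocket",
    "urllib3>=1.26",
]

# and in the code, import urllib3 directly
import urllib3  # instead of: from requests.packages import urllib3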

requests_unixsocket was in the same situation: it already had the requests requirement (which pulls in urllib3), but it still added an explicit urllib3 requirement when it started using a feature from a particular version.

But that is up to you in the end :smile:

jreadey commented 2 years ago

Closing the issue - I'll need to revisit the setup issue soon. I'm thinking of combining the hsds and h5pyd repos and having a combined setup.py.

loichuder commented 2 years ago

The issue is still present when I use h5py to create the file and then use h5pyd through hsload --link to load the file in HSDS.

jreadey commented 2 years ago

That's odd - I'll take a look.

jreadey commented 2 years ago

That's a bug - the chunk layout with the link info was getting ignored for scalar datasets. I put a fix in here: https://github.com/HDFGroup/h5pyd/commit/9a0af51cad0f5cda9af3a2ad281ef7c5a44069cf. With this change, scalar datasets won't get linked back to the HDF5 file but just copied. Since they are presumably small in size this should not be an issue.

loichuder commented 2 years ago

Thanks for the great work: https://github.com/HDFGroup/h5pyd/commit/9a0af51cad0f5cda9af3a2ad281ef7c5a44069cf fixes the problem :slightly_smiling_face: :+1:

> Since they are presumably small in size this should be an issue.

I think you meant this should not be an issue :smile:

jreadey commented 2 years ago

Yes, that's what I meant!