HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0
128 stars 52 forks source link

GET on datasets created with `fillvalue=np.NaN` give invalid JSON #87

Closed loichuder closed 3 years ago

loichuder commented 3 years ago

Say a dataset was created by h5py in the following fashion:

h5file.create_dataset("name", shape=[5], fillvalue=np.NaN)

The dataset info created by GET_Dataset produces then a JSON with a field: "creationProperties": {"fillValue": NaN}.

Unquoted NaN is not supported by the JSON spec so this cannot be parsed:

JSON.parse("{\"fillValue\": NaN}")
// Uncaught SyntaxError: Unexpected token N in JSON at position 14

The presence of this unsupported value is explained the JSON serialization which is made by aiohttp.web.json_response which in turn use json.dumps. The default behaviour of json.dumps is to allow unquoted NaN.

A possible way to solve this would be to parametrize aiohttp.web.json_response to quote NaN or to send null instead:

JSON.parse("{\"fillValue\": \"NaN\"}")
// { fillValue: 'NaN' }

Edit: This issue also pops when getting data that contains NaN

jreadey commented 3 years ago

I've added a query option "ignore_nan" for GET values that will return null rather than NaN in the json response. Try out the nanvalue branch. Commit is here: https://github.com/HDFGroup/hsds/commit/66a0dc0bef166eb78149b952a601171222528d82.

Haven't added support for sql query operations or attribute data yet, but let me know if this resolves the immediate issue.

BTW - Is there a mean to use binary responses from javascript? (guess you would need the javascript equivalent of numpy). Binary vs JSON would be more efficient esp. for larger responses.

loichuder commented 3 years ago

I've added a query option "ignore_nan" for GET values that will return null rather than NaN in the json response. Try out the nanvalue branch. Commit is here: 66a0dc0.

Haven't added support for sql query operations or attribute data yet, but let me know if this resolves the immediate issue.

Yup, it fixes the issue for NaN/Infinity in dataset data. However, as you said, NaN in attribute data or in fillValue (which get serialized in the metadata) will still produce invalid JSON.

BTW - Is there a mean to use binary responses from javascript? (guess you would need the javascript equivalent of numpy). Binary vs JSON would be more efficient esp. for larger responses.

This is something that we are actively exploring right now as it would indeed be more efficient and solve such issues.

For the record, in h5web, we are using ndarray as the "JS numpy" which make projects like js-numpy-parser especially interesting to us.

jreadey commented 3 years ago

I've updated the branch to return the string "nan" in the fillValue for GET dataset. As with values, you'll need to add "ignore_nan" in the query option. See the commit here: https://github.com/HDFGroup/hsds/commit/641eb6b59892c2bdf2ef7e196116e2fd74dd5c31.

I'll look at attribute support next.

jreadey commented 3 years ago

And added attribute support here: https://github.com/HDFGroup/hsds/commit/e8441629fabe3ac83c6c7ff0a778e33a75566f30.

I think this should be it for supporting possible NaN data responses.

jreadey commented 3 years ago

Merged to master. Will close issue - please reopen if you run into any problems.

loichuder commented 3 years ago

Sorry for not following this up earlier. It works fine :+1: !

The only missing piece would be that the JSON is still invalid when fillValue is set to (-)Infinity.

To be completely upfront, we changed our approach on h5web: we search for NaN/Infinity in the text response to replace them by their string equivalent before JSON deserialization. This allows to have valid JSON and still distinguish between NaN and Infinity, which is something we cannot do with ignore_nan as it converts both to null.