HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0
126 stars 52 forks source link

Implement recurse until hard link by h5path #224

Closed mattjala closed 1 year ago

mattjala commented 1 year ago

Implemented the 'recurse_until_hard' option for clients which, when used with h5path, tells the server to keep following symbolic links until it finds a hard link (e.g. an actual object). This allows a client to get any object by path in a single request.

This adds support for external link traversal in getObjectIdByPath.

This is backwards-compatible with old clients.

On the icesat2-benchmark, this decreased the time taken from ~4 seconds (with the link traversal changei n #219) to ~3.6 seconds.

mattjala commented 1 year ago

Have the SN code validate any query params before passing on to the DN. For booleans, I've been passing 1 in the query param and then on the receiving end, just doing: if "x" in params and params["x"]:....

Done, the sn now checks for the existence of recurse_until_hard and disable_type_check before passing them on.

mattjala commented 1 year ago

I find the params arg in getObjectJson a little odd. Why not just add any thing you want as keys in params as named parameters in the function?

It seems like it's a little redundant to 1. Unwrap the individual parameters from a request 2. Pass them as individual arguments to getObjectJson, then 3. Rewrap them into another params object for the call to http_get, especially since getObjectJson doesn't do anything itself with fields like include_links or include_attrs.

The only benefit I can see to using the named parameters is that it's a little more legible about what parameters it's prepared to pass on.

jreadey commented 1 year ago

More legible is a plus! Anyway, I'm confused why getObjectJson is calling respJsonAssemble... Is the latter just repacking the json with a check on nan values?

mattjala commented 1 year ago

respJsonAssemble was a bit of a hacky workaround to do some work on the response that was originally done elsewhere.

Specifically, a client using recurse_until_hard without knowing the type is always sending their request under /groups/, and so it gets routed through group_sn. This means that some dataset specific work that was previously done in dset_sn (right here) had to be moved so that it could still get done before the response was sent back.

jreadey commented 1 year ago

getObjJson is used in a lot of other places other than returning from a GET request, so I'd leave that as is. How about just adding a function to util/httpUtil.py that could be called by group_sn.py, dset_sn.py or ctype_sn.py just before they need to return a response? Just realized the clients are making requests to /groups/ and getting a dataset back. That's a bit confusing - how are they expected to know what type of object they got back? The hrefs in the response would need to be adjusted (e.g. "self" wouldn't be /groups/ if it's actually a dataset.

No test cases?

mattjala commented 1 year ago

Just realized the clients are making requests to /groups/ and getting a dataset back. That's a bit confusing - how are they expected to know what type of object they got back?

The URI, at least for objects in HSDS, contains a prefix indicating its type (d, t, g, c), and I had the REST VOL check that to determine the type of object returned. Before, knowing the path but not the type would mean the client needed to spend an entire request learning the type before making the actual request to the correct header.

getObjJson is used in a lot of other places other than returning from a GET request, so I'd leave that as is. How about just adding a function to util/httpUtil.py that could be called by group_sn.py, dset_sn.py or ctype_sn.py just before they need to return a response?

This should be possible, but it would still involve a pretty similar process of checking the URI to determine type.

The hrefs in the response would need to be adjusted (e.g. "self" wouldn't be /groups/ if it's actually a dataset.

No test cases?

I'll make these changes, add some tests, and update the PR

mattjala commented 1 year ago

I originally made recurse_until_hard a request parameter in order to preserve behavior where, by default, you could use something like ?h5path=group1/group1.1/link to get info about the link itself, whether it was hard or symbolic. But I realized that was inconsistent with how HSDS works for hard links - if a hard link was specified by h5path, it returned information about the actual object, not the link itself.

So the "default" behavior was inconsistent between link types, and probably less useful to users. So I decided to make the recurse_until_hard behavior just how HSDS always operates. This is still fully backwards compatible, since it is a pure expansion of what HSDS can support.

jreadey commented 1 year ago

@mattjala - Re: recurseUntilHard - isn't it just: GET /groups//links/ that gets info about the link itself? I thought GET /groups/?h5path=/foo/bar would always return the object pointed to by /foo/bar.

Anyway, I guess the larger issue for the client is that you have a h5path but don't know a priori if it references a dataset or group (or datatype). How about you add a handler to: GET /?h5path= (in domain_sn.py) that would return JSON for whichever object referenced by path?

Additional query params would be: include_attrs, follow_slinks, follow_extlinks.

It might be a good idea for the JSON response to include a "class" key to indicate what type of object is being returned.

jreadey commented 1 year ago

Ok, I see. Still I like "parent_id" better as a name the "search_root_object". The later sounds more like a verb.