HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Incorrect domains on external links #22

Closed rayosborn closed 7 years ago

rayosborn commented 7 years ago

We have an h5serv server running, and loading regular HDF5 files works well. However, if the file contains an external link, it cannot access the external file because the file path is not converted to a valid domain.

Here's some output when accessing a file with path mullite/mullite_300K.nxs with respect to the h5serv datapath, with the root domain name of exfac (the config file sets the file extension to be .nxs on this server):

>>> b=h5pyd.File('mullite_300K.mullite.exfac',  mode='r',  endpoint='http://some.server:5000')
>>> c=b['/entry/transform/v']

KeyError                                  Traceback (most recent call last)
<ipython-input-5-610e75ab34f0> in <module>()
----> 1 c=b['/entry/transform/v']

/Users/rosborn/anaconda/envs/py27/lib/python2.7/site-packages/h5pyd/_hl/group.pyc in __getitem__(self, name)
    327             except IOError:
    328                 # unable to find external link
--> 329                 raise KeyError("Unable to open file: " + link_json['h5domain'])
    330             return f[link_json['h5path']]
    331 

KeyError: u'Unable to open file: 300K/transform.nxs' 

Presumably, h5pyd should convert the external file path to a valid domain string. In this case, the file path is relative to the parent HDF5 file. I'm not sure what the correct domain name would be if the file path were absolute.
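
For illustration, here is a minimal sketch of the conversion being asked for. It assumes the mapping is "drop the extension, reverse the path components, append the root domain", which is consistent with the domain names used in this thread. The helper names `path_to_domain` and `resolve_external_link` are hypothetical and are not part of h5pyd or h5serv.

```python
import posixpath


def path_to_domain(file_path, root_domain):
    """Map a path relative to the server data directory to a DNS-style domain.

    Hypothetical helper: assumes no directory or file stem contains a dot.
    """
    stem, _ext = posixpath.splitext(file_path)      # drop the file extension
    parts = [p for p in stem.split("/") if p]       # path components, top first
    return ".".join(reversed(parts)) + "." + root_domain


def resolve_external_link(parent_path, link_filename, root_domain):
    """Resolve a link filename relative to the parent file, then map it to a domain."""
    parent_dir = posixpath.dirname(parent_path)
    target = posixpath.normpath(posixpath.join(parent_dir, link_filename))
    return path_to_domain(target, root_domain)


print(path_to_domain("mullite/mullite_300K.nxs", "exfac"))
print(resolve_external_link("mullite/mullite_300K.nxs", "300K/transform.nxs", "exfac"))
```

Under these assumptions the parent file maps to `mullite_300K.mullite.exfac` and the relative link `300K/transform.nxs` maps to `transform.300K.mullite.exfac`.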

jreadey commented 7 years ago

I've implemented some fixes for this in h5serv - please update your h5serv repo and try it out. My change always returns an absolute DNS-style name. Relative filenames should be ok.
It should also work with absolute filenames in the link that point to the correct location in the server data directory. Again it will return a DNS name.

There are a bunch of edge cases, but I think this should be good for common usage.
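
As a rough sketch of the absolute-filename case described above (not the actual h5serv change), an absolute link target can be made relative to the server data directory before it is turned into a DNS-style name. The function name `absolute_link_to_domain` and the data directory `/data` are assumptions made up for this example.

```python
import posixpath


def absolute_link_to_domain(link_path, data_dir, root_domain):
    """Map an absolute link target inside data_dir to a DNS-style domain.

    Hypothetical helper: raises ValueError when the target is not a file
    path located under the server data directory.
    """
    rel = posixpath.relpath(posixpath.normpath(link_path),
                            posixpath.normpath(data_dir))
    if rel in (".", "..") or rel.startswith("../"):
        raise ValueError("link target is outside the server data directory")
    stem, _ext = posixpath.splitext(rel)            # drop the file extension
    return ".".join(reversed(stem.split("/"))) + "." + root_domain


print(absolute_link_to_domain("/data/mullite/300K/transform.nxs", "/data", "exfac"))
```

Rejecting targets outside the data directory is one possible way to handle that edge case; the server may treat it differently.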

rayosborn commented 7 years ago

It works, although it is tripped up if the external file has a different extension than the default server extension. In my example, the file mullite/mullite_300K.h5 has an external link to 300K/transform.nxs. With .h5 as the default extension, I get:

>>> import h5pyd as h5
>>> a=h5.File('mullite_300K.mullite.exfac', mode='r', endpoint='http://34.193.81.207:5000')
>>> a['/entry/transform/v']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-3-c6429c132039> in <module>()
----> 1 a['/entry/transform/v']

/Users/rosborn/anaconda/envs/py27/lib/python2.7/site-packages/h5pyd/_hl/group.pyc in __getitem__(self, name)
    349             except IOError:
    350                 # unable to find external link
--> 351                 raise KeyError("Unable to open file: " + link_json['h5domain'])
    352             return f[link_json['h5path']]
    353 

KeyError: u'Unable to open file: transform.nxs.300K.mullite.exfac'

So the non-default extension, .nxs, is being included in the domain name. If I rename the parent file to mullite/mullite_300K.nxs and restart the server with .nxs as the default extension, i.e., so that it matches the extension of the linked file, then it works:

>>> a['/entry/transform/v']
<HDF5 dataset "v": shape (801, 901, 901), type "<f4">
>>> a.filename
 'mullite_300K.mullite.exfac'
>>> a['/entry/transform/v'].file
<HDF5 file "transform.300K.mullite.exfac" (mode r)>
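
The two results above are consistent with a conversion that strips only the server's default extension before building the domain. The sketch below reproduces both domain strings under that assumption; it is a guess at the behavior, not the h5serv source, and `strip_default_extension` and `to_domain` are hypothetical names.

```python
def strip_default_extension(filename, default_ext):
    """Remove default_ext from the end of filename, if present."""
    if filename.endswith(default_ext):
        return filename[:-len(default_ext)]
    return filename


def to_domain(rel_path, root_domain, default_ext):
    """Build a DNS-style domain, stripping only the server's default extension."""
    parts = strip_default_extension(rel_path, default_ext).split("/")
    return ".".join(reversed(parts)) + "." + root_domain


# Default extension .h5: the .nxs suffix survives into the domain name.
print(to_domain("mullite/300K/transform.nxs", "exfac", ".h5"))
# Default extension .nxs: the suffix is stripped and the domain resolves.
print(to_domain("mullite/300K/transform.nxs", "exfac", ".nxs"))
```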

rayosborn commented 7 years ago

There seems to be another problem. If a group contains an externally linked dataset and I call the group's items() method, the external link does not get resolved in the returned value, even though it does get resolved when referencing the dataset directly.

>>> a=h5.File('mullite_300K.mullite.exfac', mode='r', endpoint='http://34.193.81.207:5000')
>>> a['/entry/transform'].items()
[(u'Qk', <HDF5 dataset "Qk": shape (901,), type "<f8">),
 (u'Ql', <HDF5 dataset "Ql": shape (801,), type "<f8">),
 (u'Qh', <HDF5 dataset "Qh": shape (901,), type "<f8">),
 (u'v', None)] 
>>> a['/entry/transform/v']
<HDF5 dataset "v": shape (801, 901, 901), type "<f4">
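
Until the server resolves external links in the bulk listing, one possible client-side workaround is to index each member individually, since direct lookup does resolve the link. The sketch below uses a hypothetical `FakeGroup` stand-in to mimic the behavior reported here so that it runs without a server; `resolved_items` is likewise an illustrative name, not an h5pyd function.

```python
class FakeGroup:
    """Hypothetical stand-in for a group whose items() leaves external links as None."""

    def __init__(self, local, external):
        self._local = dict(local)
        self._external = dict(external)

    def keys(self):
        return list(self._local) + list(self._external)

    def __getitem__(self, name):
        # Direct lookup resolves both local members and external links.
        if name in self._local:
            return self._local[name]
        return self._external[name]

    def items(self):
        # Bulk listing: external links come back unresolved (None).
        return [(name, self._local.get(name)) for name in self.keys()]


def resolved_items(group):
    """Build the (name, object) pairs by indexing each member individually."""
    return [(name, group[name]) for name in group.keys()]


g = FakeGroup({"Qk": "Qk-dataset"}, {"v": "v-dataset"})
print(g.items())
print(resolved_items(g))
```

The cost of this approach is one lookup per member instead of a single listing request.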

jreadey commented 7 years ago

Ah, it looks like I did the transform for the link operation, but not for the links one. I've fixed that now; please update your h5serv repo.

rayosborn commented 7 years ago

The items function now returns all the values, including the external links. Fixing this has uncovered another possible inconsistency with h5py, which I will post as another issue. Thanks for all the work.

jreadey commented 7 years ago

Good. I'll close this one then.