HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Allow for appending to existing hsds "files" using hsload #86

Closed MRossol closed 4 years ago

MRossol commented 4 years ago

- Change mode of fin from "x" to "a"
- In obj_creation_handler, check if obj exists before creating
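
A minimal sketch of the intended change (hypothetical names and path; it assumes the creation helper can see the target domain):

# Hypothetical sketch: open the target domain in append mode instead of
# exclusive-create mode, and have the creation helper skip objects that
# already exist there.
import h5pyd

fout = h5pyd.File("/home/test_user1/data.h5", mode="a")  # was mode="x"

def object_create_helper(name, obj):
    if name in fout:
        return  # object already exists in the target domain; skip re-creation
    # ... create the group/dataset in fout as before ...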

jreadey commented 4 years ago

Interesting -- I hadn't considered an append version of hsload!

I'd suggest adding an "--append" command line flag so that users don't inadvertently append to a domain when they didn't mean to.

What about root attributes (see: line 626 in utillib.py)? If there's a collision with an existing attribute, then you should skip and output a warning message like you do with links.
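
A minimal sketch of such a flag (hypothetical; hsload parses sys.argv by hand rather than using argparse, so the real integration will differ):

# Hypothetical sketch of an "--append" flag using argparse; the actual
# hsload command parses sys.argv manually, so the real change differs.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--append", action="store_true", default=False,
                    help="append to an existing domain instead of failing")
args = parser.parse_args()

# "x" (exclusive create) still fails if the domain already exists
mode = "a" if args.append else "x"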

MRossol commented 4 years ago

I have my own version of this for some of the WTK data. It's stored across multiple (11) .h5 files. Each file has a set of unique datasets plus redundant meta and time_index datasets. In HSDS I just loaded them into a single "file", which is very useful.

Great comments, will do on the --append flag.

I did a pretty significant refactor locally that I can push up that fixes some potential pitfalls, like re-copying existing data.

jreadey commented 4 years ago

Ok - I just checked in some minor changes so you might want to merge those first.

MRossol commented 4 years ago

Added the append option (I'm used to using click, so please double-check that I integrated it correctly with sys.argv...) and changed the object_helper approach to do a single pass through the objects, skipping existing objects (needed during append).

jreadey commented 4 years ago

I had forgotten about this! I'll take a look tomorrow.

MRossol commented 4 years ago

It kept on getting pushed down my todo list... so no rush!

jreadey commented 4 years ago

There are a few edge cases where creating the links and copying the objects would cause problems, e.g. when you have multiple links pointing to the same object.

Would it cause problems to go back to the multi-pass approach?
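
For concreteness, a minimal h5py example of that edge case (hypothetical file):

# Minimal illustration: two hard links to one dataset. h5py's
# visititems() reports each object only once even when it is reachable
# under several names, so a single pass keyed on the visited name can
# miss the extra link; the multi-pass code creates such links separately.
import h5py

with h5py.File("multilink.h5", "w") as f:
    dset = f.create_dataset("dset1", data=[1, 2, 3])
    f["alias"] = dset  # second hard link to the same dataset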

MRossol commented 4 years ago

What's the multi-pass approach?

jreadey commented 4 years ago

It's like this in the current code:

# build a rough map of the file using the internal function above
logging.info("creating target objects")
fin.visititems(object_create_helper)
# copy over any attributes
logging.info("creating target attributes")
fin.visititems(object_attribute_helper)
# create soft/external links (and hardlinks not already created)
create_links(fin, fout, ctx)  # create root soft/external links
fin.visititems(object_link_helper)
if dataload:
    # copy dataset data
    logging.info("copying dataset data")
    fin.visititems(object_copy_helper)
else:
    logging.info("skipping dataset data copy (dataload is None)")

MRossol commented 4 years ago

I guess I still don't follow. I made two structural changes:

1) Instead of iterating through all of the .h5 objects multiple times (create, add attributes, add links, copy), I iterate once and create, add attributes, add links, and copy as needed.
2) I skip #1 if the obj already exists in fout.

This is essential for my use case because the file/dataset structure looks like this:

file_1.h5:
- time_index
- meta
- unique datasets...

where time_index and meta are redundant and would break your code as they already exist and can't be re-created.

MRossol commented 4 years ago

This section of the new object_helper function is just a combination of the old object_create_helper, object_link_helper, object_copy_helper, and object_attribute_helper functions.
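
A hypothetical sketch of such a combined single-pass helper (not the actual diff; it assumes fin, fout, dataload, and the old helpers are visible from the enclosing scope):

# Hypothetical sketch: one visit per object that creates it, copies its
# attributes, creates its links, and copies its data, skipping objects
# that already exist in fout (needed for append).
def object_helper(name, obj):
    if name in fout:
        logging.warning("skipping existing object: %s", name)
        return
    object_create_helper(name, obj)     # create the group/dataset
    object_attribute_helper(name, obj)  # copy attributes
    object_link_helper(name, obj)       # create soft/external links
    if dataload:
        object_copy_helper(name, obj)   # copy dataset data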

jreadey commented 4 years ago

I'll need to do some testing to convince myself the approaches are equivalent. And there's another bug with importing files with multilinks (two links to the same object) that I need to fix.
I'll try to get to this next week, but in the meantime you shouldn't be blocked from using your version.

jreadey commented 4 years ago

Hey @MRossol - could you merge in the latest changes from master and re-submit? Also, please verify the non-append case works. I got some errors testing this, but it may just have been my botched conflict resolution.

jreadey commented 4 years ago

@MRossol - Do you get this error with doing a regular (non-append) hsload?

$ hsload ~/tall.h5 /home/test_user1/test/tall2.h5
Traceback (most recent call last):
  File "/opt/anaconda3/envs/py38/bin/hsload", line 11, in <module>
    load_entry_point('h5pyd==0.7.2', 'console_scripts', 'hsload')()
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5pyd-0.7.2-py3.8.egg/h5pyd/_apps/hsload.py", line 353, in main
    load_file(fin, fout, verbose=verbose, dataload=dataload, s3path=s3path, deflate=deflate,)
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5pyd-0.7.2-py3.8.egg/h5pyd/_apps/utillib.py", line 686, in load_file
    fin.visititems(object_helper)
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5py/_hl/group.py", line 565, in visititems
    return h5o.visit(self.id, proxy)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 355, in h5py.h5o.visit
  File "h5py/defs.pyx", line 1641, in h5py.defs.H5Ovisit_by_name
  File "h5py/h5o.pyx", line 302, in h5py.h5o.cb_obj_simple
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5py/_hl/group.py", line 564, in proxy
    return func(name, self[name])
  File "/opt/anaconda3/envs/py38/lib/python3.8/site-packages/h5pyd-0.7.2-py3.8.egg/h5pyd/_apps/utillib.py", line 639, in object_helper
    if name in fout:
UnboundLocalError: local variable 'fout' referenced before assignment
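
That UnboundLocalError usually means object_helper assigns fout somewhere in its own body, which makes fout local to the function. A minimal reproduction of this class of error (hypothetical, not the actual utillib.py code):

# Hypothetical reproduction: any assignment to fout inside the function
# makes fout local to it, so reading fout before that assignment raises
# UnboundLocalError instead of finding the enclosing scope's fout.
fout = {}  # stands in for the target domain

def object_helper(name, obj):
    if name in fout:  # raises UnboundLocalError
        return
    fout = obj        # this assignment is what makes fout local

object_helper("dset1", None)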

MRossol commented 4 years ago

@jreadey All bugs fixed:

(hsds) [mrossol@el3 h5pyd]$ hsload --link ${S3_FILE} ${HSDS_FILE}
(hsds) [mrossol@el3 h5pyd]$ echo $S3_FILE
s3://nrel-pds-wtk/vietnam/wtk_vietnam_2016.h5
(hsds) [mrossol@el3 h5pyd]$ echo $HSDS_FILE
/nrel/wtk/vietnam/wtk_vietnam_2016.h5
(hsload) [mrossol@el3 rechunk]$ hsls /nrel/wtk/vietnam/wtk_vietnam_2016.h5
/ Group
coordinates Dataset {1037610, 2}
inversemoninobukhovlength_2m Dataset {8784, 1037610}
meta Table {1037610}
precipitationrate_0m Dataset {8784, 1037610}
pressure_0m Dataset {8784, 1037610}
pressure_100m Dataset {8784, 1037610}
pressure_200m Dataset {8784, 1037610}
relativehumidity_2m Dataset {8784, 1037610}
temperature_100m Dataset {8784, 1037610}
temperature_10m Dataset {8784, 1037610}
temperature_120m Dataset {8784, 1037610}
temperature_140m Dataset {8784, 1037610}
temperature_160m Dataset {8784, 1037610}
temperature_200m Dataset {8784, 1037610}
temperature_2m Dataset {8784, 1037610}
temperature_40m Dataset {8784, 1037610}
temperature_60m Dataset {8784, 1037610}
temperature_80m Dataset {8784, 1037610}
time_index Dataset {8784}
winddirection_100m Dataset {8784, 1037610}
winddirection_10m Dataset {8784, 1037610}
winddirection_120m Dataset {8784, 1037610}
winddirection_140m Dataset {8784, 1037610}
winddirection_160m Dataset {8784, 1037610}
winddirection_200m Dataset {8784, 1037610}
winddirection_40m Dataset {8784, 1037610}
winddirection_60m Dataset {8784, 1037610}
winddirection_80m Dataset {8784, 1037610}
windspeed_100m Dataset {8784, 1037610}
windspeed_10m Dataset {8784, 1037610}
windspeed_120m Dataset {8784, 1037610}
windspeed_140m Dataset {8784, 1037610}
windspeed_160m Dataset {8784, 1037610}
windspeed_200m Dataset {8784, 1037610}
windspeed_40m Dataset {8784, 1037610}
windspeed_60m Dataset {8784, 1037610}
windspeed_80m Dataset {8784, 1037610}

jreadey commented 4 years ago

Thanks @MRossol! Changes merged to master.

jreadey commented 4 years ago

I've belatedly recalled why the multi-pass approach is needed. Say you have a dataset, dset1, with attribute a1 that contains a reference to another dataset, dset2. If dset1 is created before dset2, the creation of a1 will fail because dset2 doesn't exist yet.

Sounds bizarre, but this type of structure is commonly used for dimension lists...
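
A minimal h5py illustration of the ordering constraint (hypothetical file):

# Minimal illustration: an attribute holding an object reference can only
# be written after the referenced dataset exists, so creation order
# matters; a multi-pass copy creates all objects before any attributes.
import h5py

with h5py.File("refs.h5", "w") as f:
    dset2 = f.create_dataset("dset2", data=[4, 5, 6])
    dset1 = f.create_dataset("dset1", data=[1, 2, 3])
    dset1.attrs["a1"] = dset2.ref  # requires dset2 to already exist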