dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

Partial Zarr Directory Updates in Dandi and LINC #1474

Open aaronkanzer opened 4 months ago

aaronkanzer commented 4 months ago

Cc @dstansby @kabilar @satra @yarikoptic @waxlamp @balbasty

In the LINC project, @dstansby encountered a scenario where an update was requested for a portion of a Zarr directory. Currently, DANDI and LINC treat a Zarr directory as a single object tree, requiring the entire directory to be downloaded even for updates that only modify specific pieces.

Downloading the entire Zarr directory can be inefficient, especially for large datasets where only a small portion needs updating.

The purpose of this issue is to capture the need for a mechanism that allows partial updates of Zarr directories within DANDI and LINC.

Relatedly, @satra suggested initially using zarrita to explore sharding, perhaps with the LINC project as a place to test.

dstansby commented 4 months ago

I think there are two separate, but related issues here (and solving 2. depends on solving 1. first):

  1. Updating a single file within a dandiset. It would be very useful to document the workflow for doing this. So far the best I have come up with is:
  2. The same as above, but for editing the metadata of a zarr directory. As @aaronkanzer says, it's treated as a single object tree, so there's no obvious way to download only the metadata file and then re-upload it.
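
To make the metadata-only case concrete: in the Zarr v2 on-disk layout, attributes live in a small `.zattrs` JSON file stored next to the chunk files, so a metadata edit touches a single file. The following is a minimal sketch, not dandi-cli's actual implementation, of how a client could detect that only that file needs re-uploading. It assumes per-file MD5 digests are available as a manifest; the helper names `md5_manifest`, `changed_files`, and `demo` are hypothetical.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def md5_manifest(root: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to its MD5 hex digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.md5(path.read_bytes()).hexdigest()
    return manifest


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Return paths that are new or whose digest differs (deletions are ignored)."""
    return sorted(path for path, digest in after.items() if before.get(path) != digest)


def demo() -> list[str]:
    """Build a toy zarr-like directory, edit only its attributes, list what changed."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "example.zarr"
        root.mkdir()
        (root / ".zattrs").write_text(json.dumps({"description": "old"}))
        (root / "0.0").write_bytes(b"\x00" * 1024)  # stand-in for a chunk file
        before = md5_manifest(root)

        # Edit the metadata only; the chunk file is left untouched.
        attrs = json.loads((root / ".zattrs").read_text())
        attrs["description"] = "new"
        (root / ".zattrs").write_text(json.dumps(attrs))

        return changed_files(before, md5_manifest(root))
```

In this toy layout, calling `demo()` reports only the attributes file as changed, which is the set a partial upload would need to transfer.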

yarikoptic commented 4 months ago

  1. Non-zarr case: this is possible, just inconvenient because of the "Manually re-create the directory structure" step. I have created a dedicated issue to boil down/implement the desired convenience.

NB: while trying different URI schemes, I found a "workaround side-effect": if the path is given as a glob (which might not be generally applicable or desired), then we get the leading path too

❯ dandi download https://dandiarchive.org/dandisets/000027/versions/0.210831.2033/assets/\?glob\=sub-RAT123/sub-RAT123.nwb
PATH                      SIZE     DONE    DONE% CHECKSUM STATUS          MESSAGE
sub-RAT123/sub-RAT123.nwb 18.8 kB  18.8 kB  100%    ok    done                   
Summary:                  18.8 kB  18.8 kB                1 done                 
                                   100.00%      
❯ datalad clone https://github.com/dandisets/000027
[INFO   ] Remote origin not usable by git-annex; setting annex-ignore                                                                                                               
[INFO   ] https://github.com/dandisets/000027/config download failed: Not Found                                                                                                     
[INFO   ] access to 2 dataset siblings dandi-dandisets-dropbox, dandiapi not auto-enabled, enable with:
|       datalad siblings -d "/tmp/000027" enable -s SIBLING 
install(ok): /tmp/000027 (dataset)
❯ cd 000027
❯ datalad get sub-RAT123/sub-RAT123.nwb
get(ok): sub-RAT123/sub-RAT123.nwb (file) [from web...]                                                                                                                             
❯ ls -lL sub-RAT123/sub-RAT123.nwb
-r--r--r-- 1 yoh yoh 18792 Jul 18 07:51 sub-RAT123/sub-RAT123.nwb
# now edit / dandi upload
  2. Zarr case: in general it is possible, but very inconvenient, as it would require downloading the full zarr first.

For an "ultimate" solution, we need to add some basic zarr navigator, related

to make it easier for a user to get the desired "full" URL to a specific zarr component.
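
As a rough illustration of composing such a URL programmatically, the sketch below rebuilds the glob-style asset URL used in the `dandi download` example above. The helper name `asset_glob_url` is hypothetical and not part of dandi-cli, and whether the same query form can address a path inside a zarr asset is not established here; it only mirrors the non-zarr URL shape shown in this thread.

```python
from urllib.parse import quote


def asset_glob_url(
    dandiset_id: str,
    version: str,
    pattern: str,
    base: str = "https://dandiarchive.org",
) -> str:
    """Compose a dandiset asset URL that selects assets by a glob pattern.

    Path separators and the '*' wildcard are kept literal; other reserved
    characters in the pattern are percent-encoded.
    """
    encoded = quote(pattern, safe="/*")
    return f"{base}/dandisets/{dandiset_id}/versions/{version}/assets/?glob={encoded}"


url = asset_glob_url("000027", "0.210831.2033", "sub-RAT123/sub-RAT123.nwb")
print(url)
```

When passing the result to a shell command, quote it (or escape `?` and `=` as in the transcript above) so the shell does not interpret those characters.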

As for updating metadata only, it would be quite tricky AFAIK to implement correctly, but editing metadata is indeed a valid use case. ATM it is 'possible' only via a full zarr download, and I believe we would avoid re-uploading any file that was not modified (@jwodder might correct me if I am wrong). As for partial download and upload of zarr, I think we would also need support for that in the client:

kabilar commented 3 months ago

Thanks team. Moving this issue to the DANDI Client repo, as it doesn't seem like we would need changes to the web app or REST API.