dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

[Feature] Support upload of Zarr-backend NWB files #1310

Open CodyCBakerPhD opened 1 year ago

CodyCBakerPhD commented 1 year ago

Possibly related to #1307, but specific to NWB format files using the Zarr-backend

I'd like to upload a .nwb file written using PyNWB+HDMF-Zarr to the DANDI archive, but the dandi upload command did not recognize the file at all, and did not even warn that it had been found and skipped.

An example file for testing purposes may be found here, which was forced through using devel options, specifically --allow-any-path

CodyCBakerPhD commented 1 year ago

Some work may also be needed on how NWB assets with a Zarr backend are represented. No 'i' info button appears on the asset, and the API fails to recognize the file as a single asset; instead, every individual item is its own blob asset. I had initially expected this given the underlying structure of a Zarr store, but on Slack Roni indicated that each Zarr chunk is not supposed to be a separate AssetBlob, which is what we are seeing below.

from dandi.dandiapi import DandiAPIClient

client = DandiAPIClient(api_url="https://api-staging.dandiarchive.org/api")
dandiset = client.get_dandiset(dandiset_id="204919")

dandiset.get_asset_by_path(path="test_read_nwbfile/test_hdf5.nwb") 

works as expected, but

dandiset.get_asset_by_path(path="test_read_nwbfile/test_zarr.nwb")

gives

ValueError                                Traceback (most recent call last)
File /opt/conda/lib/python3.10/site-packages/dandi/dandiapi.py:1155, in RemoteDandiset.get_asset_by_path(self, path)
   1152 try:
   1153     # Weed out any assets that happen to have the given path as a
   1154     # proper prefix:
-> 1155     (asset,) = (
   1156         a for a in self.get_assets_with_path_prefix(path) if a.path == path
   1157     )
   1158 except ValueError:

ValueError: not enough values to unpack (expected 1, got 0)

During handling of the above exception, another exception occurred:

NotFoundError                             Traceback (most recent call last)
Cell In[21], line 1
----> 1 dandiset.get_asset_by_path(path="test_read_nwbfile/test_zarr.nwb")

File /opt/conda/lib/python3.10/site-packages/dandi/dandiapi.py:1159, in RemoteDandiset.get_asset_by_path(self, path)
   1155     (asset,) = (
   1156         a for a in self.get_assets_with_path_prefix(path) if a.path == path
   1157     )
   1158 except ValueError:
-> 1159     raise NotFoundError(f"No asset at path {path!r}")
   1160 else:
   1161     return asset

NotFoundError: No asset at path 'test_read_nwbfile/test_zarr.nwb'

and if I do

list(dandiset.get_assets())

I see

[RemoteBlobAsset(client=<dandi.dandiapi.DandiAPIClient object at 0x7fc5162c4400>, identifier='fd8e3782-b0c7-4bd5-89fe-e2acc0263744', path='test_read_nwbfile/test_hdf5.nwb', size=197512, created=datetime.datetime(2023, 7, 17, 15, 31, 55, 641893, tzinfo=datetime.timezone.utc), modified=datetime.datetime(2023, 7, 17, 15, 58, 44, 778333, tzinfo=datetime.timezone.utc), blob='6a61bab5-0662-49e5-be46-0b9ee9a27297', dandiset_id='204919', version_id='0.230717.1558'),
 RemoteBlobAsset(client=<dandi.dandiapi.DandiAPIClient object at 0x7fc5162c4400>, identifier='a78dfc02-9cd5-402a-83c8-5006fb18d5e8', path='test_read_nwbfile/test_zarr.nwb/acquisition/ElectricalSeries/data/0.0', size=46, created=datetime.datetime(2023, 7, 17, 15, 57, 45, 173503, tzinfo=datetime.timezone.utc), modified=datetime.datetime(2023, 7, 17, 15, 58, 44, 787050, tzinfo=datetime.timezone.utc), blob='1419744b-36f6-4c28-a850-71d381fc90e5', dandiset_id='204919', version_id='0.230717.1558'),
 RemoteBlobAsset(client=<dandi.dandiapi.DandiAPIClient object at 0x7fc5162c4400>, identifier='cd9faf76-cb4e-4849-b9eb-c838958676d1', path='test_read_nwbfile/test_zarr.nwb/acquisition/ElectricalSeries/electrodes/0', size=56, created=datetime.datetime(2023, 7, 17, 15, 57, 45, 215932, tzinfo=datetime.timezone.utc), modified=datetime.datetime(2023, 7, 17, 15, 58, 44, 795464, tzinfo=datetime.timezone.utc), blob='e8131c7e-095d-4242-ab4c-1658c8c3f5c5', dandiset_id='204919', version_id='0.230717.1558'),
 RemoteBlobAsset(client=<dandi.dandiapi.DandiAPIClient object at 0x7fc5162c4400>, identifier='383ece04-8db0-4207-843a-86109259a5cd', path='test_read_nwbfile/test_zarr.nwb/acquisition/ElectricalSeries/starting_time/0', size=24, created=datetime.datetime(2023, 7, 17, 15, 57, 45, 222857, tzinfo=datetime.timezone.utc), modified=datetime.datetime(2023, 7, 17, 15, 58, 44, 909428, tzinfo=datetime.timezone.utc), blob='a1f46f4a-d8ec-4183-bd8c-8ed530e963e4', dandiset_id='204919', version_id='0.230717.1558'),
 RemoteBlobAsset(client=<dandi.dandiapi.DandiAPIClient object at 0x7fc5162c4400>, identifier='871186e8-ac63-4c5e-b914-8b9246f7326a', path='test_read_nwbfile/test_zarr.nwb/file_create_date/0', size=56, created=datetime.datetime(2023, 7, 17, 15, 57, 45, 253174, tzinfo=datetime.timezone.utc), modified=datetime.datetime(2023, 7, 17, 15, 58, 44, 806273, tzinfo=datetime.timezone.utc), blob='9d7115fb-3133-437d-9168-7058e8fd84b6', dandiset_id='204919', version_id='0.230717.1558'),

....

and so on (the entire NWB file content listed out as separate blobs)
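To make the symptom easier to see, here is a minimal sketch that groups asset paths by the first path component ending in ".nwb". The helper name `group_by_store` is hypothetical and not part of dandi-cli; with a live client the paths could be collected as `[a.path for a in dandiset.get_assets()]`, but here they are hard-coded from the listing above.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_store(paths, suffix=".nwb"):
    """Group asset paths under the first ancestor directory ending in `suffix`.

    Hypothetical helper for illustration only; not part of dandi-cli.
    Paths with no such ancestor are grouped under themselves.
    """
    groups = defaultdict(list)
    for path in paths:
        parts = PurePosixPath(path).parts
        root = path
        # Only look at ancestor components, not the final file name.
        for i, part in enumerate(parts[:-1]):
            if part.endswith(suffix):
                root = "/".join(parts[: i + 1])
                break
        groups[root].append(path)
    return dict(groups)


# Paths copied from the asset listing above.
paths = [
    "test_read_nwbfile/test_hdf5.nwb",
    "test_read_nwbfile/test_zarr.nwb/acquisition/ElectricalSeries/data/0.0",
    "test_read_nwbfile/test_zarr.nwb/acquisition/ElectricalSeries/electrodes/0",
    "test_read_nwbfile/test_zarr.nwb/file_create_date/0",
]
groups = group_by_store(paths)
for root, members in groups.items():
    print(f"{root}: {len(members)} blob asset(s)")
```

The HDF5 file is a single asset at its own path, while everything under test_zarr.nwb is a separate blob with no asset at the store root, which is why get_asset_by_path fails for it.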

CodyCBakerPhD commented 1 year ago

The context for the asset ID part is that I want to be able to stream the content using fsspec, just like with HDF5 files

PyNWB can easily do this given the S3 asset of the HDF5, so I had thought that it would be just as easy if I had the asset ID of the Zarr folder (the 'test_zarr.nwb' file)

satra commented 1 year ago

@CodyCBakerPhD - i'm pretty positive what's happening here is the non-recognition of zarr on the CLI side and hence it's simply using the non-zarr route, which then the server interprets as individual blobs. so a fix on the CLI side that treats it as zarr would fix it. can you simply try adding the .zarr extension to test?

CodyCBakerPhD commented 1 year ago

Well, that is interesting...

Making a copy of the file named test_zarr.nwb.zarr (I also confirmed the same behavior with test_zarr.zarr) lets dandi upload run as expected


however, nothing new appears on the dandiset view: https://gui-staging.dandiarchive.org/dandiset/204919/0.230717.1558/files?location=test_read_nwbfile%2F

or in the API requests.

I also confirmed that the asset made it to the bucket by attempting a re-upload: the CLI reports that the file already exists and does not upload it again

satra commented 1 year ago

@CodyCBakerPhD - you have stumped me. perhaps @AlmightyYakob has an answer to why that asset doesn't show up.

jjnesbitt commented 1 year ago

The file is present. The link you provided points to a previously published version, so it won't show any files uploaded to the draft version. You can see the file here: https://gui-staging.dandiarchive.org/dandiset/204919/draft/files?location=test_read_nwbfile
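The difference between the two links is only the version segment of the URL. A small sketch, assuming the URL layout seen in the two links in this thread (the `files_url` helper and the `GUI_STAGING` constant are hypothetical names, not part of any DANDI package):

```python
from urllib.parse import quote

# Base URL taken from the links in this thread.
GUI_STAGING = "https://gui-staging.dandiarchive.org"


def files_url(dandiset_id, version="draft", location=""):
    """Build a file-browser URL for a given dandiset version.

    Hypothetical helper; the URL layout is inferred from the links above.
    """
    url = f"{GUI_STAGING}/dandiset/{dandiset_id}/{version}/files"
    if location:
        # Percent-encode everything, including "/", as in the original link.
        url += "?location=" + quote(location, safe="")
    return url


published = files_url("204919", "0.230717.1558", "test_read_nwbfile/")
draft = files_url("204919", "draft", "test_read_nwbfile")
print(published)
print(draft)
```

New uploads land in the draft version, so the published-version URL keeps showing the file list as it was at publication time.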

CodyCBakerPhD commented 1 year ago

@AlmightyYakob Aha, yes that was it! Thank you for the sanity check

Would this workflow perhaps 'simply work' if I just naively add ".nwb" to the list of accepted Zarr entities? I'll try that out locally and see
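As a rough sketch of that idea, and not dandi-cli's actual implementation: the names `classify`, `looks_like_zarr_store`, and `ZARR_MARKERS`, and the default suffix tuple, are all assumptions made for illustration. The suffix check is applied only to directories, so an HDF5-backed .nwb file would still be treated as a regular blob. A content-based check on the Zarr metadata files (.zgroup or .zarray for Zarr v2, zarr.json for v3) is shown as an alternative to relying on the extension.

```python
import tempfile
from pathlib import Path

# Hypothetical names for illustration; not dandi-cli's constants or helpers.
ZARR_MARKERS = (".zgroup", ".zarray", "zarr.json")


def looks_like_zarr_store(path):
    """Return True if `path` is a directory with Zarr metadata at its root."""
    p = Path(path)
    return p.is_dir() and any((p / m).is_file() for m in ZARR_MARKERS)


def classify(path, zarr_suffixes=(".zarr", ".ngff")):
    """Classify a local path as 'zarr', 'directory', or 'blob' by suffix."""
    p = Path(path)
    if p.is_dir():
        return "zarr" if p.suffix in zarr_suffixes else "directory"
    return "blob"


with tempfile.TemporaryDirectory() as tmp:
    # An HDF5-backed NWB file is a single regular file.
    hdf5_file = Path(tmp) / "test_hdf5.nwb"
    hdf5_file.write_bytes(b"")
    # A Zarr-backed NWB "file" is a directory with Zarr metadata inside.
    zarr_dir = Path(tmp) / "test_zarr.nwb"
    zarr_dir.mkdir()
    (zarr_dir / ".zgroup").write_text('{"zarr_format": 2}')

    with_nwb = (".zarr", ".ngff", ".nwb")
    results = {
        "default": classify(zarr_dir),
        "with_nwb": classify(zarr_dir, zarr_suffixes=with_nwb),
        "hdf5": classify(hdf5_file, zarr_suffixes=with_nwb),
        "marker": looks_like_zarr_store(zarr_dir),
    }
    print(results)
```

With the default suffixes the Zarr-backed directory is treated as a plain directory, which matches the per-chunk blobs seen earlier. The marker check would avoid misclassifying an unrelated directory that happens to end in .nwb.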