Closed: yarikoptic closed this issue 3 years ago
actually it might be much easier --- we don't need to tune `get_dandiset_and_assets` or add a new command -- just a devel `--assetstore` option to the `download` command, and do `cp` instead of download if the asset is found in the assetstore. Records returned by `get_dandiset_and_assets` already include `girder.id`:
```
$> dandi ls -r https://dandiarchive.org/dandiset/000027/draft
2020-09-17 11:15:53,755 [ INFO] Traversing remote dandisets (000027) recursively
- ...
- attrs:
    ctime: '2020-07-21T22:00:36.362000+00:00'
    mtime: '2020-07-21T17:31:55.283394-04:00'
    size: 18792
  girder:
    id: 5f176584f63d62e1dbd06946
  ...
  name: sub-RAT123.nwb
  path: /sub-RAT123/sub-RAT123.nwb
  ...
  type: file
```
So it would just be a helper function in `download` to resolve a girder id to the path in the asset store, ideally cached.
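A minimal sketch of such a helper, using only the stdlib (the function names here are hypothetical, not existing dandi-cli API; the redirect-parsing assumes the `girder-assetstore/xx/yy/uuid` layout discussed in this thread):

```python
import functools
import re
import urllib.request
from urllib.error import HTTPError

GIRDER_URL = "https://girder.dandiarchive.org"
ASSET_RE = re.compile(r"girder-assetstore/[0-9a-f]{2}/[0-9a-f]{2}/[0-9a-f]+")


def assetstore_path_from_url(url):
    """Extract the 'girder-assetstore/xx/yy/uuid' portion of a resolved
    (redirected-to) S3 URL; raise if the URL does not look as expected."""
    m = ASSET_RE.search(url)
    if m is None:
        raise ValueError(f"no assetstore path in {url!r}")
    return m.group(0)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    # Suppress redirect following so we can read the Location header ourselves
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


@functools.lru_cache(maxsize=None)
def girder_id_to_assetstore_path(file_id):
    """Resolve a girder file id to its relative path in the assetstore by
    inspecting the redirect of the /file/{id}/download endpoint (cached,
    since the mapping should not change)."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        opener.open(f"{GIRDER_URL}/api/v1/file/{file_id}/download")
    except HTTPError as e:  # the 30x response surfaces as an "error" here
        if 300 <= e.code < 400:
            return assetstore_path_from_url(e.headers["Location"])
        raise
    raise RuntimeError("expected a redirect response")
```

The `lru_cache` gives the "ideally cached" part for free within one process; a persistent cache (e.g. a JSON file) could be layered on top if the script is rerun often.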
Is it really necessary to implement this as a download option? It's not something that any normal user would need, and it would probably be cleaner if the whole thing was just a script that used dandi as a library.
Sure -- could just be an outside script. I just thought it might be simpler to implement it within dandi-cli as a `DANDI_DEVEL` option.
Is there a recommended way to get a list of all Dandiset IDs? I know I can use Girder's `/dandi` endpoint, but it lacks decent pagination support, and I'm not sure if one of the other API components has a better endpoint.
Also, should there be some sort of handling of Dandiset versions?
Problem: The Python on drogon is 3.5, yet this library requires 3.6.
Please just install miniconda in your HOME with any suitable Python.
re list of dandisets:

```
In [11]: from dandi import girder

In [12]: cl = girder.get_client("https://girder.dandiarchive.org")

In [13]: [r['name'] for r in cl.listFolder("5e59bb0af19e820ab6ea6c62", parentFolderType='collection')]
Out[13]:
['000003',
 '000004',
 '000005',
 '000006',
 '000007',
 '000008',
 '000009',
 '000010',
 '000011',
 '000012',
 '000013',
 '000015',
 ...
```
re miniconda: cut-pasteable example from http://handbook.datalad.org/en/latest/intro/installation.html?highlight=hpc#linux-machines-with-no-root-access-e-g-hpc-systems

```
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
# acknowledge license, keep everything at default
$ conda install -c conda-forge datalad
```
The asset for `000020/sub-746220081/sub-746220081_ses-751268829_icephys.nwb` (which should be at `girder-assetstore/47/dc/47dc1e65b27441f28d5e7d9cf1109c12`) appears to be missing from the backup for some reason, making the script fail. Should failures of the `cp` command just be ignored?
hm... for now please add an option for that -- for the "investigate metadata compliance" use case it is ok to miss a few files. BUT eventually we need to figure it out. (The backup runs daily, so it might be that there were some changes to that dandiset today? Will check later.)
The script has finished running, taking 20 minutes and 22 seconds. Should I commit it to a top-level `tools/` directory in this repository or place it somewhere else?
awesome! yes please - commit it under `tools/`.
Due to https://github.com/dandi/dandiarchive/issues/491, though, we are lacking `dandiset.yaml` in each one of those. Could you please adjust the script to use `dandi download --download dandiset.yaml` to instantiate all of them so we get them "more complete"?
Done. Pull request: https://github.com/dandi/dandi-cli/pull/244
To provide extensive testing for #226 on dandisets we already have in the archive, we need to download them all. But that would be increasingly prohibitive.

On the drogon backup server we already have a datalad dataset with the backup of S3. The idea is to "instantiate" the dandisets present in the archive as directories with symlinks (or actually actual files via `cp --reflink=always`, since it is a BTRFS CoW filesystem!) into some location on the drive, where those "symlinks" would come from the asset store located under `/mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore`. The culprit is that the asset store uses its own UUIDs, not the ids of girder's "files". So we would need to either follow the redirect from dandiarchive's girder to `https://girder.dandiarchive.org/api/v1/file/{file['id']}/download` to get the actual asset id, i.e. the `girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc` path which is on drogon -- somewhat inefficient (we could cache those, since the mapping should not change) but it would work -- or load the entire table back from mongodb and get all the mappings (more work -- probably not).
So I think the course of action could be to tune `GirderCli.get_dandiset_and_assets` so it would add resolved URLs like the above (`https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade0d784acc...`) to the returned asset records, and to add a `dandi instantiate --assetstore PATH -o TOPPATH DANDISET_ID` command (present only in `DANDI_DEVEL` mode) which would just go through all the assets of the dandiset and perform the aforementioned `cp -L --reflink=always {assetstore}/{path-within-assetstore-from-url}`.

I think it should work quite fast and would be very efficient, since no heavy data transfer would be happening and no new space would be consumed (besides filesystem-level metadata for CoW-copied files).
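Assuming the resolved URL does get added to the asset records as described, the mapping from that URL to the `cp` invocation could be as simple as the following (a hypothetical helper, not existing dandi-cli API):

```python
from urllib.parse import urlparse

# Location of the S3 backup's assetstore on drogon, per this issue
ASSETSTORE = "/mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore"


def cp_command(resolved_url, dest, assetstore=ASSETSTORE):
    """Build the reflink cp invocation for one asset from its resolved S3
    URL, whose path is expected to start with /girder-assetstore/xx/yy/uuid."""
    url_path = urlparse(resolved_url).path.lstrip("/")
    # path within the assetstore = URL path minus the "girder-assetstore/"
    # prefix, since the local ASSETSTORE root already ends with that name
    prefix, _, within = url_path.partition("/")
    if prefix != "girder-assetstore":
        raise ValueError(f"unexpected URL layout: {resolved_url}")
    return ["cp", "-L", "--reflink=always", f"{assetstore}/{within}", str(dest)]
```

The instantiate command would then just run this for every asset of the dandiset (via `subprocess.run(..., check=True)`), writing each destination under `TOPPATH/DANDISET_ID/{asset path}`.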