dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

"instantiate" dandisets from the "backup" #243

Closed yarikoptic closed 3 years ago

yarikoptic commented 3 years ago

To provide extensive testing for #226 on the dandisets we already have in the archive, we need to download them all, but that would be prohibitively expensive.

On drogon backup server we already have a datalad dataset with the backup of S3.

The idea is to "instantiate" dandisets present in the archive as directories with symlinks (or actual files via cp --reflink=always, since it's a BTRFS CoW filesystem!) in some location on the drive, where those "symlinks" would point into the asset store located under /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore .
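A minimal sketch of that instantiation step (hypothetical helper name; assumes a Linux host with coreutils cp). On BTRFS the reflink copy is free; on filesystems without CoW support cp --reflink=always fails, so this falls back to a plain copy:

```python
import shutil
import subprocess
from pathlib import Path

def instantiate_asset(assetstore_path: Path, target: Path) -> None:
    """Place a copy of an assetstore file at the target path.

    Tries a CoW reflink first (no extra space on BTRFS); falls back
    to a regular copy on filesystems without reflink support.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        ["cp", "--reflink=always", str(assetstore_path), str(target)],
        capture_output=True,
    )
    if result.returncode != 0:
        # Not on a CoW filesystem (or crossing devices): plain copy.
        shutil.copy2(assetstore_path, target)
```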

The catch is that the asset store uses its own UUID, which is not the id of the girder "file". So we would need to either follow the redirect from dandiarchive's girder at https://girder.dandiarchive.org/api/v1/file/{file['id']}/download to get the actual asset id:

$> curl -I https://girder.dandiarchive.org/api/v1/file/5f176584f63d62e1dbd06946/download     
HTTP/1.1 303 See Other
Server: nginx/1.14.0 (Ubuntu)
Date: Thu, 17 Sep 2020 14:42:28 GMT
Content-Type: text/html;charset=utf-8
Content-Length: 1652
Connection: keep-alive
Allow: DELETE, GET, HEAD, OPTIONS, PATCH, POST, PUT
Girder-Request-Uid: 6e2b2cbc-c6a3-4265-8068-14151b94f9cc
Location: https://dandiarchive.s3.amazonaws.com/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIA3GIMZPVVEYHMC7MS%2F20200917%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20200917T144228Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=FwoGZXIvYXdzEBAaDKv1lZXvP9wFZRzEdCK%2FAZFXw8ch9QU9XsbYJneN4%2BIZTHUkUdu9P8xVvYlNNECKMEA25TTmbsKywS5YSDkTeY6x%2F67QDDHhbGRH89XanXXUejXHSk%2F5vU8MajEq0WV2iGMkpbYTUw9lFIlCAXnprmcDLd7LyTWCBi9tWpycrXD8YSUto3VUXG%2FTMjHOx4%2FG8CGi3I%2F1m3siPX7SQexDrmK7YpGI0jxEVYxF9sVvUtKeYF3PZWyX1b6KB0t%2BOOy4UCL%2FPRhW8gYvtHO%2F2EnxKIrrjfsFMi1WZZe2Ye%2FEJi6jx1xSE6nG%2B%2BdKQ%2BdHoigP06wBwHoLSaCdyIhVkoNGyk%2BEMt8%3D&X-Amz-Signature=cf09a8c3a24040939b756080de942bc79f171675ab70a4530a13e271b4adbe09
Strict-Transport-Security: max-age=63072000

to get that girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc path which is on drogon:

$> ls -l /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc                 
lrwxrwxrwx 1 yoh yoh 125 Jul 21 18:00 /mnt/backup/dandi/dandiarchive-s3-backup/girder-assetstore/74/0f/740feade0d784acc8ec76bb7834d80dc -> ../../../.git/annex/objects/z8/3w/MD5E-s18792--33318fd510094e4304868b4a481d4a5a/MD5E-s18792--33318fd510094e4304868b4a481d4a5a

which would work, if somewhat inefficiently (we could cache the results, since the mapping should not change), or load the entire table back from mongodb and get all the mappings at once (more work; probably not worth it).
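The redirect's Location header already carries the assetstore-relative key, so extracting it is just URL parsing. A sketch (hypothetical function name; the "girder-assetstore/" prefix is taken from the example above):

```python
from urllib.parse import urlparse

def assetstore_key(location_url: str) -> str:
    """Extract the 'girder-assetstore/xx/yy/uuid' key from a signed S3 URL."""
    # The path component of the redirect target is the S3 object key;
    # the query string only holds the (expiring) signature.
    path = urlparse(location_url).path.lstrip("/")
    if not path.startswith("girder-assetstore/"):
        raise ValueError(f"unexpected redirect target: {path}")
    return path
```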

So I think the course of action could be to

I think it should work quite fast and be very efficient, since no heavy data transfer would happen and no new space would be consumed (besides filesystem-level metadata for the CoW-copied files).

yarikoptic commented 3 years ago

Actually it might be much easier -- we don't need to tune get_dandiset_and_assets or add a new command, just add a devel --assetstore option to the download command and do a cp instead of a download when the file is found in the assetstore.
Records returned by get_dandiset_and_assets already include girder.id:

$> dandi ls -r https://dandiarchive.org/dandiset/000027/draft      
2020-09-17 11:15:53,755 [    INFO] Traversing remote dandisets (000027) recursively
- ...
- attrs:
    ctime: '2020-07-21T22:00:36.362000+00:00'
    mtime: '2020-07-21T17:31:55.283394-04:00'
    size: 18792
  girder:
    id: 5f176584f63d62e1dbd06946
...
  name: sub-RAT123.nwb
  path: /sub-RAT123/sub-RAT123.nwb
...
  type: file

so it would just be a helper function in download that resolves a girder id to its path in the asset store, ideally cached.
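One way that cached helper could look (a sketch with hypothetical names: the real lookup would issue a HEAD request against /api/v1/file/{id}/download and read the Location header, here abstracted as an injectable fetch_location so the stable id-to-path mapping can be memoized with functools.lru_cache):

```python
from functools import lru_cache
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse

# Assumed backup root on drogon, per the listing above.
ASSETSTORE_ROOT = Path("/mnt/backup/dandi/dandiarchive-s3-backup")

def make_resolver(fetch_location: Callable[[str], str]) -> Callable[[str], Path]:
    """Build a cached girder-file-id -> local-backup-path resolver.

    fetch_location(file_id) must return the S3 redirect URL for that
    file's /download endpoint; since the mapping should not change,
    results are memoized.
    """
    @lru_cache(maxsize=None)
    def resolve(file_id: str) -> Path:
        key = urlparse(fetch_location(file_id)).path.lstrip("/")
        return ASSETSTORE_ROOT / key
    return resolve
```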

jwodder commented 3 years ago

Is it really necessary to implement this as a download option? It's not something that any normal user would need, and it would probably be cleaner if the whole thing were just a script that used dandi as a library.

yarikoptic commented 3 years ago

Sure -- it could just be an outside script. I thought it might be simpler to implement it within dandi-cli as a DANDI_DEVEL option.

jwodder commented 3 years ago

Is there a recommended way to get a list of all Dandiset IDs? I know I can use Girder's /dandi endpoint, but it lacks decent pagination support, and I'm not sure if one of the other API components has a better endpoint.

Also, should there be some sort of handling of Dandiset versions?

jwodder commented 3 years ago

Problem: The Python on drogon is 3.5, yet this library requires 3.6.

yarikoptic commented 3 years ago

Please just install miniconda in your HOME with any suitable Python.

re list of dandisets:

In [11]: from dandi import girder                                                                                

In [12]: cl = girder.get_client("https://girder.dandiarchive.org")                                               

In [13]: [r['name'] for r in cl.listFolder("5e59bb0af19e820ab6ea6c62", parentFolderType='collection')]           
Out[13]: 
['000003',
 '000004',
 '000005',
 '000006',
 '000007',
 '000008',
 '000009',
 '000010',
 '000011',
 '000012',
 '000013',
 '000015',
...
yarikoptic commented 3 years ago

re miniconda: a copy-pasteable example from http://handbook.datalad.org/en/latest/intro/installation.html?highlight=hpc#linux-machines-with-no-root-access-e-g-hpc-systems

$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
# acknowledge license, keep everything at default
$ conda install -c conda-forge datalad

jwodder commented 3 years ago

The asset for 000020/sub-746220081/sub-746220081_ses-751268829_icephys.nwb (which should be at girder-assetstore/47/dc/47dc1e65b27441f28d5e7d9cf1109c12) appears to be missing from the backup for some reason, making the script fail. Should failures of the cp command just be ignored?

yarikoptic commented 3 years ago

hm... for now please add an option for that -- for "investigating metadata compliance" it is ok to miss a few files, BUT eventually we need to figure it out. (The backup runs daily, so it might be that there were some changes to that dandiset today; will check later.)
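Such an option could be sketched as follows (hypothetical names and an assumed ignore_missing flag): skip and log absent assetstore files instead of letting one missing asset fail the whole run:

```python
import logging
import shutil
from pathlib import Path

lgr = logging.getLogger(__name__)

def copy_if_present(src: Path, dst: Path, ignore_missing: bool = False) -> bool:
    """Copy src to dst; optionally tolerate a missing source file.

    Returns True if the file was copied, False if it was skipped
    because the source is absent and ignore_missing is set.
    """
    if not src.exists():
        if ignore_missing:
            lgr.warning("Assetstore file missing, skipping: %s", src)
            return False
        raise FileNotFoundError(src)
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True
```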

jwodder commented 3 years ago

The script has finished running, taking 20 minutes and 22 seconds. Should I commit it to a top-level tools/ directory in this repository or place it somewhere else?

yarikoptic commented 3 years ago

awesome! yes please - commit it under tools/.

Due to https://github.com/dandi/dandiarchive/issues/491, though, we are lacking dandiset.yaml in each one of those. Could you please adjust the script to use dandi download --download dandiset.yaml so we instantiate all of them "more complete"?

jwodder commented 3 years ago

Done. Pull request: https://github.com/dandi/dandi-cli/pull/244