DANDI sync to MIT Engaging

kabilar commented 1 month ago

@puja-trivedi @aaronkanzer and I are using this issue to track our work to sync the DANDI public bucket to MIT Engaging.

Requirements

Possible solutions

Globus collections
scp
rsync
s5cmd
dandi-cli

Open questions

How will we handle MFA for programmatic access?
Basic Globus endpoints will transfer data unencrypted
Globus AWS S3 connector would need to be added to our subscription to access data on S3. Is this add-on currently included in the MIT Globus subscription?

Resources

satra commented 1 month ago

s5cmd and the dandi cli should also be solutions, no? they both offer parallelism that scp and rsync don't.

kabilar commented 1 month ago

Thanks, Satra. Yes, definitely. Exploring s5cmd next.

kabilar commented 1 month ago

@puja-trivedi @aaronkanzer s5cmd installation on Engaging got stuck again. Will need to try a different mechanism.

satra commented 1 month ago

@kabilar - s5cmd has prebuilt binaries and also conda. how are you installing on engaging?

kabilar commented 1 month ago

Hi @satra, we tried conda but it wasn't able to resolve the download. Will try the binaries.

yarikoptic commented 1 month ago

I haven't looked into the Globus option, but given the size of our bucket (in number of keys), the others AFAIK wouldn't be sufficient for efficient incremental backups. FTR:

https://github.com/peak/s5cmd/issues/746

We would need a tool that makes use of that extra service @satra mentioned (can't recall the name), which we have enabled and which tracks changes to our S3 bucket. Before initiating a full backup, it might be worth first deciding how the incrementals are to be done, so that the initial backup is done with future incrementals in mind (e.g., maybe by capturing the state/position in that extra service).
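One way to picture "capturing the state" for later incrementals is to keep the listing that the full copy was made from and diff it against a newer listing. The sketch below is only an illustration under assumptions: it presumes the change-tracking service can produce a CSV listing with `key` and `etag` columns, and the column names, the `load_listing` and `diff_listings` helpers, and the sample keys are all hypothetical. It is not a tool anyone in this thread has adopted.

```python
import csv
import io


def load_listing(csv_text):
    """Parse a CSV listing with 'key' and 'etag' columns into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["key"]: row["etag"] for row in reader}


def diff_listings(old, new):
    """Return (to_copy, to_delete) needed to turn a mirror of `old` into `new`."""
    # A key must be copied if it is new or if its ETag changed.
    to_copy = sorted(k for k, etag in new.items() if old.get(k) != etag)
    # A key must be deleted from the mirror if it no longer exists upstream.
    to_delete = sorted(k for k in old if k not in new)
    return to_copy, to_delete


# Hypothetical snapshots: the state captured at the initial backup, and a later one.
OLD = """key,etag
blobs/aaa,e1
blobs/bbb,e2
zarr/x/0,e3
"""
NEW = """key,etag
blobs/aaa,e1
blobs/bbb,e2-changed
zarr/x/1,e4
"""

to_copy, to_delete = diff_listings(load_listing(OLD), load_listing(NEW))
print(to_copy)    # ['blobs/bbb', 'zarr/x/1']
print(to_delete)  # ['zarr/x/0']
```

Holding whole listings in a dict would not scale to hundreds of millions of keys; a real run would more likely stream two sorted listings side by side, but the comparison logic stays the same.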

kabilar commented 1 month ago

Thanks, Yarik. That sounds good.

kabilar commented 1 month ago

From MIT ORCD team:

We do have an S3 license included in our Globus subscription and are looking into how we might set it up. It may require some admin things on our end. I'll keep you posted with updates.

kabilar commented 1 month ago

@puja-trivedi For reference, DANDI design docs.

yarikoptic commented 4 weeks ago

FTR (in case someone inquires about "scales"): 000108 alone, through the zarrs in it, points to 332,739,854 keys on S3 according to its description at https://github.com/dandisets/000108. If MIT has a Globus subscription, could someone ask Globus whether they have any factual data (benchmarks, use cases) on the S3 connector being used for incremental backup of a bucket with hundreds of millions of keys?
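To put that key count in terms of API traffic: an S3 `ListObjectsV2` call returns at most 1,000 keys, so merely enumerating 000108 takes hundreds of thousands of list requests before any data is compared or copied. The sketch below does that arithmetic; the request rate is a made-up placeholder, not a measurement of S3, Globus, or any tool.

```python
KEYS_IN_000108 = 332_739_854      # from the dandiset description cited above
MAX_KEYS_PER_LIST_PAGE = 1_000    # upper bound per S3 ListObjectsV2 request

# Integer ceiling division: number of list pages needed to see every key once.
list_requests = -(-KEYS_IN_000108 // MAX_KEYS_PER_LIST_PAGE)
print(list_requests)  # 332740

# ASSUMPTION: a purely hypothetical sustained listing rate, for illustration only.
ASSUMED_REQUESTS_PER_SECOND = 10
hours = list_requests / ASSUMED_REQUESTS_PER_SECOND / 3600
print(f"{hours:.1f} hours")  # 9.2 hours at the assumed rate
```

Any tool that rediscovers changes by re-listing the bucket pays this enumeration cost on every incremental run, which is why a change-tracking source such as an inventory is attractive here.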

Note: not sure whether we are already there, but even the "built in" AWS S3 Backup service seems to have a limitation: "The AWS Backup can be used only for buckets with less than 3 billion objects" (ref).

Meanwhile, @satra, who has (or where do we have) access to the S3 inventory associated with our sponsored bucket? (FWIW, I have so far failed to use any pre-cooked tool/script that would make use of the inventory for backups, which is odd.)

satra commented 4 weeks ago

who/where we have access to S3 inventory associated with our sponsored bucket?

it's in the sponsored bucket, so whoever has keys to that (it's dumped into a specific directory there that is only readable using the appropriate access keys).

yarikoptic commented 4 weeks ago

I guess I might not have an appropriate access key since I see only

$ s3cmd ls -l s3://dandiarchive/
                          DIR                                                    s3://dandiarchive/blobs/
                          DIR                                                    s3://dandiarchive/dandiarchive/
                          DIR                                                    s3://dandiarchive/dandisets/
                          DIR                                                    s3://dandiarchive/zarr-checksums/
                          DIR                                                    s3://dandiarchive/zarr/
2021-09-22 22:20         2137  99d1fd07269359b636b34bd402c58fbc     STANDARD     s3://dandiarchive/README.md
2021-09-22 22:20         3094  1b484c3b547a89efd67da353397556a4     STANDARD     s3://dandiarchive/index.html
2021-01-29 22:07         4008  ef4867d3c21a0034a98cd9453f14efe3     STANDARD     s3://dandiarchive/ros3test.hdf5
2021-08-12 00:48       177728  35574be1cdfe3ae4c4235d34d7348f99     STANDARD     s3://dandiarchive/ros3test.nwb

?

satra commented 4 weeks ago

should be inside: s3://dandiarchive/dandiarchive/

yarikoptic commented 4 weeks ago

Interesting! So s3cmd gives me an "empty directory" (although there are no directories on S3), which is different from "non-existing":

$ s3cmd ls -l s3://dandiarchive/dandiarchive/
                          DIR                                                    s3://dandiarchive/dandiarchive/dandiarchive/
$ s3cmd ls -l s3://dandiarchive/dandiarchives/
$

I guess I do not have access to the keys under it, but they are there.
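The difference between the two listings follows from how delimiter-based listing works on a flat key space: a "DIR" entry is just a common prefix shared by at least one key, so a prefix with no keys under it produces no output at all. The toy function below mimics that grouping in plain Python to show the idea. It does not call S3, `list_with_delimiter` is a hypothetical helper, the sample keys are invented, and it does not model access permissions.

```python
def list_with_delimiter(keys, prefix, delimiter="/"):
    """Group keys under `prefix` the way a delimiter-based S3 listing does."""
    common_prefixes, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        head, sep, _ = rest.partition(delimiter)
        if sep:
            # Something lies deeper: report only the shared prefix ("DIR").
            common_prefixes.add(prefix + head + sep)
        else:
            # No further delimiter: this is an object directly under the prefix.
            objects.append(key)
    return sorted(common_prefixes), sorted(objects)


# Invented keys: the bucket holds only flat keys, no directory objects.
KEYS = [
    "README.md",
    "blobs/000/aaa",
    "dandiarchive/dandiarchive/report.csv",
]

print(list_with_delimiter(KEYS, ""))
# (['blobs/', 'dandiarchive/'], ['README.md'])
print(list_with_delimiter(KEYS, "dandiarchive/"))
# (['dandiarchive/dandiarchive/'], [])
print(list_with_delimiter(KEYS, "dandiarchives/"))
# ([], [])
```

In this model a "DIR" line can only appear when at least one key exists below it, which is consistent with the guess that the keys are there even though they are not readable with the current credentials.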

satra commented 4 weeks ago

perhaps check through web account, so you log in with the credentials for that account.

dandi / dandi-infrastructure

DANDI sync to MIT Engaging #189