kabilar opened this issue 1 month ago
s5cmd and the dandi cli should also be solutions, no? they both offer parallelism that scp and rsync don't.
Thanks, Satra. Yes, definitely. Exploring s5cmd next.
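For reference, rough invocations of both, with a placeholder destination path and worker/job counts (exact flag spellings are best confirmed against each tool's --help):

$ # s5cmd: parallel sync of a prefix from the public bucket (anonymous access)
$ s5cmd --no-sign-request --numworkers 64 sync 's3://dandiarchive/dandisets/*' /path/on/engaging/dandisets/
$ # dandi CLI: parallel download of a single dandiset
$ dandi download --jobs 8 DANDI:000108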
@puja-trivedi @aaronkanzer s5cmd installation on Engaging got stuck again. Will need to try a different mechanism.
@kabilar - s5cmd has prebuilt binaries and also conda. how are you installing on engaging?
Hi @satra, we tried conda but it wasn't able to resolve the download. Will try the binaries.
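FWIW, if conda keeps failing to resolve it, grabbing a release binary directly may be simpler (the version and asset name below are assumptions; check https://github.com/peak/s5cmd/releases for the current ones):

$ # download and unpack a prebuilt Linux binary, then sanity-check it
$ wget https://github.com/peak/s5cmd/releases/download/v2.2.2/s5cmd_2.2.2_Linux-64bit.tar.gz
$ tar -xzf s5cmd_2.2.2_Linux-64bit.tar.gz
$ ./s5cmd version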
Didn't look into the Globus one, but given the size of our bucket (in number of keys), the others AFAIK wouldn't be sufficient for efficient incremental backups.
FTR, we would need a tool which would make use of that extra service @satra mentioned (can't recall the name), which we have enabled and which tracks changes to our S3. Before initiating the full backup, it might be worth first deciding how incrementals will be done, so that the initial backup is done with future incrementals in mind (e.g. maybe by capturing the state/position in that extra service).
Thanks, Yarik. That sounds good.
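Not a worked-out plan, just a rough sketch of what an inventory-driven incremental could look like, assuming the change-tracking service is S3 Inventory with gzipped CSV output and the optional Size/LastModifiedDate fields enabled. The inventory prefix, column order, cutoff timestamp, and backup destination below are all assumptions, not the actual configuration:

$ # 1. fetch inventory data files (the prefix under the sponsored bucket is a guess;
$ #    a real run would pick only the files listed in the latest manifest.json)
$ s5cmd cp 's3://dandiarchive/dandiarchive/dandiarchive/data/*' inventory/
$ # 2. keep only keys modified since the last backup and emit one s5cmd cp command per key
$ #    (assumes columns Bucket,Key,Size,LastModifiedDate; inventory keys are URL-encoded,
$ #    so real tooling would need to decode them first)
$ zcat inventory/*.csv.gz \
    | awk -F'","' -v since="2024-01-01T00:00:00.000Z" \
        '{key = $2; ts = $4; gsub(/"/, "", ts);
          if (ts > since) print "cp s3://dandiarchive/" key " /backup/dandiarchive/" key}' \
        > incremental.cmds
$ # 3. replay the generated commands in parallel
$ s5cmd run incremental.cmds

The "since" cutoff (or whatever state marker that service provides) is exactly the position that would be worth capturing at the time of the initial full backup.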
From MIT ORCD team:
We do have an S3 license included in our Globus subscription and are looking into how we might set it up. It may require some admin things on our end. I'll keep you posted with updates.
@puja-trivedi For reference, DANDI design docs.
FTR (in case someone inquires about "scale") -- 000108 alone, through the zarrs in it, points to 332,739,854 keys on S3 according to its description on https://github.com/dandisets/000108 . If MIT has a Globus subscription, could someone ask Globus whether they have any factual data (benchmarks, use cases) on the S3 connector being used for incremental backup of a bucket with hundreds of millions of keys?
Meanwhile, @satra, who has access to / where do we find the S3 inventory associated with our sponsored bucket? (FWIW, I have so far failed to find/use a pre-cooked tool or script that would make use of the inventory for backups, oddly enough.)
who has access to / where do we find the S3 inventory associated with our sponsored bucket?
It's in the sponsored bucket, so whoever has keys to that has access (it's dumped into a specific directory there that is only readable using the appropriate access keys).
I guess I might not have an appropriate access key since I see only
$ s3cmd ls -l s3://dandiarchive/
DIR s3://dandiarchive/blobs/
DIR s3://dandiarchive/dandiarchive/
DIR s3://dandiarchive/dandisets/
DIR s3://dandiarchive/zarr-checksums/
DIR s3://dandiarchive/zarr/
2021-09-22 22:20 2137 99d1fd07269359b636b34bd402c58fbc STANDARD s3://dandiarchive/README.md
2021-09-22 22:20 3094 1b484c3b547a89efd67da353397556a4 STANDARD s3://dandiarchive/index.html
2021-01-29 22:07 4008 ef4867d3c21a0034a98cd9453f14efe3 STANDARD s3://dandiarchive/ros3test.hdf5
2021-08-12 00:48 177728 35574be1cdfe3ae4c4235d34d7348f99 STANDARD s3://dandiarchive/ros3test.nwb
?
should be inside: s3://dandiarchive/dandiarchive/
Interesting! So s3cmd gives me an "empty directory" (although there are no directories on S3), which is different from 'non-existing':
$ s3cmd ls -l s3://dandiarchive/dandiarchive/
DIR s3://dandiarchive/dandiarchive/dandiarchive/
$ s3cmd ls -l s3://dandiarchive/dandiarchives/
$
I guess I do not have access to the keys under it, but they are there.
Perhaps check through the web account, i.e. log in to the console with the credentials for that account.
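If someone does turn out to have the keys, a quick command-line check could look like this (credentials are placeholders; the prefix is the one from the listing above):

$ # list the inventory prefix with the sponsored-bucket credentials exported
$ AWS_ACCESS_KEY_ID=<sponsored-key> AWS_SECRET_ACCESS_KEY=<sponsored-secret> \
    s5cmd ls 's3://dandiarchive/dandiarchive/dandiarchive/*'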
@puja-trivedi @aaronkanzer and I are using this issue to track our work to sync the DANDI public bucket to MIT Engaging.
Requirements
Possible solutions
Open questions
Resources