Open yarikoptic opened 1 month ago
@yarikoptic Problem: wget's "recursive" mode is limited to a maximum depth of 5 directories by default. Possible ways to address this are:
- Include --level=inf in the displayed wget commands to disable the maximum depth
  - [...] --no-parent option, which I think would result in wget trying to download everything listed on dandidav.
- Determine the maximum depth of the hierarchy the message is displayed for and use that number as the --level value
- Don't include a --level option in the displayed wget commands
  - [...] wget commands not fetching everything.
- Pick a relatively large fixed depth (10? 20?) and use that as the --level value in all displayed wget commands
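To make the first and last options above concrete, here is a rough sketch of a helper that assembles the displayed command with the depth as a parameter. The function name wget_cmd and its default of inf are hypothetical, not something dandidav currently does; --recursive, --no-parent and --level are standard wget options. The helper only prints the command and downloads nothing.

```shell
# Hypothetical helper: print (do not run) a recursive wget command for a URL.
# The optional second argument is the recursion depth; "inf" lifts wget's
# default limit of 5 levels.
wget_cmd() {
    url=$1
    level=${2:-inf}
    printf 'wget --recursive --no-parent --level=%s %s\n' "$level" "$url"
}

# Unlimited depth, then a fixed depth of 20.
wget_cmd https://webdav.dandiarchive.org/dandisets/000027/releases/0.210831.2033/
wget_cmd https://webdav.dandiarchive.org/dandisets/000027/releases/0.210831.2033/ 20
```

Keeping the depth as a parameter would let the same helper cover a fixed depth, a computed depth, or inf, whichever is eventually chosen.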
@yarikoptic Further problems:
- Because the actual files are on different domains, wget downloads them under different directory hierarchies, and there doesn't seem to be an option to place them "together".
- When downloading a Dandiset version or folder therein, asset metadata also gets downloaded (because there's a link to such metadata in the web view), and the --reject "index.html*" option needed to not save directory listings also results in the metadata being deleted after it's downloaded, leaving behind a tree of empty directories. There may be a way to prevent this with the --exclude-directories option, but I can't get it to work for this.
At the moment, my best wget command is:
wget \
    --recursive \
    --span-hosts \
    --domains=webdav.dandiarchive.org,api.dandiarchive.org \
    --no-parent \
    --content-disposition \
    --reject "index.html*" \
    https://webdav.dandiarchive.org/dandisets/000027/releases/0.210831.2033/
which downloads:
./
├── api.dandiarchive.org/
│   └── api/
│       └── dandisets/
│           └── 000027/
│               └── versions/
│                   └── 0.210831.2033/
│                       └── assets/
│                           └── 1c095f5f-d1e2-45db-b807-fdcfea08c6de/
├── dandiarchive.s3.amazonaws.com/
│   └── blobs/
│       └── 2db/
│           └── af0/
│               └── sub-RAT123.nwb
└── webdav.dandiarchive.org/
    └── dandisets/
        └── 000027/
            └── releases/
                └── 0.210831.2033/
                    ├── dandiset.yaml
                    └── sub-RAT123/
> @yarikoptic Problem: wget's "recursive" mode is limited to a maximum depth of 5 directories by default.
I had no idea! I think we are doomed to add/use --level=inf, since we never really cared about recording/reflecting anywhere the depth of the zarr*. Indeed, --no-parent would be mandatory and thus had better be "near" it in the line. We could also add --quota with e.g. 101% of the zarr size, but I'm not sure whether it's a good idea or whether it really adds any level of protection.
* In hindsight, the depth might have been suggested for inclusion in the checksum, but that likely would be "too much". Do you think it would be useful to discuss this aspect?
Actually -- we are in control of manifest generation, we can extract/include that info in the manifest!
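If the depth were recorded at manifest-generation time, it could be computed from the entry paths roughly as follows. This is a minimal sketch, assuming the manifest can be reduced to one relative, slash-separated path per line; that input format and the name max_depth are assumptions for illustration, not the actual manifest format. The result counts path components and would only approximate the value to pass as --level.

```shell
# Hypothetical helper: read relative paths on stdin, one per line, and
# print the largest number of path components found (0 for empty input).
max_depth() {
    awk -F/ '
        {
            n = NF
            if ($NF == "") n--      # ignore a trailing slash
            if (n > max) max = n
        }
        END { print max + 0 }
    '
}

# Example with made-up entry paths; the deepest one has 4 components.
printf '%s\n' '0/0/0' '0/0/1' 'scale0/0/0/0' | max_depth   # prints 4
```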
@yarikoptic
> we are in control of manifest generation, we can extract/include that info in the manifest!
I got the impression you wanted this for Dandisets and folders within them as well, not just Zarrs.
> @yarikoptic
> > we are in control of manifest generation, we can extract/include that info in the manifest!
> I got the impression you wanted this for Dandisets and folders within them as well, not just Zarrs.
Right, I did indeed want that... for those we are doomed to just hope that --no-parent works out and wget does not crawl away from the original hierarchy.
@yarikoptic I did manage to figure out an rclone command to download a folder nicely:
rclone copy \
    --webdav-url https://webdav.dandiarchive.org \
    :webdav:dandisets/000027/releases/0.210831.2033/ \
    0.210831.2033/
Should we use this instead of wget? Are there any other download commands we should list or consider listing in addition or instead?
@yarikoptic Ping.
Depending on how we present it -- maybe we want both? E.g. if there could be multiple tabs (wget, rclone, dandi cli, and maybe even python etc.), then people could choose whichever they have or like. I didn't look into whether there is a simple HTML/CSS/JS way to make that happen, though. WDYT?
@yarikoptic Worrying about how the data is presented is getting ahead of ourselves and ultimately not that important. I'm currently interested in what data should be presented.
Then let's present both -- the ugly wget and the neater, WebDAV-aware rclone.
Prompted by @jwodder in https://github.com/dandi/dandi-archive/issues/1993#issuecomment-2273593108: it would be nice UX if, similarly to how https://datasets.datalad.org/ informs the user about datalad install instructions, we could provide here a wget invocation to download an entire zarr, or otherwise a specific dandiset or its folder. We also already have [...], which similarly suggests integration with external services to instruct users on how to interact with particular files or zarrs.