dandi / dandi-cli

DANDI command line client to facilitate common operations
https://dandi.readthedocs.io/
Apache License 2.0

Add support for .zarr "files" upload #852

Closed: yarikoptic closed this issue 2 years ago

yarikoptic commented 2 years ago
jwodder commented 2 years ago

@yarikoptic In order to show only a single entry in pyout when downloading a Zarr, I think the code would need to call _download_file() in a separate thread for each file in the Zarr and then interleave & combine the results. Assuming I can figure out a good way to do that, the biggest problem I can think of would be how we would limit the maximum number of concurrent download threads, which is currently managed by pyout. Unless there's some way to get pyout to do the entry-combination for us?

yarikoptic commented 2 years ago

Can _download_file(), in a pyout'ed thread, when handling a Zarr, just start its own pool of threads for _download_file()-ing the individual files, and then report an aggregate progress back to pyout?

yarikoptic commented 2 years ago

Attn @satra on the above https://github.com/dandi/dandi-cli/issues/852#issuecomment-1012308403 question about empty directories. Smells like we should ignore them and exclude them from the checksum computation.

satra commented 2 years ago

Smells like we should ignore them and exclude them from the checksum computation.

i would agree with that.

the only thing is that we should perhaps test what an empty array (https://zarr.readthedocs.io/en/stable/api/creation.html#zarr.creation.empty) looks like in a nested directory store. and whether zarr's checksum of a zarr dataset changes or not.
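
For reference, a minimal sketch of that experiment (assuming zarr v2's NestedDirectoryStore; names and shapes are illustrative):

    import os
    import zarr

    # zarr.empty() allocates no chunk data, so a NestedDirectoryStore should
    # end up containing only metadata (.zarray) -- no chunk files and no
    # nested chunk directories to worry about in a checksum.
    store = zarr.NestedDirectoryStore("example.zarr")
    z = zarr.empty((1000, 1000), chunks=(100, 100), dtype="f4", store=store)

    for root, dirs, files in os.walk("example.zarr"):
        for name in files:
            print(os.path.join(root, name))  # expect just example.zarr/.zarray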

jwodder commented 2 years ago

@yarikoptic

Can _download_file(), in a pyout'ed thread, when handling a Zarr, just start its own pool of threads for _download_file()-ing the individual files, and then report an aggregate progress back to pyout?

That's the basic plan, but the problem is determining the size of the thread pool. Normally, on a quad-core computer, pyout only runs 8 threads at once, so there are only at most 8 concurrent downloads. If we set the size of the Zarr thread pools to some n, we could end up with 8n concurrent downloads — and if we try to avoid that by only downloading one file in a Zarr at a time, we may end up taking longer than normal to download everything if there's ever a point where fewer than 8 downloads are running in total.
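
One conventional way out of the "8n" blow-up (a sketch, not dandi-cli code) is to have every per-file worker, whether in the main pool or a Zarr's sub-pool, acquire a single shared semaphore, so total concurrency is capped globally no matter how the pools nest:

    import threading
    from concurrent.futures import ThreadPoolExecutor

    MAX_CONCURRENT_DOWNLOADS = 8
    download_slots = threading.BoundedSemaphore(MAX_CONCURRENT_DOWNLOADS)

    def download_one(url, dest):
        with download_slots:      # blocks until a global slot frees up
            ...                   # actual byte transfer would go here

    def download_zarr(entries, threads_per_zarr=4):
        # the sub-pool has its own size, but its workers still compete
        # for the same global slots as every other download thread
        with ThreadPoolExecutor(max_workers=threads_per_zarr) as pool:
            for url, dest in entries:
                pool.submit(download_one, url, dest)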

satra commented 2 years ago

@jwodder - for downloads and uploads, isn't the bottleneck disk and network bandwidth much more than the number of cores? could something be done to auto-adjust the number of processes?

also when dealing with lee's data i found pyout too constraining by default, especially because the number of jobs it runs is limited by the display of the table. it would be nice if there was an option to not care about a tabular display but just a summary display. it does seem that decoupling download processes from display constraints would be a good step, and may provide a greater ability to optimize it.

jwodder commented 2 years ago

@satra It is indeed possible to configure the maximum number of workers pyout uses when instantiating a table. How exactly do you recommend we determine this value?

jwodder commented 2 years ago

@satra Also, regarding the number of jobs in pyout being limited by the display of the table: It seems that, by default, pyout waits for the top three jobs to complete before adding a new row that would cause the topmost to go off-screen, though this can be configured or disabled.

satra commented 2 years ago

@jwodder - i don't have a good answer on how to optimize, but perhaps a test download would tell you how many threads optimize bandwidth given disk constraints (keep increasing the number of threads/processes till it saturates). on clusters where disks may be on different interfaces, they could even be competing with download bandwidth. it may be worthwhile looking up whether someone has implemented an algorithm for this.

regarding pyout, the issue i ran into was when one of the rows near the top was uploading (say a large file) and everything else was complete, it would not add any more jobs. if the bandwidth was being fully utilized i wouldn't care too much, but in some cases this results in less efficient upload. the other use case is when a lot of small files are being uploaded: the job limit was 10 and could not be increased even though bandwidth wasn't an issue and devices weren't being saturated. hence the notion of removing the tabular display altogether and simply providing the summary display.

between optimizing the number of threads and removing the tabular display, i would prioritize the latter first so that the number of uploads/downloads does not have to be limited to 10.

yarikoptic commented 2 years ago

re jobs -- I think we should adopt what we have for upload already:

  -J, --jobs N[:M]                Number of files to upload in parallel and,
                                  optionally, number of upload threads per
                                  file

so the user could explicitly control how many threads per file (where possible, as in the case of zarr) would be used. Default to e.g. 4 and be "done".

In the longer run, we still have to RF (refactor) all the interfacing so that pyout would not take care of parallelizing and would be used only for reporting/UI.
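
For illustration, parsing such an N[:M] value is straightforward (a hypothetical helper, using the default of 4 threads per file suggested above):

    def parse_jobs(value, default_per_file=4):
        """Parse "N" or "N:M" into (files_in_parallel, threads_per_file)."""
        n, _, m = value.partition(":")
        return int(n), int(m) if m else default_per_file

    assert parse_jobs("6") == (6, 4)
    assert parse_jobs("6:2") == (6, 2)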

jwodder commented 2 years ago

@yarikoptic So you're suggesting adding a --jobs option to dandi download for separately setting the number of "main" download threads and the number of threads per Zarr asset, then?

yarikoptic commented 2 years ago

yes, that is what I am thinking about. In principle we could even thread downloads of regular files via multiple range requests, but I do not see much need for that, so we could just comment that [:M] is in effect only for assets containing multiple files (such as .zarr directories).

jwodder commented 2 years ago

@yarikoptic I've written a library for running multiple iterators in threads and collecting the results. Now comes the bikeshedding part: deciding what combined status dicts should be emitted for a given set of individual-file status dicts.
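
The underlying technique is queue-based fan-in; a generic sketch of it (not the actual library's API):

    from concurrent.futures import ThreadPoolExecutor
    from queue import Queue

    _DONE = object()  # sentinel marking one iterator's exhaustion

    def interleave(iterators):
        """Yield items from all iterators as they become available."""
        iterators = list(iterators)
        q = Queue()

        def produce(it):
            try:
                for item in it:
                    q.put(item)
            finally:
                q.put(_DONE)

        with ThreadPoolExecutor(max_workers=len(iterators) or 1) as pool:
            for it in iterators:
                pool.submit(produce, it)
            remaining = len(iterators)
            while remaining:
                item = q.get()
                if item is _DONE:
                    remaining -= 1
                else:
                    yield item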

A refresher: The columns shown in the download pyout table are "path", "size", "done", "done%", "checksum", "status", and "message". The status dicts yielded for a given path while it's downloading take the following forms, yielded roughly in the order shown:

Hence the following questions & thoughts:

yarikoptic commented 2 years ago
  • Should the "message" for a combined download always be in the form "X downloading, Y errored, Z skipped"? If so, what should happen to the error messages for individual files?

Let's make it "X done, Y errored, Z skipped" with any of those omitted whenever it is just 0.

  • When at least one file is downloading and at least one file has errored, should the combined status be "downloading" or "error"? Likewise when skipped files are added to the mix.

I would keep it "downloading" until we are done with all the files, and then if some failed, yield "error" instead of "done". (note: if any errored, no need to do the checksum I guess)

  • When a download errors, should its contribution to the combined download progress be removed, left where it was, or treated like 100%?

my vote would be for "removed", so the user would get an accurate "done%" for how much of the zarr was downloaded.

  • {"status": "setting mtime"} status dict should probably just be discarded on input and not reported in the combined output.

we will be setting it for every file, right? then indeed not worth yielding, unless maybe for the .zarr/ directory at the end of the process.

  • It'll be simplest if the combined "size" and "done%" only reflect files seen so far rather than reporting the total Zarr size from the start.

"size" should report total zarr (as reported by the server); "done" and "done%" - seen so far.

  • Should the combined "checksum" field be "ok" as long as at least one downloaded file has its checksum verified, or should that only be shown once all checksums are verified?

I think checksum should be reported once for the entire zarr at the end of the download process.
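
Taken together, those answers amount to roughly the following aggregation rule (a hypothetical sketch; the names and signature are illustrative, not dandi-cli internals):

    def combined_status(done, errored, skipped, total, done_bytes, total_bytes):
        parts = [f"{n} {label}"
                 for n, label in [(done, "done"), (errored, "errored"),
                                  (skipped, "skipped")] if n]
        finished = done + errored + skipped == total
        status = ("error" if errored else "done") if finished else "downloading"
        # bytes from errored files are excluded from done_bytes by the caller,
        # so "done%" reflects how much of the zarr actually landed on disk
        return {
            "status": status,
            "message": ", ".join(parts),
            "size": total_bytes,               # total zarr size per the server
            "done": done_bytes,
            "done%": 100 * done_bytes / total_bytes if total_bytes else 0,
        }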

jwodder commented 2 years ago

@yarikoptic I don't think the code should calculate a checksum for the whole Zarr after downloading; digests are already checked for the individual files, and that should be enough. Moreover, checksumming a Zarr would fail if downloading into a pre-existing local Zarr containing files that don't exist in the Zarr being downloaded.

yarikoptic commented 2 years ago

Moreover, checksumming a Zarr would fail if downloading into a pre-existing local Zarr containing files that don't exist in the Zarr being downloaded.

I think it is actually important enough to guarantee that the zarr as downloaded is consistent with what is on the server, to avoid confusion and e.g. reupload of a zarr with some garbage in it. So, as the last step of the zarr download, it would be good to delete local files which are not listed/present in the remote (on-server) copy. And then the checksum should match. I guess we can rely on fscacher to cache the checksumming of individual files to make it reasonably fast.
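
A minimal sketch of that cleanup step (a hypothetical helper, not dandi-cli's actual code), assuming we have the server-side listing of file paths relative to the zarr root:

    from pathlib import Path

    def prune_extra_files(zarr_root, remote_paths):
        """Delete local files under zarr_root absent from the remote listing."""
        root = Path(zarr_root)
        for p in root.rglob("*"):
            if p.is_file() and p.relative_to(root).as_posix() not in remote_paths:
                p.unlink()
        # remove directories left empty by the deletions, deepest first
        for d in sorted((p for p in root.rglob("*") if p.is_dir()),
                        key=lambda p: len(p.parts), reverse=True):
            if not any(d.iterdir()):
                d.rmdir()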

jwodder commented 2 years ago

@yarikoptic Hashing of downloaded files is performed piecemeal as they're being downloaded without using fscacher, so checksumming a downloaded Zarr will have to start from scratch.
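
"Piecemeal" here means the digest is updated chunk by chunk as the bytes are written, so no second pass over the data is needed; a generic illustration (not dandi-cli's actual code):

    import hashlib

    def download_with_digest(chunks, dest_path, expected_md5):
        """Write chunks to dest_path, digesting on the fly; verify at the end."""
        digest = hashlib.md5()
        with open(dest_path, "wb") as f:
            for chunk in chunks:        # iterable of bytes from the server
                f.write(chunk)
                digest.update(chunk)
        if digest.hexdigest() != expected_md5:
            raise RuntimeError(f"md5 mismatch for {dest_path}")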

yarikoptic commented 2 years ago

@yarikoptic Hashing of downloaded files is performed piecemeal as they're being downloaded without using fscacher, so checksumming a downloaded Zarr will have to start from scratch.

so for files downloaded in the current session we would have md5s from the "on the fly" digestion, right? for the others found in the folder we would need to re-digest them from disk. And that is where I thought fscacher would kick in. Or where was I wrong?

jwodder commented 2 years ago

@yarikoptic

so for files downloaded in the current session we would have md5s from the "on the fly" digestion, right?

The digests aren't returned from the download manager; they're just discarded after they're checked.

yarikoptic commented 2 years ago

The digests aren't returned from the download manager; they're just discarded after they're checked.

that's unfortunate. Didn't check the code, but maybe you see a reasonably easy way to still make it possible to get their digests for the overall zarr digest computation "on the fly"? if not -- we will be doomed to re-digest the entire zarr tree in full upon completion.

jwodder commented 2 years ago

@yarikoptic I figured out a way.

satra commented 2 years ago

i will copy the directory over and retry, but perhaps symbolic links should be supported.

$ dandi upload -i dandi-staging sub-MITU01_run-1_sample-178_stain-LEC_chunk-1_spim.ngff
2022-01-21 08:20:05,787 [ WARNING] A newer version (0.34.1) of dandi/dandi-cli is available. You are using 0.34.0+50.gc12f72b
2022-01-21 08:20:33,114 [ WARNING] /mnt/beegfs/satra/zarrhdf/101233/sub-MITU01/ses-20210521h17m17s06/microscopy/sub-MITU01_run-1_sample-178_stain-LEC_chunk-1_spim.ngff: Ignoring unsupported symbolic link to directory
2022-01-21 08:20:33,114 [    INFO] Found 0 files to consider
2022-01-21 08:20:33,126 [    INFO] Logs saved in /home/satra/.cache/dandi-cli/log/20220121132004Z-1470910.log