Closed: yarikoptic closed this issue 8 months ago
@yarikoptic
> how much did you play with webdav?
None.
i would add to this issue a check of whether webdav can be done without egressing data bytes through a server.
I think the best would be to start with an independent service just to try feasibility etc., using https://github.com/mar10/wsgidav, and make it so that for now we could just try it locally; if it works nicely, we would then reassess inclusion into dandi-archive.
@jwodder could you please provide a prototype implementation using https://github.com/mar10/wsgidav, looking at the available backends (e.g. for Mercurial: https://wsgidav.readthedocs.io/en/latest/addons-mercurial.html, and others), so we can establish WebDAV over api.dandiarchive.org and expose it as a read-only WebDAV with the following tree:
dandisets/{dandiset_id}
with the following possible folders under it:
- draft/ - always there, has a tree for the current draft version
- latest/ - present if there was a released version, would have a tree of the most recent version
- releases/ - with {version} subfolders, if there were releases

For individual files, it should forward to the S3 HTTP URL.
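A minimal read-only sketch of such a tree as a custom wsgidav provider might look like the following -- assuming wsgidav >= 4 and its "virtual provider" pattern (DAVProvider/DAVCollection with get_member_names/get_member/resolve); the two helper functions are hypothetical placeholders for calls to https://api.dandiarchive.org, not real dandi-archive or dandi-cli APIs:

```python
from wsgidav.dav_provider import DAVCollection, DAVProvider


def list_dandiset_ids():
    # Hypothetical placeholder: a real version would page through
    # GET https://api.dandiarchive.org/api/dandisets/
    return ["000108"]


def list_version_folders(dandiset_id):
    # Hypothetical placeholder: "latest" and "releases" would only be
    # present if the Dandiset has at least one published version
    return ["draft", "latest", "releases"]


class RootCollection(DAVCollection):
    def __init__(self, environ):
        super().__init__("/", environ)

    def get_member_names(self):
        return ["dandisets"]

    def get_member(self, name):
        if name == "dandisets":
            return DandisetsCollection("/dandisets", self.environ)
        return None


class DandisetsCollection(DAVCollection):
    def get_member_names(self):
        return list_dandiset_ids()

    def get_member(self, name):
        return DandisetCollection(f"{self.path}/{name}", self.environ)


class DandisetCollection(DAVCollection):
    def get_member_names(self):
        return list_version_folders(self.path.rsplit("/", 1)[-1])

    def get_member(self, name):
        # a fuller sketch would descend into the version's asset tree here
        return None


class DandiProvider(DAVProvider):
    """Read-only provider exposing dandisets/{dandiset_id}/{draft,latest,releases}."""

    def get_resource_inst(self, path, environ):
        return RootCollection(environ).resolve("", path)
```

Wiring it up would then be a matter of pointing a WsgiDAVApp configuration's provider_mapping at an instance of this provider, e.g. provider_mapping={"/": DandiProvider()}.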
@yarikoptic Where should this wsgidav instance be deployed?
For now I have no hosting ready for it -- it should be code/a service anyone could just run/try locally. If we see that it works well, we would then either proceed with adapting it within our dandi-archive instance(s) or look into establishing separate hosting for it.
@yarikoptic What's the point of the dandisets/ path prefix? Is the root of the WebDAV service ever going to contain any entries other than dandisets/?
> - I'm assuming you want the WebDAV view to be read-only; is that correct?
For this portion - yes! Depending on our success with the read-only facility, we might (much later) want to look into providing support for uploading through it too -- could be quite cool / user-friendly (as long as we can provide feedback etc.).
> - How exactly should assets under a given version be laid out? Should there be a flat listing of all assets in the version, or should the assets be grouped into the directory hierarchy implied by the forward slashes in their paths?
not flat -- should be according to their directory hierarchy, like we have in datalad dandisets and files view on dandiarchive.
> - How should Zarr assets be represented? Should they just be directories of entries or something else?
my webdav knowledge is very limited... AFAIK "directory" in webdav is a collection as well, or in other words I do not know a way to have some different "types" of collections. So as such -- zarr assets indeed should be just directories, and then paths under them should be redirected to corresponding paths under corresponding zarr on S3.
> - What's the point of the dandisets/ path prefix? Is the root of the WebDAV service ever going to contain any entries other than dandisets/?
I thought of it indeed as some kind of future-proofing, since we do have separate "prefix/"es on S3, and if there were demand we might later want to expose zarrs/ or metadata/ or some other elements.
@yarikoptic
> paths under them should be redirected to corresponding paths under corresponding zarr on S3.
What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects. When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?
@yarikoptic
> paths under them should be redirected to corresponding paths under corresponding zarr on S3.
> What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects.
hm,
> When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?
yes -- entries within the Zarr organized into a directory hierarchy -- pretty much 1-to-1 with how it is on S3. Here is an overall example of the redirects/responses we need:
- dandisets/000108/draft/ -> index of files and folders as in https://github.com/dandisets/000108
- dandisets/000108/draft/dandiset.yaml -> yaml formatted output of GET of https://api.dandiarchive.org/api/dandisets/000108/versions/draft/
- dandisets/000108/draft/dataset_description.json -> 302 redirect to https://dandiarchive.s3.amazonaws.com/blobs/c07/71a/c0771a4f-3483-47e7-821e-b28ac8df46a5
- dandisets/000108/draft/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-NN_run-1_chunk-1_SPIM.ome.zarr/ -> index from s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/
- dandisets/000108/draft/sub-MITU01/ses-20210521h17m17s06/micr/sub-MITU01_ses-20210521h17m17s06_sample-178_stain-NN_run-1_chunk-1_SPIM.ome.zarr/0/.zarray -> 302 redirect to https://dandiarchive.s3.amazonaws.com/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/.zarray

@yarikoptic The WebDAV protocol is fine with redirects (see, e.g., RFC 4437), but the wsgidav implementation does not seem to support defining redirects.
@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:
> @yarikoptic The WebDAV protocol is fine with redirects (see, e.g., RFC 4437),
phewph, good!
> but the wsgidav implementation does not seem to support defining redirects.
:-/ is it possible to just reply with some standard HTTP response there?
> @yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:
> - While the Archive has an endpoint that groups assets by directory, dandi-cli currently does not support it.
even without adding it to dandi-cli, is it much more than just a client.paginate request directly to the API?
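For illustration, such a request could look like the following sketch -- assuming the grouping endpoint lives at /dandisets/{id}/versions/{version}/assets/paths/ and accepts a path_prefix query parameter (both assumptions about the dandi-archive API); DandiAPIClient and its paginate() helper are the dandi-cli pieces referenced above:

```python
from dandi.dandiapi import DandiAPIClient


def list_folder(dandiset_id: str, version: str, prefix: str = ""):
    """Yield the folder/asset entries directly under `prefix` for one version."""
    with DandiAPIClient.for_dandi_instance("dandi") as client:
        # paginate() walks the paginated JSON pages of the API for us
        yield from client.paginate(
            f"/dandisets/{dandiset_id}/versions/{version}/assets/paths/"
            f"?path_prefix={prefix}"
        )


for entry in list_folder("000108", "draft"):
    print(entry)
```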
> - The Archive currently does not group Zarr entries by directory; support for doing so was removed in "Use flat file listing in zarr file browser" (dandi-archive#1394).
since embargoed zarrs are not even supported yet, everything is public, let's just use boto directly to get a listing of the "index" for S3 prefix.
FWIW, for redirects there was a fresh followup (https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322) confirming that it is not directly supported but possibly relatively easy to add, to test the idea out.
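Independently of whatever wsgidav itself grows, one generic way to experiment with redirects without patching wsgidav would be plain WSGI middleware in front of the app, answering GETs for file paths with a 302 before wsgidav ever sees them. A sketch, with resolve_s3_url() as a hypothetical helper that would map a DAV path to its S3 URL via the archive API:

```python
def resolve_s3_url(path):
    # Hypothetical placeholder: return the S3 URL for an asset/zarr-entry path,
    # or None for collections and anything that should fall through to wsgidav.
    return None


class S3RedirectMiddleware:
    """Wrap a WSGI app (e.g. WsgiDAVApp) and 302-redirect file GETs to S3."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if environ.get("REQUEST_METHOD") == "GET":
            target = resolve_s3_url(environ.get("PATH_INFO", ""))
            if target is not None:
                start_response("302 Found", [("Location", target)])
                return [b""]
        return self.app(environ, start_response)
```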
@yarikoptic
> is it possible to just reply with some standard HTTP response there?
Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.
> let's just use boto directly to get a listing of the "index" for S3 prefix.
I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.
@yarikoptic
> is it possible to just reply with some standard HTTP response there?
> Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.
what about the overload approach mentioned in https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322 ?
> let's just use boto directly to get a listing of the "index" for S3 prefix.
> I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.
nope. I did listing of directories in datalad (now in datalad-deprecated) with old boto, and you can do that quickly (takes no time -- less than a sec), e.g. with
```
❯ time s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/
                    DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/
                    DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/1/
                    DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/2/
                    DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/3/
                    DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/4/
                    DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/5/
                    DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/6/
2022-04-21 23:26     7859  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zattrs
2022-02-26 22:22       24  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zgroup
2022-04-21 23:26    14925  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zmetadata
s3cmd -c ~/.s3cfg-dandi-backup ls  0.10s user 0.02s system 19% cpu 0.572 total
```
in cli.
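For reference, the same listing is available programmatically: boto3's list_objects_v2 with Delimiter="/" returns the "directories" as CommonPrefixes, which is all the WebDAV view needs. The bucket and prefix below are the ones from the s3cmd example above; the dandiarchive bucket is public, so unsigned requests suffice:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# the dandiarchive bucket is public, so unsigned requests are enough
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(
    Bucket="dandiarchive",
    Prefix="zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/",
    Delimiter="/",
)
for pre in resp.get("CommonPrefixes", []):   # the "DIR" entries from s3cmd
    print("DIR", pre["Prefix"])
for obj in resp.get("Contents", []):         # the plain keys at this level
    print(obj["LastModified"], obj["Size"], obj["Key"])
# for prefixes with more than 1000 entries, use the list_objects_v2 paginator
```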
@yarikoptic
> what about the overload approach mentioned in https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322 ?
I have no idea how to implement that as a user of wsgidav without forking wsgidav.
update: @jwodder redid it in Rust in https://github.com/dandi/dandidav . A sample instance is running at https://dandi.centerforopenneuroscience.org/ (not automatically deployed). Sample external-service URLs to try:
Next stage would be deployment: webdav.dandiarchive.org within our infrastructure, so configuration is centralized etc.

the subdomain can be registered through our aws account's route53, but that involves us running the service. has forwarding to s3 for retrieval been implemented? if not, i would at least start with integrating it at a local level so that anyone could run it. if yes, then proceed with the setup.
> has forwarding to s3 for retrieval been implemented?
AFAIK yes. @jwodder can confirm if that is generally so. A known exception is dandiset.yaml (Rust-knowledgeable folks can review the source). Anyways, it would be nice to also add some traffic/load/requests stats for that node to see how well it copes under load.
@satra @waxlamp I would like to proceed with moving dandidav deployment into the "official" dandiarchive.org space from its temporary https://dandi.centerforopenneuroscience.org/ .
Please guide @jwodder and me through what we need to do to accomplish the drill.
create a new instance in the aws account or heroku account and add a route53 cname alias for it. ideally there are a few considerations with respect to horizontal scalability to address, but before we do that, get a basic setup running. also estimate the costs of this service based on the infrastructure you choose. pinging @aaronkanzer who may be able to help with some considerations depending on choices.
perhaps a devops doc could help you and others in the future as to how to deploy new services. note that once we move over k8s to 2i2c, we will want to use that substrate for future services.
am I correct that heroku would be a better target, since it would hard-limit us on resources so we do not break the bank?
you can limit things in aws as well (fixed instance, no load balancer, etc.). but it may be quicker/easier in heroku.
It seems there are Rust buildpacks for Heroku. As for infrastructure, I think it would be prudent to manage the necessaries through our Terraform setup.
@mvandenburgh, I think you have the necessary background to look into this and formulate an operations plan. Could you please start with these two questions:
Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run `git rev-parse --short HEAD` in order to embed the current Git commit in the binary.
maybe also @kabilar and @aaronkanzer could help on this end, since they are replicating the DANDI infrastructure setup?
I don't have the bandwidth for a few days, but I might suggest doing a proof-of-concept outside of dandi-infrastructure first, alone in Heroku, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which is, fortunately/unfortunately, not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh
@yarikoptic @jwodder perhaps just provisioning a Heroku dyno and including a Procfile with your Rust app and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on https://github.com/dandi/dandi-infrastructure/issues/166#issuecomment-1958516864)

Then you could have observability and stress-testing in the short term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL (which eventually we would CNAME in Route53 to webdav.dandiarchive.org) with the dyno. If successful, we could append or easily build out more IaC in dandi-infrastructure.
Just some thoughts...
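For concreteness, the Procfile mentioned above could be a single line along these lines; the binary path (as laid out by a Rust buildpack) and the --port flag are assumptions here, with Heroku supplying $PORT at runtime:

```
web: ./target/release/dandidav --port $PORT
```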
> I don't have the bandwidth for a few days, but I might suggest doing a proof-of-concept outside of dandi-infrastructure first, alone in Heroku, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which is, fortunately/unfortunately, not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh
Thanks @aaronkanzer, I definitely agree it makes sense to do an initial proof of concept outside of Terraform.
> @yarikoptic @jwodder perhaps just provisioning a Heroku dyno and including a Procfile with your Rust app and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on #166 (comment))
> Then you could have observability and stress-testing in the short term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL (which eventually we would CNAME in Route53 to webdav.dandiarchive.org) with the dyno. If successful, we could append or easily build out more IaC in dandi-infrastructure.
Agreed, I think this is the approach we should take - I'll start out by trying to set this up.
> Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run `git rev-parse --short HEAD` in order to embed the current Git commit in the binary.
@jwodder Heroku provides these values as environment variables at runtime - https://devcenter.heroku.com/articles/dyno-metadata#dyno-metadata. Is using the `HEROKU_SLUG_COMMIT` environment variable sufficient here?
@mvandenburgh I've created a PR to fetch the Git commit from `HEROKU_SLUG_COMMIT` if no normal Git information is available: https://github.com/dandi/dandidav/pull/95
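(dandidav itself is Rust and the PR above is the authoritative change; the fallback it describes amounts to something like this sketch, shown here in Python for illustration: prefer `git rev-parse --short HEAD`, and fall back to Heroku's documented HEROKU_SLUG_COMMIT dyno-metadata variable when the deployed slug has no .git directory.)

```python
import os
import subprocess


def current_commit() -> str | None:
    """Return a short commit hash for embedding in a build/version string."""
    try:
        proc = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return proc.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Heroku dyno metadata: full 40-character SHA of the deployed slug
        slug = os.environ.get("HEROKU_SLUG_COMMIT")
        return slug[:7] if slug else None
```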
To facilitate integration with various external projects (e.g., OSDF with Pelican underneath) which can interface with WebDAV services. Treat it as unification of our API to a standard API for file access.
TODOs:
- /paths endpoint, but likely not since that one is dandiset-specific; maybe it could be at the top level of the API
- maybe it could be implemented as an independent service, not part of the API
attn @jwodder -- how much did you play with webdav?
edits:
- dandisets/ and zarrs/ (not zarr/) folders (problem for zarr/ though -- heavy folder listing).
- {dandiset_id}:
  - draft/, released/ (or latest/ - most recent release), releases/{VERSION}. It kinda makes it require draft/ or another prefix folder to get to content, but it is consistent so I like it most.
  - draft/, released/ (or latest/) and then all versions at the same level. kinda ok, but since numbered releases would most likely sort first and oldest first -- I think it would be not that convenient of a default view...
  - dandi://INSTANCE/DANDISET_ID[@VERSION][/PATH] and thus incorporating the version into the folder name for the dandiset -- IMHO ugly