dandi / dandi-infrastructure

A repository to collect docs/issues on DANDI project infrastructure
Apache License 2.0

Expose dandiarchive as webdav service #166

Closed yarikoptic closed 8 months ago

yarikoptic commented 11 months ago

To facilitate integration with various external projects (e.g., OSDF with Pelican underneath) that can interface with WebDAV services. Treat it as unifying our API into a standard API for file access.

TODOs:

maybe it could be implemented as an independent service, not part of the API.

attn @jwodder -- how much did you play with webdav?


jwodder commented 11 months ago

@yarikoptic

how much did you play with webdav?

None.

satra commented 11 months ago

i would add a check in this issue for whether webdav can be done without egressing data bytes through a server.

yarikoptic commented 11 months ago

i would add a check in this issue for whether webdav can be done without egressing data bytes through a server.

I added a check: and here is a reply from ChatGPT (didn't try yet):

Yes, you can implement a WebDAV backend that does not store or provide the actual file bytes but instead redirects to a target URL elsewhere when a client requests a file. This can be achieved by customizing the behavior of the WebDAV server to handle requests in a way that serves external resources through redirection. Here's a simplified example of how you can create a WebDAV server that redirects requests to external URLs using the `PyWebDAV` library in Python:

```python
from pywebdav.server import DAVServer
from pywebdav.types import DAVError

class RedirectDAVResource(DAVResource):
    def __init__(self, path, redirect_url):
        super().__init__(path)
        self.redirect_url = redirect_url

    def GET(self):
        raise DAVError(302, self.redirect_url)

class MyDAVServer(DAVServer):
    def __init__(self, root_path):
        super().__init__()
        self.root_path = root_path

    def get_resource_inst(self, path):
        # In this example, we return a RedirectDAVResource for all files
        # You can add logic here to determine if a file should be redirected
        return RedirectDAVResource(path, "https://example.com/external/resource")

if __name__ == '__main__':
    server = MyDAVServer('/path/to/your/data/store')
    server.run()
```

In this example, we define a custom `RedirectDAVResource` class that inherits from `DAVResource`. When a `GET` request is made to this resource, it raises a `302` HTTP status code with a `Location` header set to the desired redirection URL. The `get_resource_inst` method of the `MyDAVServer` class returns instances of `RedirectDAVResource` for all requested files, but you can customize this logic to decide which files should be redirected and specify the target URL accordingly.

Please note that this is a basic example, and you can expand on it to meet your specific requirements for redirection, access control, and handling different types of resources. Depending on your use case, you might also want to handle other WebDAV methods (e.g., PUT, DELETE) as needed.
yarikoptic commented 11 months ago

I think the best approach would be to start with an independent service, just to try out feasibility etc., using https://github.com/mar10/wsgidav. Make it so that for now we could just try it locally, and if it works nicely, we would then reassess inclusion into dandi-archive.

@jwodder could you please provide a prototype implementation using https://github.com/mar10/wsgidav, looking at the available backends (e.g., for Mercurial: https://wsgidav.readthedocs.io/en/latest/addons-mercurial.html, and others), so we establish WebDAV over api.dandiarchive.org, exposing it as a read-only WebDAV service with a directory tree.

For individual files, it should forward to the S3 HTTP URL.
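For illustration, forwarding an individual file could amount to building the Archive's asset download URL, which itself answers with a redirect to S3. A minimal sketch (the helper name is made up; the endpoint shape is assumed from the current api.dandiarchive.org API):

```python
def asset_download_url(dandiset_id, version, asset_id,
                       api_base="https://api.dandiarchive.org/api"):
    """Build the Archive download URL for an asset; the API is expected
    to answer this URL with a redirect to the underlying S3 object."""
    return (f"{api_base}/dandisets/{dandiset_id}/versions/{version}"
            f"/assets/{asset_id}/download/")
```

A WebDAV frontend could hand this URL back as a redirect target instead of streaming bytes itself.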

jwodder commented 11 months ago

@yarikoptic Where should this wsgidav instance be deployed?

yarikoptic commented 11 months ago

For now I have no hosting ready for it -- it should be a code/service anyone could just run/try locally. If we see that it works well, we would then either proceed to adapt it within our dandi-archive instance(s) or look into establishing separate hosting for it.

jwodder commented 11 months ago

@yarikoptic

yarikoptic commented 11 months ago

@yarikoptic

  • I'm assuming you want the WebDAV view to be read-only; is that correct?

For this portion - yes! Depending on our success with the read-only facility, we might (much later) want to look into supporting uploads through it too -- that might be quite cool / user-friendly (as long as we can provide feedback etc.).

  • How exactly should assets under a given version be laid out? Should there be a flat listing of all assets in the version, or should the assets be grouped into the directory hierarchy implied by the forward slashes in their paths?

not flat -- it should follow their directory hierarchy, like we have in datalad dandisets and in the files view on dandiarchive.
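For illustration, grouping the flat, slash-separated asset paths returned by the API into such a hierarchy is only a few lines of Python (a sketch; the helper name is made up):

```python
def build_tree(paths):
    """Group flat, slash-separated asset paths into a nested dict:
    directories map to sub-dicts, files map to None."""
    tree = {}
    for path in paths:
        node = tree
        parts = path.split("/")
        # Descend through (and create) intermediate directories
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = None  # leaf entry = file
    return tree
```

For example, `build_tree(["sub-01/sub-01.nwb", "dataset_description.json"])` yields `{"sub-01": {"sub-01.nwb": None}, "dataset_description.json": None}`.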

  • How should Zarr assets be represented? Should they just be directories of entries or something else?

my webdav knowledge is very limited... AFAIK a "directory" in webdav is a collection as well; in other words, I do not know of a way to have different "types" of collections. So zarr assets should indeed just be directories, and paths under them should be redirected to the corresponding paths under the corresponding zarr on S3.

  • What's the point of the dandisets/ path prefix? Is the root of the WebDAV service ever going to contain any entries other than dandisets/?

I thought of it indeed as some kind of future-proofing, since we do have separate "prefix/"es on S3, and if there is demand, we might later want to expose zarrs/ or metadata/ or some other elements.

jwodder commented 11 months ago

@yarikoptic

paths under them should be redirected to corresponding paths under corresponding zarr on S3.

What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects. When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?

yarikoptic commented 11 months ago

@yarikoptic

paths under them should be redirected to corresponding paths under corresponding zarr on S3.

What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects.

hm,

When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?

yes -- entries within the Zarr organized into a directory hierarchy -- pretty much 1-to-1 with how it is laid out on S3. Overall, an example of what redirects/responses we need:

jwodder commented 11 months ago

@yarikoptic The WebDAV protocol is fine with redirects (See, e.g., RFC 4437), but the wsgidav implementation does not seem to support defining redirects.

jwodder commented 11 months ago

@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:

yarikoptic commented 11 months ago

@yarikoptic The WebDAV protocol is fine with redirects (See, e.g., RFC 4437),

phewph, good!

but the wsgidav implementation does not seem to support defining redirects.

:-/ is it possible to just reply with some standard HTTP response there?

@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:

  • While the Archive has an endpoint that groups assets by directory, dandi-cli currently does not support it.

even without adding it to dandi-cli, isn't it just a client.paginate request directly to the API?

since embargoed zarrs are not even supported yet and everything is public, let's just use boto directly to get a listing of the "index" for an S3 prefix.

yarikoptic commented 11 months ago

FWIW, for redirects there was a fresh followup, https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322, confirming that they are not directly supported but possibly relatively easy to add in order to test the idea out.
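As a rough sketch of that idea (names hypothetical; this is not wsgidav API): a plain WSGI middleware in front of the wsgidav app could intercept GET requests for files and answer with a 302, while PROPFIND and collection listings pass through untouched:

```python
def redirect_middleware(app, resolve_url):
    """WSGI middleware: if resolve_url(path) returns a target URL for a
    GET request, answer with a 302 instead of calling the wrapped app
    (e.g. a wsgidav instance); all other methods pass through."""
    def middleware(environ, start_response):
        if environ.get("REQUEST_METHOD") == "GET":
            target = resolve_url(environ.get("PATH_INFO", ""))
            if target is not None:
                start_response("302 Found", [("Location", target),
                                             ("Content-Length", "0")])
                return [b""]
        # Non-GET requests (PROPFIND, OPTIONS, ...) and unresolved
        # paths are handled by the wrapped application
        return app(environ, start_response)
    return middleware
```

Whether clients cope with redirects on GET while the rest of the WebDAV conversation stays on the origin server would still need testing.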

jwodder commented 11 months ago

@yarikoptic

is it possible to just reply with some standard HTTP response there?

Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.

let's just use boto directly to get a listing of the "index" for S3 prefix.

I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.

yarikoptic commented 11 months ago

@yarikoptic

is it possible to just reply with some standard HTTP response there?

Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.

what about the overload approach mentioned in https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322 ?

let's just use boto directly to get a listing of the "index" for S3 prefix.

I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.

nope. I did listing of directories in datalad (now in datalad-deprecated) with old boto, and you can do that quickly (takes less than a second) in the CLI, e.g. with

```
❯ time s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/1/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/2/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/3/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/4/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/5/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/6/
2022-04-21 23:26         7859  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zattrs
2022-02-26 22:22           24  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zgroup
2022-04-21 23:26        14925  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zmetadata
s3cmd -c ~/.s3cfg-dandi-backup ls   0.10s user 0.02s system 19% cpu 0.572 total
```

chatgpt gave the following example code for boto3, which runs in 0.5 sec locally for me (so it is not listing the entire zarr there), and maybe there are even better ways:

```python
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Create a new S3 client with anonymous access
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = 'dandiarchive'
prefix = 'zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/'

def list_directories_and_files(bucket, prefix):
    paginator = s3_client.get_paginator('list_objects_v2')
    result = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/')
    for page in result:
        if "CommonPrefixes" in page:
            for subdir in page['CommonPrefixes']:
                print('Subdirectory: ' + subdir['Prefix'])
        if "Contents" in page:
            for file in page['Contents']:
                if not file['Key'].endswith('/'):
                    print('File: ' + file['Key'])

list_directories_and_files(bucket_name, prefix)
```

```
❯ time python <(xclip -o)
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/1/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/2/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/3/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/4/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/5/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/6/
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zattrs
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zgroup
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zmetadata
python <(xclip -o)  0.22s user 0.03s system 49% cpu 0.512 total
```
jwodder commented 11 months ago

@yarikoptic

what about the overload approach mentioned in https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322 ?

I have no idea how to implement that as a user of wsgidav without forking wsgidav.

yarikoptic commented 9 months ago

update: @jwodder redid in Rust in https://github.com/dandi/dandidav . A sample instance is running at https://dandi.centerforopenneuroscience.org/ (not automatically deployed). Sample external services URLs to try:

Next stage would be deployment:

satra commented 9 months ago

the subdomain can be registered through our aws account route53, but that involves us running the service. has forwarding to s3 for retrieval been implemented? if not, i would at least start with integration at a local level so that anyone could run it. if yes, then proceed with the setup.

yarikoptic commented 9 months ago

has forwarding to s3 for retrieval been implemented?

AFAIK yes. @jwodder can confirm if that is generally so. A known exception is dandiset.yaml (Rust-knowledgeable folks can review the source). Anyways, it would be nice to also add some traffic/load/request stats for that node to see how well it copes under load.

yarikoptic commented 9 months ago

@satra @waxlamp I would like to proceed with moving dandidav deployment into the "official" dandiarchive.org space from its temporary https://dandi.centerforopenneuroscience.org/ .

Please guide us, together with @jwodder, through what we need to do to accomplish this.

satra commented 9 months ago

create a new instance in the aws account or heroku account and add a route53 cname alias for it. there are a few considerations with respect to horizontal scalability, but before we do that, get a basic setup running. also estimate the costs of this service based on the infrastructure you choose. pinging @aaronkanzer, who may be able to help with some considerations depending on choices.

perhaps a devops doc could help you and others in the future as to how to deploy new services. note that once we move over k8s to 2i2c, we will want to use that substrate for future services.

yarikoptic commented 9 months ago

am I correct that heroku would be the better target since it would hard-limit us on resources so we do not break the bank?

satra commented 9 months ago

you can limit things in aws as well (fixed instance, no load balancer, etc.), but it may be quicker/easier in heroku.

waxlamp commented 9 months ago

It seems there are Rust buildpacks for Heroku. As for infrastructure, I think it would be prudent to manage the necessary resources through our Terraform setup.

@mvandenburgh, I think you have the necessary background to look into this and formulate an operations plan. Could you please start with these two questions:

  1. How do we deploy a Rust-based web application on Heroku?
  2. What AWS resources do we need to develop in our TF materials?
jwodder commented 9 months ago

Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run git rev-parse --short HEAD in order to embed the current Git commit in the binary.

yarikoptic commented 9 months ago

may be also @kabilar and @aaronkanzer could help on this end since they are replicating DANDI infrastructure setup ?

aaronkanzer commented 9 months ago

I don't have the bandwidth for a few days, but I might suggest doing a proof-of-concept outside of dandi-infrastructure first, alone in Heroku, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which, fortunately/unfortunately, is not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh

@yarikoptic @jwodder perhaps just provisioning a Heroku dyno and including a Procfile with your Rust app and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on https://github.com/dandi/dandi-infrastructure/issues/166#issuecomment-1958516864)

Then you could have observability and stress-testing in the short-term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL (which eventually we would CNAME in Route53 to webdav.dandiarching.org) with the dyno. If successful, we could append or easily build out more IaC in dandi-infrastructure.

Just some thoughts...

mvandenburgh commented 9 months ago

I don't have the bandwidth for a few days, but I might suggest doing a proof-of-concept outside of dandi-infrastructure first, alone in Heroku, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which, fortunately/unfortunately, is not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh

Thanks @aaronkanzer, I definitely agree it makes sense to do an initial proof of concept outside of Terraform.

@yarikoptic @jwodder perhaps just provisioning a Heroku dyno and including a Procfile with your Rust app and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on #166 (comment))

Then you could have observability and stress-testing in the short-term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL (which eventually we would CNAME in Route53 to webdav.dandiarching.org) with the dyno. If successful, we could append or easily build out more IaC in dandi-infrastructure.

Agreed, I think this is the approach we should take - I'll start out by trying to set this up.

mvandenburgh commented 8 months ago

Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run git rev-parse --short HEAD in order to embed the current Git commit in the binary.

@jwodder Heroku provides these values as environment variables at runtime - https://devcenter.heroku.com/articles/dyno-metadata#dyno-metadata. Is using the HEROKU_SLUG_COMMIT environment variable sufficient here?
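For illustration, the fallback logic being discussed (dandidav itself is Rust; this is a hypothetical Python sketch of the same idea): try `git rev-parse` first, and only fall back to the `HEROKU_SLUG_COMMIT` dyno-metadata variable when no Git information is available.

```python
import os
import subprocess

def current_commit(repo_dir="."):
    """Return the short Git commit hash, falling back to the
    HEROKU_SLUG_COMMIT dyno-metadata variable when the deployment
    did not preserve the .git directory (or git is unavailable)."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Heroku exposes the full 40-char commit SHA; shorten it
        slug = os.environ.get("HEROKU_SLUG_COMMIT")
        return slug[:7] if slug else "unknown"
```

This keeps local builds (where `.git` exists) and Heroku slug builds (where it typically does not) on the same code path.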

jwodder commented 8 months ago

@mvandenburgh I've created a PR to fetch the Git commit from HEROKU_SLUG_COMMIT if no normal Git information is available: https://github.com/dandi/dandidav/pull/95