dandi / dandi-infrastructure

A repository to collect docs/issues on DANDI project infrastructure
Apache License 2.0

Expose dandiarchive as webdav service #166

Closed yarikoptic closed 8 months ago

yarikoptic commented 11 months ago

To facilitate integration with various external projects (e.g., OSDF with Pelican underneath) that can interface with WebDAV services. Treat it as unifying our API into a standard API for file access.

TODOs:

maybe it could be implemented as an independent service, not part of the API.

attn @jwodder -- how much did you play with webdav?


jwodder commented 11 months ago

@yarikoptic

how much did you play with webdav?

None.

satra commented 11 months ago

i would add a check in this issue for whether webdav can be done without egressing data bytes through a server.

yarikoptic commented 11 months ago

i would add a check in this issue for whether webdav can be done without egressing data bytes through a server.

I added a check: and here is a reply from ChatGPT (didn't try yet):

Yes, you can implement a WebDAV backend that does not store or provide the actual file bytes but instead redirects to a target URL elsewhere when a client requests a file. This can be achieved by customizing the behavior of the WebDAV server to handle requests in a way that serves external resources through redirection. Here's a simplified example of how you can create a WebDAV server that redirects requests to external URLs using the `PyWebDAV` library in Python:

```python
from pywebdav.server import DAVServer
from pywebdav.types import DAVError

class RedirectDAVResource(DAVResource):
    def __init__(self, path, redirect_url):
        super().__init__(path)
        self.redirect_url = redirect_url

    def GET(self):
        raise DAVError(302, self.redirect_url)

class MyDAVServer(DAVServer):
    def __init__(self, root_path):
        super().__init__()
        self.root_path = root_path

    def get_resource_inst(self, path):
        # In this example, we return a RedirectDAVResource for all files
        # You can add logic here to determine if a file should be redirected
        return RedirectDAVResource(path, "https://example.com/external/resource")

if __name__ == '__main__':
    server = MyDAVServer('/path/to/your/data/store')
    server.run()
```

In this example, we define a custom `RedirectDAVResource` class that inherits from `DAVResource`. When a `GET` request is made to this resource, it raises a `302` HTTP status code with a `Location` header set to the desired redirection URL. The `get_resource_inst` method of the `MyDAVServer` class returns instances of `RedirectDAVResource` for all requested files, but you can customize this logic to decide which files should be redirected and specify the target URL accordingly.

Please note that this is a basic example, and you can expand on it to meet your specific requirements for redirection, access control, and handling different types of resources. Depending on your use case, you might also want to handle other WebDAV methods (e.g., PUT, DELETE) as needed.
yarikoptic commented 11 months ago

I think the best approach would be to start with an independent service, just to try out feasibility etc., using https://github.com/mar10/wsgidav. Make it so that for now we could just try it locally, and if it works nicely, we would then reassess inclusion into dandi-archive.

@jwodder could you please provide a prototype implementation using https://github.com/mar10/wsgidav, looking at the available backends (e.g., for Mercurial: https://wsgidav.readthedocs.io/en/latest/addons-mercurial.html, and others), so we establish WebDAV over api.dandiarchive.org, exposing it as a read-only WebDAV service with a directory tree.

For individual files, it should forward to the S3 HTTP URL.
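For illustration, forwarding an individual file could amount to building the Archive's asset download URL, which itself answers with a redirect to S3. A minimal sketch (the helper name is made up; the endpoint shape is assumed from the current api.dandiarchive.org API):

```python
def asset_download_url(dandiset_id, version, asset_id,
                       api_base="https://api.dandiarchive.org/api"):
    """Build the Archive download URL for an asset; the API is expected
    to answer this URL with a redirect to the underlying S3 object."""
    return (f"{api_base}/dandisets/{dandiset_id}/versions/{version}"
            f"/assets/{asset_id}/download/")
```

A WebDAV frontend could hand this URL back as a redirect target instead of streaming bytes itself.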

jwodder commented 11 months ago

@yarikoptic Where should this wsgidav instance be deployed?

yarikoptic commented 11 months ago

For now I have no hosting ready for it -- it should be a code/service anyone could just run/try locally. If we see that it works well, we would then either proceed to adapt it within our dandi-archive instance(s) or look into establishing separate hosting for it.

jwodder commented 11 months ago

@yarikoptic

yarikoptic commented 11 months ago

@yarikoptic

  • I'm assuming you want the WebDAV view to be read-only; is that correct?

For this portion - yes! Depending on our success with the read-only facility, we might (much later) want to look into supporting uploads through it too -- that might be quite cool / user-friendly (as long as we can provide feedback etc.).

  • How exactly should assets under a given version be laid out? Should there be a flat listing of all assets in the version, or should the assets be grouped into the directory hierarchy implied by the forward slashes in their paths?

not flat -- it should follow their directory hierarchy, like we have in datalad dandisets and in the files view on dandiarchive.
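For illustration, grouping the flat, slash-separated asset paths returned by the API into such a hierarchy is only a few lines of Python (a sketch; the helper name is made up):

```python
def build_tree(paths):
    """Group flat, slash-separated asset paths into a nested dict:
    directories map to sub-dicts, files map to None."""
    tree = {}
    for path in paths:
        node = tree
        parts = path.split("/")
        # Descend through (and create) intermediate directories
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = None  # leaf entry = file
    return tree
```

For example, `build_tree(["sub-01/sub-01.nwb", "dataset_description.json"])` yields `{"sub-01": {"sub-01.nwb": None}, "dataset_description.json": None}`.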

  • How should Zarr assets be represented? Should they just be directories of entries or something else?

my webdav knowledge is very limited... AFAIK a "directory" in webdav is a collection as well; in other words, I do not know of a way to have different "types" of collections. So zarr assets should indeed just be directories, and paths under them should be redirected to the corresponding paths under the corresponding zarr on S3.

  • What's the point of the dandisets/ path prefix? Is the root of the WebDAV service ever going to contain any entries other than dandisets/?

I thought of it indeed as some kind of future-proofing, since we do have separate "prefix/"es on S3, and if there is demand, we might later want to expose zarrs/ or metadata/ or some other elements.

jwodder commented 11 months ago

@yarikoptic

paths under them should be redirected to corresponding paths under corresponding zarr on S3.

What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects. When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?

yarikoptic commented 11 months ago

@yarikoptic

paths under them should be redirected to corresponding paths under corresponding zarr on S3.

What exactly do you mean by this? wsgidav doesn't provide any functionality for redirects.

hm,

When a user navigates to a Zarr in a WebDAV client, do you want them to see a listing of the entries within the Zarr (possibly organized into a directory hierarchy) or something else?

yes -- entries within the Zarr organized into a directory hierarchy -- pretty much 1-to-1 with how it is laid out on S3. Overall, an example of what redirects/responses we need:

jwodder commented 11 months ago

@yarikoptic The WebDAV protocol is fine with redirects (See, e.g., RFC 4437), but the wsgidav implementation does not seem to support defining redirects.

jwodder commented 11 months ago

@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:

yarikoptic commented 11 months ago

@yarikoptic The WebDAV protocol is fine with redirects (See, e.g., RFC 4437),

phewph, good!

but the wsgidav implementation does not seem to support defining redirects.

:-/ is it possible to just reply with some standard HTTP response there?

@yarikoptic Note that organizing assets and Zarr entries into directory hierarchies is not going to be trivial:

  • While the Archive has an endpoint that groups assets by directory, dandi-cli currently does not support it.

even without adding it to dandi-cli, isn't it just a client.paginate request directly to the API?

since embargoed zarrs are not even supported yet and everything is public, let's just use boto directly to get a listing of the "index" for an S3 prefix.

yarikoptic commented 11 months ago

FWIW, for redirects there was a fresh followup, https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322, confirming that they are not directly supported but possibly relatively easy to add in order to test the idea out.
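As a rough sketch of that idea (names hypothetical; this is not wsgidav API): a plain WSGI middleware in front of the wsgidav app could intercept GET requests for files and answer with a 302, while PROPFIND and collection listings pass through untouched:

```python
def redirect_middleware(app, resolve_url):
    """WSGI middleware: if resolve_url(path) returns a target URL for a
    GET request, answer with a 302 instead of calling the wrapped app
    (e.g. a wsgidav instance); all other methods pass through."""
    def middleware(environ, start_response):
        if environ.get("REQUEST_METHOD") == "GET":
            target = resolve_url(environ.get("PATH_INFO", ""))
            if target is not None:
                start_response("302 Found", [("Location", target),
                                             ("Content-Length", "0")])
                return [b""]
        # Non-GET requests (PROPFIND, OPTIONS, ...) and unresolved
        # paths are handled by the wrapped application
        return app(environ, start_response)
    return middleware
```

Whether clients cope with redirects on GET while the rest of the WebDAV conversation stays on the origin server would still need testing.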

jwodder commented 11 months ago

@yarikoptic

is it possible to just reply with some standard HTTP response there?

Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.

let's just use boto directly to get a listing of the "index" for S3 prefix.

I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.

yarikoptic commented 11 months ago

@yarikoptic

is it possible to just reply with some standard HTTP response there?

Custom providers have no control over HTTP responses. The library just calls them to get lists of resources and their properties.

what about the overload approach mentioned in https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322 ?

let's just use boto directly to get a listing of the "index" for S3 prefix.

I don't believe boto or S3 have any concept of directories. File paths are just opaque keys, and any slashes the user puts in them are purely a way for the user to organize things.

nope. I did listing of directories in datalad (now in datalad-deprecated) with old boto, and you can do that quickly (takes less than a second) in the CLI, e.g. with

```
❯ time s3cmd -c ~/.s3cfg-dandi-backup ls s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/1/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/2/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/3/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/4/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/5/
                          DIR  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/6/
2022-04-21 23:26         7859  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zattrs
2022-02-26 22:22           24  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zgroup
2022-04-21 23:26        14925  s3://dandiarchive/zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zmetadata
s3cmd -c ~/.s3cfg-dandi-backup ls   0.10s user 0.02s system 19% cpu 0.572 total
```

chatgpt gave the following example code for boto3, which runs in 0.5 sec locally for me (so it is not listing the entire zarr there), and maybe there are even better ways:

```python
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Create a new S3 client with anonymous access
s3_client = boto3.client('s3', config=Config(signature_version=UNSIGNED))

bucket_name = 'dandiarchive'
prefix = 'zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/'

def list_directories_and_files(bucket, prefix):
    paginator = s3_client.get_paginator('list_objects_v2')
    result = paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/')
    for page in result:
        if "CommonPrefixes" in page:
            for subdir in page['CommonPrefixes']:
                print('Subdirectory: ' + subdir['Prefix'])
        if "Contents" in page:
            for file in page['Contents']:
                if not file['Key'].endswith('/'):
                    print('File: ' + file['Key'])

list_directories_and_files(bucket_name, prefix)
```

```
❯ time python <(xclip -o)
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/0/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/1/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/2/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/3/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/4/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/5/
Subdirectory: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/6/
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zattrs
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zgroup
File: zarr/8b0493dd-32d6-4f75-8ca2-b57b28fc9695/.zmetadata
python <(xclip -o)  0.22s user 0.03s system 49% cpu 0.512 total
```
jwodder commented 11 months ago

@yarikoptic

what about the overload approach mentioned in https://github.com/mar10/wsgidav/issues/303#issuecomment-1854559322 ?

I have no idea how to implement that as a user of wsgidav without forking wsgidav.

yarikoptic commented 9 months ago

update: @jwodder redid in Rust in https://github.com/dandi/dandidav . A sample instance is running at https://dandi.centerforopenneuroscience.org/ (not automatically deployed). Sample external services URLs to try:

Next stage would be deployment:

satra commented 9 months ago

the subdomain can be registered through our aws account route53, but that involves us running the service. has forwarding to s3 for retrieval been implemented? if not, i would at least start with integration at a local level so that anyone could run it. if yes, then proceed with the setup.

yarikoptic commented 9 months ago

has forwarding to s3 for retrieval been implemented?

AFAIK yes. @jwodder can confirm if that is generally so. A known exception is dandiset.yaml (Rust-knowledgeable folks can review the source). Anyways, it would be nice to also add some traffic/load/request stats for that node to see how well it copes under load.

yarikoptic commented 9 months ago

@satra @waxlamp I would like to proceed with moving dandidav deployment into the "official" dandiarchive.org space from its temporary https://dandi.centerforopenneuroscience.org/ .

Please guide us, together with @jwodder, through what we need to do to accomplish this.

satra commented 9 months ago

create a new instance in the aws account or heroku account and add a route53 cname alias for it. there are a few considerations with respect to horizontal scalability, but before we do that, get a basic setup running. also estimate the costs of this service based on the infrastructure you choose. pinging @aaronkanzer, who may be able to help with some considerations depending on choices.

perhaps a devops doc could help you and others in the future as to how to deploy new services. note that once we move over k8s to 2i2c, we will want to use that substrate for future services.

yarikoptic commented 9 months ago

am I correct that heroku would be the better target since it would hard-limit us on resources so we do not break the bank?

satra commented 9 months ago

you can limit things in aws as well (fixed instance, no load balancer, etc.), but it may be quicker/easier in heroku.

waxlamp commented 9 months ago

It seems there are Rust buildpacks for Heroku. As for infrastructure, I think it would be prudent to manage the necessary resources through our Terraform setup.

@mvandenburgh, I think you have the necessary background to look into this and formulate an operations plan. Could you please start with these two questions:

  1. How do we deploy a Rust-based web application on Heroku?
  2. What AWS resources do we need to develop in our TF materials?
jwodder commented 9 months ago

Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run git rev-parse --short HEAD in order to embed the current Git commit in the binary.

yarikoptic commented 9 months ago

may be also @kabilar and @aaronkanzer could help on this end since they are replicating DANDI infrastructure setup ?

aaronkanzer commented 9 months ago

I don't have the bandwidth for a few days, but I might suggest doing a proof-of-concept outside of dandi-infrastructure first, alone in Heroku, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which, fortunately/unfortunately, is not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh

@yarikoptic @jwodder perhaps just provisioning a Heroku dyno and including a Procfile with your Rust app and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on https://github.com/dandi/dandi-infrastructure/issues/166#issuecomment-1958516864)

Then you could have observability and stress-testing in the short-term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL (which eventually we would CNAME in Route53 to webdav.dandiarching.org) with the dyno. If successful, we could append or easily build out more IaC in dandi-infrastructure.

Just some thoughts...

mvandenburgh commented 9 months ago

I don't have the bandwidth for a few days, but I might suggest doing a proof-of-concept outside of dandi-infrastructure first, alone in Heroku, since dandi-infrastructure is quite coupled with the Girder Terraform submodule (which, fortunately/unfortunately, is not overridable in Terraform land) -- feel free to correct me @waxlamp @mvandenburgh

Thanks @aaronkanzer, I definitely agree it makes sense to do an initial proof of concept outside of Terraform.

@yarikoptic @jwodder perhaps just provisioning a Heroku dyno and including a Procfile with your Rust app and pushing it to the corresponding Heroku dyno would be good enough? (Just expanding on #166 (comment))

Then you could have observability and stress-testing in the short-term in a more production-ish setting -- Heroku fortunately provisions an HTTPS-ready API URL (which eventually we would CNAME in Route53 to webdav.dandiarching.org) with the dyno. If successful, we could append or easily build out more IaC in dandi-infrastructure.

Agreed, I think this is the approach we should take - I'll start out by trying to set this up.

mvandenburgh commented 8 months ago

Side note: I'd appreciate it if the deployment procedure preserved enough of the Git history for it to be possible to run git rev-parse --short HEAD in order to embed the current Git commit in the binary.

@jwodder Heroku provides these values as environment variables at runtime - https://devcenter.heroku.com/articles/dyno-metadata#dyno-metadata. Is using the HEROKU_SLUG_COMMIT environment variable sufficient here?
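For illustration, the fallback logic being discussed (dandidav itself is Rust; this is a hypothetical Python sketch of the same idea): try `git rev-parse` first, and only fall back to the `HEROKU_SLUG_COMMIT` dyno-metadata variable when no Git information is available.

```python
import os
import subprocess

def current_commit(repo_dir="."):
    """Return the short Git commit hash, falling back to the
    HEROKU_SLUG_COMMIT dyno-metadata variable when the deployment
    did not preserve the .git directory (or git is unavailable)."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Heroku exposes the full 40-char commit SHA; shorten it
        slug = os.environ.get("HEROKU_SLUG_COMMIT")
        return slug[:7] if slug else "unknown"
```

This keeps local builds (where `.git` exists) and Heroku slug builds (where it typically does not) on the same code path.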

jwodder commented 8 months ago

@mvandenburgh I've created a PR to fetch the Git commit from HEROKU_SLUG_COMMIT if no normal Git information is available: https://github.com/dandi/dandidav/pull/95