PelicanPlatform / xrootd-s3-http

An XRootD plugin that allows Pelican to interface with s3/http server backends
Apache License 2.0

Implement directory listing #35

Closed bbockelm closed 5 months ago

bbockelm commented 6 months ago

Implements the ListObjectsV2 query against the S3 endpoint, allowing XRootD to interpret a directory-like structure in S3 as an XRootD directory.
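For readers unfamiliar with ListObjectsV2, here is a rough Python sketch of how a delimiter-based listing response can be read as a directory: `CommonPrefixes` entries act like subdirectories and `Contents` entries act like files. The sample XML, the bucket name `example-bucket`, and the helper `list_directory` are made up for illustration; this is not the plugin's C++ code.

```python
import xml.etree.ElementTree as ET

# XML namespace used by S3 list responses.
NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}

# Hypothetical, hand-written response for prefix "cells/" with delimiter "/".
SAMPLE = """<ListBucketResult xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Name>example-bucket</Name>
  <Prefix>cells/</Prefix>
  <Delimiter>/</Delimiter>
  <IsTruncated>false</IsTruncated>
  <Contents>
    <Key>cells/README.txt</Key>
    <Size>120</Size>
  </Contents>
  <CommonPrefixes>
    <Prefix>cells/muscle-ibm/</Prefix>
  </CommonPrefixes>
</ListBucketResult>"""


def list_directory(xml_text, prefix):
    """Return (name, kind, size) tuples for one 'directory' level."""
    root = ET.fromstring(xml_text)
    entries = []
    # Each common prefix is everything up to the next delimiter: a subdirectory.
    for cp in root.findall("s3:CommonPrefixes/s3:Prefix", NS):
        entries.append((cp.text[len(prefix):].rstrip("/"), "dir", 0))
    # Each object directly under the prefix is a file.
    for obj in root.findall("s3:Contents", NS):
        key = obj.findtext("s3:Key", namespaces=NS)
        if key == prefix:
            continue  # skip a placeholder object named like the directory itself
        size = int(obj.findtext("s3:Size", namespaces=NS))
        entries.append((key[len(prefix):], "file", size))
    return entries


for entry in list_directory(SAMPLE, "cells/"):
    print(entry)
```

A real listing would also need to follow `NextContinuationToken` when `IsTruncated` is true; that is left out here to keep the sketch short.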

This has been lightly tested against public S3 buckets using curl queries like this:

$ curl -k -H 'Depth: 1' -X PROPFIND https://f4hp7ql65f.local:1094/test/cells/muscle-ibm/endothelial-stromal-cells/ -d @$HOME/.config/xrootd/prop_query

where the query file is as follows:

$ cat ~/.config/xrootd/prop_query 
<d:propfind xmlns:d='DAV:'>
  <d:prop>
    <d:displayname/>
    <d:resourcetype/>
    <d:getcontentlength/>
    <d:getcontenttype/>
    <d:getetag/>
    <d:getlastmodified/>
  </d:prop>
</d:propfind>

This is the first use of query parameters in the signature creation code -- that might need some attention. I also did some modest refactoring around how the requests are passed (mostly to avoid repeating all the configuration strings). Path-style URLs still need testing, as do simple GET/PUT of data to look for regressions.
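To make the signing concern concrete: AWS Signature Version 4 expects the query parameters in a canonical form (each name and value URI-encoded, then sorted by name) before they go into the string that gets signed. A minimal Python sketch of that step, assuming the four listing parameters that show up in the request logs on this PR; the helper name `canonical_query_string` is hypothetical and this is not the plugin's code.

```python
from urllib.parse import quote


def canonical_query_string(params):
    """Build a SigV4-style canonical query string from a dict of parameters."""
    # Encode everything except the unreserved characters A-Z a-z 0-9 - _ . ~
    encoded = [
        (quote(str(name), safe="-_.~"), quote(str(value), safe="-_.~"))
        for name, value in params.items()
    ]
    # Sort by encoded name (and by value for repeated names).
    encoded.sort()
    return "&".join(f"{name}={value}" for name, value in encoded)


# Parameters deliberately given out of order to show the sorting.
params = {
    "list-type": 2,
    "prefix": "noaa-wod-pds/MD5SUMS",
    "delimiter": "/",
    "max-keys": 1000,
}
print(canonical_query_string(params))
# prints: delimiter=%2F&list-type=2&max-keys=1000&prefix=noaa-wod-pds%2FMD5SUMS
```

The point to check in review is that the string used for signing and the query string actually sent on the wire are encoded identically; any mismatch shows up as a signature error from S3.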

bbockelm commented 6 months ago

Forgot to mention -- here's the xrootd configuration in use:

s3.begin
s3.path_name        /test
s3.bucket_name      genome-browser
s3.service_name     s3.amazonaws.com
s3.region           us-east-1
s3.service_url      https://s3.us-east-1.amazonaws.com
s3.url_style        virtual
s3.end

jhiemstrawisc commented 6 months ago

Further testing shows this also breaks more significantly. With a minimal config of:

all.export  /
xrd.protocol http:8443 libXrdHttp.so
ofs.osslib /workspaces/pelican_xrootd_s3/xrootd-s3-http/build/libXrdS3.so
xrootd.async off

s3.url_style path
s3.begin
s3.path_name /aws-opendata
s3.service_name s3
s3.region us-east-1
s3.service_url https://s3.us-east-1.amazonaws.com
s3.end

The first GET test I usually run results in a segfault. From curl:

curl -v http://`hostname`:8443/aws-opendata/noaa-wod-pds/MD5SUMS
*   Trying 172.17.0.4:8443...
* Connected to 0f6eebf9123e (172.17.0.4) port 8443 (#0)
> GET /aws-opendata/noaa-wod-pds/MD5SUMS HTTP/1.1
> Host: 0f6eebf9123e:8443
> User-Agent: curl/7.76.1
> Accept: */*
> 
* Empty reply from server
* Closing connection 0
curl: (52) Empty reply from server

And from the server:

s3_Stat: Stat'ing path /aws-opendata/noaa-wod-pds/MD5SUMS
240524 20:07:02 68098 s3_SendRequest: Sending HTTP request https://.s3.us-east-1.amazonaws.com/?delimiter=%2F&list-type=2&max-keys=1000&prefix=noaa-wod-pds%2FMD5SUMS
240524 20:07:02 68098 s3_Stat: Failed to stat path /aws-opendata/noaa-wod-pds/MD5SUMS; response code 0
240524 20:07:02 68098 ofs_stat: unknown.1:29@0f6eebf9123e Unable to locate /aws-opendata/noaa-wod-pds/MD5SUMS; input/output error
240524 20:07:02 68098 http_Req:  XrdHttpReq::Error
240524 20:07:02 68098 unknown.1:29@0f6eebf9123e http_Req: PostProcessHTTPReq req: 2 reqstate: 0 final_:False
240524 20:07:02 68098 unknown.1:29@0f6eebf9123e http_Req: PostProcessHTTPReq mapping Xrd error [3007] to status code [500]
240524 20:07:02 68098 unknown.1:29@0f6eebf9123e http_Protocol:  Process. lp:(nil) reqstate: 0
240524 20:07:02 68098 unknown.1:29@0f6eebf9123e http_Req: No checksum requested; skipping to request state 2
240524 20:07:02 68098 unknown.1:29@0f6eebf9123e http_Protocol: Process is exiting rc:0
240524 20:07:02 68098 unknown.1:29@0f6eebf9123e ofs_open: 0-600 (600) fn=/aws-opendata/noaa-wod-pds/MD5SUMS
240524 20:07:02 68098 s3_S3File::Open: Opening file /aws-opendata/noaa-wod-pds/MD5SUMS
Segmentation fault (core dumped)

Are the new URL queries supposed to be added on a basic GET like this? It also looks like the URL generated in s3_SendRequest isn't coming out correctly, because exporting an entire S3 endpoint requires not setting a bucket.
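One way to picture the no-bucket case, as a hedged Python sketch: when no `s3.bucket_name` is set, the first path component after the exported `s3.path_name` would have to be treated as the bucket and the rest as the object key. That behavior is an assumption made for illustration, and `split_request_path` is a hypothetical helper, not a function in the plugin.

```python
def split_request_path(request_path, path_name, configured_bucket=""):
    """Split an incoming path into (bucket, key).

    Assumed rule: with a configured bucket, everything after path_name is the
    key; without one, the first component after path_name names the bucket.
    """
    prefix = path_name.rstrip("/") + "/"
    if not request_path.startswith(prefix):
        raise ValueError(f"{request_path!r} is not under {path_name!r}")
    rest = request_path[len(prefix):]
    if configured_bucket:
        return configured_bucket, rest
    bucket, _, key = rest.partition("/")
    if not bucket:
        raise ValueError("no bucket in path and none configured")
    return bucket, key


# Whole-endpoint export: bucket comes from the path.
print(split_request_path("/aws-opendata/noaa-wod-pds/MD5SUMS", "/aws-opendata"))
# Single-bucket export: bucket comes from the configuration.
print(split_request_path("/aws-opendata/MD5SUMS", "/aws-opendata", "noaa-wod-pds"))
```

Under this rule both calls resolve to bucket `noaa-wod-pds` and key `MD5SUMS`, so the listing prefix would be `MD5SUMS` rather than `noaa-wod-pds/MD5SUMS`.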

bbockelm commented 6 months ago

Yup - my testing was all for specifying a bucket in the s3.begin ... s3.end block. I can go back and test that with your reproducer.

Can you confirm that if you specify a single bucket it works though?

jhiemstrawisc commented 6 months ago

Hardcoding the bucket with the config:

s3.url_style path
s3.begin
s3.path_name /aws-opendata
s3.bucket_name noaa-wod-pds
s3.service_name s3
s3.region us-east-1
s3.service_url https://s3.us-east-1.amazonaws.com
s3.end

still produces a segfault when I run curl -v http://$HOSTNAME:8443/aws-opendata/MD5SUMS. I'm also noticing that the logged request, s3_SendRequest: Sending HTTP request https://noaa-wod-pds.s3.us-east-1.amazonaws.com/?delimiter=%2F&list-type=2&max-keys=1000&prefix=noaa-wod-pds%2FMD5SUMS, is constructing a virtual-hosted-style URL instead of the configured path-style URL.
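For reference, the difference between the two URL styles can be sketched in a few lines of Python. The function name `build_object_url` and its parameters are hypothetical and only mirror the `s3.service_url`, `s3.bucket_name`, and `s3.url_style` settings used above; the plugin's actual C++ logic may differ.

```python
def build_object_url(service_url, bucket, key, url_style):
    """Build a request URL in either 'virtual' or 'path' style."""
    scheme, sep, host = service_url.partition("://")
    if not sep:
        raise ValueError("service_url must include a scheme, e.g. https://")
    host = host.rstrip("/")
    if not bucket:
        # An empty bucket would otherwise yield a host like ".s3.us-east-1..."
        raise ValueError("bucket must be known before building the URL")
    if url_style == "virtual":
        # Bucket becomes part of the hostname.
        return f"{scheme}://{bucket}.{host}/{key}"
    if url_style == "path":
        # Bucket becomes the first path component.
        return f"{scheme}://{host}/{bucket}/{key}"
    raise ValueError(f"unknown url_style: {url_style!r}")


service = "https://s3.us-east-1.amazonaws.com"
print(build_object_url(service, "noaa-wod-pds", "MD5SUMS", "virtual"))
# prints: https://noaa-wod-pds.s3.us-east-1.amazonaws.com/MD5SUMS
print(build_object_url(service, "noaa-wod-pds", "MD5SUMS", "path"))
# prints: https://s3.us-east-1.amazonaws.com/noaa-wod-pds/MD5SUMS
```

With `s3.url_style path`, the listing request would be expected to go to the second form, with the query string appended after the bucket component.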

rw2 commented 5 months ago

I'm back from vacation. Let me know if I can help with any of this, but I don't want to duplicate work.

jhiemstrawisc commented 5 months ago

@rw2, definitely! Do you want to see if you can start stitching up some of the missing pieces to get this working? We have a lot of interest in this right now.

rw2 commented 5 months ago

ok, working on it now. I'll ping you on slack if I have questions.

bbockelm commented 5 months ago

@rw2 - any updates on this? I'd like to get the directory listing wrapped into this month's release, which nominally happens on Thursday.

rw2 commented 5 months ago

Last Thu/Fri: Wouldn't compile for me, which was weird. There must be something different in our environments. Fixed compile. Reproduced it. Got rid of the segfault, but it was a symptom, so the file didn't download. Couldn't work on it yesterday, back at it again this morning.

bbockelm commented 5 months ago

@jhiemstrawisc -- ready for re-review!

Beyond fixing the original branch (and some patches from @rw2), this also adds unit tests for the listing functionality and pre-commit settings (so I stop getting caught by the clang-format linter)!