DigitalSlideArchive / digital_slide_archive

The official deployment of the Digital Slide Archive and HistomicsTK.
https://digitalslidearchive.github.io
Apache License 2.0

Expected S3 behavior: Issues with local S3 server #318

Open codybum opened 7 months ago

codybum commented 7 months ago

We are experimenting with the use of a locally deployed, Java-based S3 server (https://github.com/mindmill/ladon-s3-server) with DSA. We have made numerous updates to the base code, which now supports most boto3 (https://boto3.amazonaws.com/v1/documentation/api/latest/index.html) and CyberDuck (https://cyberduck.io/) commands. We are testing with a single 800 MB Aperio SVS file, "test.svs".
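For reference, this is roughly the kind of client-side check we run against the local server; a minimal boto3 sketch, where the endpoint URL, credentials, and bucket name are placeholders for our test setup:

```python
import boto3

# Placeholder endpoint and credentials for the locally deployed S3 server.
s3 = boto3.client(
    's3',
    endpoint_url='http://localhost:9000',
    aws_access_key_id='testkey',
    aws_secret_access_key='testsecret',
)

# Confirm the object is visible and check its size.
head = s3.head_object(Bucket='dsa-test', Key='test.svs')
print('size:', head['ContentLength'])

# Fetch a small ranged slice, the same kind of request DSA ends up issuing.
resp = s3.get_object(Bucket='dsa-test', Key='test.svs', Range='bytes=0-65535')
print('got', len(resp['Body'].read()), 'bytes')
```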

We are able to add the local S3 server as an Assetstore without issue. The Girder PUT and DELETE tests complete successfully without error. Import of the test file is slow, but it completes successfully according to DSA. We are able to access the test file in a Collection and view the slide in HistomicsUI. While things mostly work, tile updates are very slow and sporadic.
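For completeness, the assetstore itself was added through the Girder API; a sketch of that step with girder_client is below (credentials are placeholders, and the parameter names for the S3 assetstore endpoint are our best understanding and may differ between Girder versions):

```python
import girder_client

gc = girder_client.GirderClient(apiUrl='http://localhost:8080/api/v1')
gc.authenticate('admin', 'password')  # placeholder credentials

# type=2 selects the S3 assetstore type; 'service' points at the local server.
gc.post('assetstore', parameters={
    'name': 'local-s3',
    'type': 2,
    'bucket': 'dsa-test',
    'accessKeyId': 'testkey',
    'secret': 'testsecret',
    'service': 'http://localhost:9000',
})
```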

On the S3 side there are no errors until slide import. In network captures we can see ~50 HTTP requests to the S3 server during the single-file import. We see DSA request the list of objects in the bucket, then a complete-object request for test.svs. The S3 server starts the transfer to DSA for the complete-object request; we see the TCP window size on the DSA side hold at roughly +49k for about 10 packets, at which point it is reduced to 8192, and in the next packet we see a TCP reset (RST,ACK) from DSA. An (RST,ACK) from DSA would typically indicate that the socket on the DSA side has been closed. We observe a write error on the S3 side that also reports the socket is already closed.

The packet capture can be seen here:

[screenshot: packet capture of the import transfer]

In subsequent DSA requests the "Range: bytes=[start_byte-end_byte]" header is set, but as with the full-object request, after some number of packets the TCP window size decreases and we see a reset from the DSA side. The range/partial requests are not identical: the offset (start_byte) changes while the end_byte remains the same. All subsequent ranged requests are toward the end of the test.svs file.

In an attempt to debug this issue, I partially re-implemented the downloadFile function (https://girder.readthedocs.io/en/latest/_modules/girder/utility/s3_assetstore_adapter.html#S3AssetstoreAdapter.downloadFile) from Girder in my test code. I took the HTTP requests from the import and re-ran them, confirming that the MD5 hashes provided by the S3 server are identical to those of the local file. The transfers end with (FIN,ACK), as expected.

[screenshot: packet capture of the re-run transfers ending with FIN,ACK]
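The re-run check looks roughly like the sketch below, replaying a captured ranged request against the local S3 server and hashing both sides (the URL and byte offsets are placeholders for the captured values):

```python
import hashlib
import requests

url = 'http://localhost:9000/dsa-test/test.svs'   # placeholder object URL
start, end = 834000000, 834850408                 # placeholder captured range

resp = requests.get(url, headers={'Range': f'bytes={start}-{end}'}, stream=True)
remote_md5 = hashlib.md5()
for chunk in resp.iter_content(chunk_size=65536):
    remote_md5.update(chunk)

local_md5 = hashlib.md5()
with open('test.svs', 'rb') as f:
    f.seek(start)
    local_md5.update(f.read(end - start + 1))

print(remote_md5.hexdigest() == local_md5.hexdigest())
```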

It is far more likely there is something wrong with the S3 implementation than with DSA, but I am having a hard time reproducing these issues outside of DSA.

A few questions to help me along:

manthey commented 7 months ago

DSA uses a variety of libraries to read images. Many of these libraries require file-like access to the images, so rather than fetching them directly from S3, we expose the data in girder in a FUSE file system and the image libraries read those files. Some of the image libraries (notably openslide) do a lot of open-seek-read-close on small fragments of files (especially ndpi files), and these manifest as range requests.
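For example, reading a single tile through openslide from the FUSE mount turns into a burst of those small reads; a minimal sketch (the path is just an example location under the mount):

```python
import openslide

# Example path under the girder FUSE mount.
slide = openslide.OpenSlide('/fuse/collection/slides/slide_data/test.svs')
print(slide.dimensions, slide.level_count)

# Reading one 256x256 region at the base level triggers a series of small
# open/seek/read calls on the FUSE file, which surface as S3 range requests.
tile = slide.read_region((0, 0), 0, (256, 256))
tile.convert('RGB').save('tile.png')
slide.close()
```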

Internally, the FUSE file system uses the python requests library to fetch from s3 with a get call with stream=True. I don't see any code where we explicitly end that call; we let the python garbage collector do whatever it does and trust the requests library to do the right thing. Perhaps this should be explicitly closed, but it isn't.
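The pattern in question is roughly the following (a sketch of the idea, not the actual girder code):

```python
import requests

def read_range(url, start, end=None):
    """Stream part of an S3 object; url would be the signed object URL."""
    byte_range = f'bytes={start}-' if end is None else f'bytes={start}-{end}'
    resp = requests.get(url, headers={'Range': byte_range}, stream=True)
    try:
        for chunk in resp.iter_content(chunk_size=65536):
            yield chunk
    finally:
        # girder currently leaves this to garbage collection; closing here
        # releases the connection (and any unread body) deterministically.
        resp.close()
```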

There is an option to enable diskcache on the mount (see https://github.com/DigitalSlideArchive/digital_slide_archive/blob/master/devops/dsa/docker-compose.yml#L26-L28); this will make 128kB range requests (I think) rather than really small range requests and cache the results so you end up with vastly fewer partial requests.

I would expect reading the first bytes of the file, even if the image library doesn't mean to read the whole thing, to appear as a full object request (since at the file-system level it is an open without a seek). This depends on the image library doing the reading as well as on the file format.

I'm not sure how to proxy requests from one S3 server to another.

manthey commented 7 months ago

Yes, that is the default for where and how fuse is mounted. There are some loopback example fuse implementations that would leave libfuse in the loop but not have any of our code. I'd expect arbitrary seeks and reads to work through to the S3 server.
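A quick way to exercise that path without any of our code is to do the seeks yourself against the mounted file, e.g. (the path is an example location under the mount):

```python
import hashlib
import os

path = '/fuse/collection/slides/slide_data/test.svs'  # example mounted file
size = os.path.getsize(path)

with open(path, 'rb') as f:
    # Arbitrary seeks and reads; each read should surface as an S3 range
    # request (or be served from the diskcache when that option is enabled).
    for offset in (0, size // 2, max(size - 65536, 0)):
        f.seek(offset)
        data = f.read(65536)
        print(offset, len(data), hashlib.md5(data).hexdigest())
```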

On Wed, Feb 14, 2024 at 6:27 AM, V. K. Cody Bumgardner wrote:

Thanks @manthey https://github.com/manthey, this was very helpful. It looks like we were going down the right debugging path for a typical Girder S3 Assetstore, but not the right path for DSA. Would it be fair to say that we need to confirm our custom S3 server works appropriately with FUSE before worrying about what is happening within DSA and the slide libraries?

Could you confirm the following, so that we might create a test environment similar to what is used by DSA:

The following packages appear to be installed in the DSA container:

```
fuse/now 2.9.9-5ubuntu3 amd64 [installed,local]
libfuse2/now 2.9.9-5ubuntu3 amd64 [installed,local]
```

I also see the Python library fusepy (https://github.com/fusepy/fusepy) installed:

```
pip list | grep fuse
fusepy    3.0.1
```

You appear to be mounting the FUSE file system via the Girder CLI mount command: https://github.com/girder/girder/blob/e0a12ff2c5a74649833313ec2c374b8653390892/girder/cli/mount.py#L542

FUSE is mounted on /fuse:

```
ServerFuse on /fuse type fuse (ro,nosuid,nodev,relatime,user_id=0,group_id=0)
```

Here is our slide data appearing as a file on a local filesystem:

```
/fuse/collection/slides/slide_data/test.svs# ls -la
total 101911
dr-x------ 1 root root         0 Feb 12 21:03 .
dr-x------ 1 root root         0 Feb 12 21:03 ..
-r-------- 1 root root 834850409 Feb 12 21:03 test.svs
```


codybum commented 7 months ago

@manthey I think I have finally poked around enough to see what is going on. I have a test suite that replicates, in part, the Girder process (file -> file model -> file handle -> abstract assetstore -> S3 assetstore -> S3 -> file). I posted an issue on the Girder repo in case it is an actual bug (https://github.com/girder/girder/issues/3513).

We will make sure our S3 server can handle the existing transfer methods and will modify our instance of Girder if needed. If you agree that the reported issue is a real problem, I would be happy to create a pull request and/or tests for an alternative solution.

codybum commented 7 months ago

> There is an option to enable diskcache on the mount (see https://github.com/DigitalSlideArchive/digital_slide_archive/blob/master/devops/dsa/docker-compose.yml#L26-L28); this will make 128kB range requests (I think) rather than really small range requests and cache the results so you end up with vastly fewer partial requests.

Is this the correct way to enable caching?

```
DSA_USER=$(id -u):$(id -g) DSA_GIRDER_MOUNT_OPTIONS="-o diskcache,diskcache_size_limit=2147483648" docker-compose up
```

I think I have caching enabled; I see new files and directories being created in ~/.cache/girder-mount, but only about 20 MB is being cached out of the many GB being transferred.

It looks like the 128k you mentioned is the size of the cache chunks, but this does not seem to have any impact on the size of the S3 requests: https://github.com/girder/girder/blob/89ab9976b1b085df279c9082b2df43ab7e24cd60/girder/cli/mount.py#L106

It is possible I don't have caching configured correctly.

manthey commented 7 months ago

Yes, the diskcache 128k is the granularity of the cache. I'd expect the requests to S3 to all start at byte multiples of 128k and to have lengths that are either unbounded or multiples of 128k. I'll set up a local/minio S3 mount to see if I can get the same results as you.
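Concretely, with 128k granularity I'd expect an arbitrary read to be widened to the enclosing 128k-aligned blocks, something like:

```python
BLOCK = 128 * 1024  # diskcache granularity

def aligned_range(offset, length):
    """Widen a read to whole 128k cache blocks; this is the byte range I'd
    expect to show up in the S3 Range header when diskcache is enabled."""
    start = (offset // BLOCK) * BLOCK
    end = ((offset + length + BLOCK - 1) // BLOCK) * BLOCK - 1
    return start, end

# A 4 kB read in the middle of the file becomes a single 128 kB request.
print(aligned_range(834_000_000, 4096))  # (833880064, 834011135)
```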

codybum commented 5 months ago

@manthey you were right about the 128k requests. I could not initially see the request size because of the Girder issue reported here: https://github.com/girder/girder/issues/3513

If we set the download end byte, in this case offset + 128k, we see 128k ranged requests. If we don't set the end byte on the read, we see requests that run from the offset to the end of the file (the "offset + file_size" case).
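In header terms the difference is just whether the end byte is present; a small illustration (file size taken from the listing earlier in the thread):

```python
FILE_SIZE = 834850409  # size of test.svs

def range_header(offset, end_byte=None):
    # With an explicit end byte we see fixed 128k windows; without one the
    # request effectively covers everything from the offset to the end of
    # the file (the "offset + file_size" requests mentioned above).
    end = end_byte if end_byte is not None else FILE_SIZE - 1
    return {'Range': f'bytes={offset}-{end}'}

print(range_header(833880064, 833880064 + 128 * 1024 - 1))  # 128k ranged request
print(range_header(833880064))                              # reads to end of file
```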

codybum commented 5 months ago

@manthey things are working, but a bit slowly, and I wonder if the number of concurrent requests is constrained. Our test VM only has two vCPUs, so I suspect this is an issue. Is there a setting for concurrent Girder/FUSE requests that I might experiment with? The VM is not under load, but the number of requests to S3 seems gated.