awslabs / mountpoint-s3

A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
Apache License 2.0

Add ML benchmarks #369

Open tchaton opened 1 year ago

tchaton commented 1 year ago

Tell us more about this new feature.

Hey there,

Awesome initiative. I have been looking for this for a long time.

I would be very interested to see how this compares to leading frameworks like RClone or GeeseFS on resized 256x256 ImageNet or Kinetics 700.

I think benchmarks are key for adoption of this new library and should be placed in the README for visibility. Best, T.C

monthonk commented 1 year ago

Thanks for your suggestion. We will look into running real workloads for the benchmark. In the meantime, we have some basic file operation benchmarks that run on every commit and you can see the results at https://awslabs.github.io/mountpoint-s3/dev/bench/ and https://awslabs.github.io/mountpoint-s3/dev/latency_bench/. We also documented it in BENCHMARKING.

tchaton commented 1 year ago

Hey @monthonk, sounds good. I strongly recommend coming up with heavy machine learning benchmarks before going out of alpha.

With RClone, GeeseFS, etc., there are some caveats at scale: files disappear and we need to make DataLoaders implement some sort of retries, or the mount itself fails. So far, direct S3 download has been the most reliable & fastest in our tests, but it requires code changes.

I think it would be great if mountpoint-s3 could alleviate those.
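To make the failure mode concrete, here is a rough sketch (a hypothetical helper, not taken from any of the libraries mentioned) of the kind of retry wrapper DataLoaders end up needing around a flaky mount:

```python
import time
from pathlib import Path


def read_with_retries(path, attempts=3, backoff=0.5):
    """Read a file from a mounted bucket, retrying transient I/O errors.

    Hypothetical workaround code, illustrating the retries that DataLoaders
    currently have to implement on top of other mounting solutions.
    """
    last_exc = None
    for attempt in range(attempts):
        try:
            return Path(path).read_bytes()
        except OSError as exc:  # includes FileNotFoundError on vanished files
            last_exc = exc
            time.sleep(backoff * (2 ** attempt))  # simple exponential backoff
    raise last_exc
```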

Here are some extra feature requests:

I think this would send a very positive signal to the industry that this tool is ready to be used in production.

Furthermore, I am available to chat if you want to get some user feedback.

Best, T.C

jamesbornholt commented 1 year ago

Thanks for the super helpful feedback! Some of this stuff is already on our radar, including benchmarking, and we'll look into the others. We generally track future features on our public roadmap.

A couple of specific notes:

Implement multi-part download to speed up training with very large files

Mountpoint has this one already, using the AWS Common Runtime. Reads of files in Mountpoint are automatically streamed with concurrent ranged GetObject requests.
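For readers who want to compare against what Mountpoint does internally, here is a rough application-level analogue of concurrent ranged GetObject requests using boto3 and a thread pool (a sketch only; Mountpoint's actual prefetcher lives in the AWS Common Runtime, and the bucket, key, and part size below are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import boto3


def ranged_get(bucket, key, part_size=8 * 1024 * 1024, max_workers=8):
    """Download one object with concurrent ranged GetObject requests.

    A rough application-level analogue of what Mountpoint's prefetcher does
    internally; not Mountpoint configuration or code.
    """
    s3 = boto3.client("s3")
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    ranges = [
        (start, min(start + part_size, size) - 1)
        for start in range(0, size, part_size)
    ]

    def fetch(byte_range):
        start, end = byte_range
        resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={start}-{end}")
        return resp["Body"].read()

    # map() preserves order, so the parts concatenate back into the object.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return b"".join(pool.map(fetch, ranges))
```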

Enable deciding on the fly whether a file should be cached or not

We're tracking ideas like this one in #255.

A Python async/sync OS library with resilient reading and multi-part reading to improve speed and robustness.

Mountpoint itself already handles retries and multi-part internally. Do you think anything else is necessary here from the application side? I'm not sure what to expect from e.g. PyTorch DataLoaders here.
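As a point of reference, this is roughly all the application side needs to look like when Mountpoint handles retries and multi-part reads internally (a sketch under the assumption the bucket is mounted at a path like `/mnt/s3/imagenet-256`; not an official example):

```python
from pathlib import Path

from torch.utils.data import DataLoader, Dataset


class MountpointDataset(Dataset):
    """Plain-file dataset over a bucket mounted with Mountpoint.

    Retries and multi-part reads happen inside Mountpoint, so this is
    ordinary file I/O from the application's point of view.
    """

    def __init__(self, root):
        self.files = sorted(Path(root).rglob("*.jpg"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Decoding/augmentation would happen here; raw bytes keep the sketch short.
        return self.files[idx].read_bytes()


loader = DataLoader(MountpointDataset("/mnt/s3/imagenet-256"),
                    batch_size=64, num_workers=8)
```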

Thanks again!

tchaton commented 1 year ago

Dear @jamesbornholt,

Thanks for the reply!

Mountpoint itself already handles retries and multi-part internally. Do you think anything else is necessary here from the application side? I'm not sure what to expect from e.g. PyTorch DataLoaders here.

Yes, this is great! However, I believe you can squeeze out more performance by having an s3-mountpoint client. It is very hard to get exactly the right tuning for all use cases, and I think providing a versatile client would go a long way. Similar to Rclone RC but with a smarter API and finer control.

A use case would be a custom DataLoader using the s3-mountpoint client to request pre-fetched files, better async I/O transfer from the mountpoint, etc. Reference: https://github.com/Lightning-AI/lightning/blob/master/src/lightning/data/datasets/mapping.py#L11. This is the PyTorch Lightning Dataset; we are using fsspec to fetch the files from S3.

Another application would be to have more granular control over which files should be refreshed.

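To illustrate the kind of API being asked for, here is a purely hypothetical sketch (neither `MountpointClient` nor a `prefetch` call exists today; this is only what a custom DataLoader driving a smarter client could look like):

```python
from torch.utils.data import Dataset


class PrefetchingDataset(Dataset):
    """Hypothetical dataset that hints an s3-mountpoint client to pre-fetch
    upcoming files while the current one is read through the mount."""

    def __init__(self, files, client, lookahead=4):
        self.files = files            # paths under the mount point
        self.client = client          # hypothetical s3-mountpoint client handle
        self.lookahead = lookahead

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # Hint the mount to start fetching the next few files (hypothetical call).
        for path in self.files[idx + 1 : idx + 1 + self.lookahead]:
            self.client.prefetch(path)
        with open(self.files[idx], "rb") as f:
            return f.read()
```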

tchaton commented 1 year ago

Another suggestion is to provide a way to disable the HeadBucket request. This doesn't work well for us. We had to do some patching.

dannycjones commented 1 year ago

Hey @tchaton,

Regarding the HeadBucket request, we replaced it with a ListObjectsV2 request in df4087bd63de7ff31984d9cc0e4a0db951359c11 about two weeks ago to help support customers who didn't want to or could not grant the HeadBucket permission. v1.0.0 contains this change.
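For anyone reproducing the same probe outside Mountpoint, the equivalent lightweight check looks roughly like this in Python with boto3 (an illustration of the idea only, not the code path inside Mountpoint; bucket and prefix are placeholders):

```python
import boto3
from botocore.exceptions import ClientError


def bucket_reachable(bucket, prefix=""):
    """Probe access with a scoped ListObjectsV2 call rather than HeadBucket,
    so policies that only allow s3:ListBucket on a prefix can still pass."""
    s3 = boto3.client("s3")
    try:
        s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
        return True
    except ClientError as err:
        print(f"bucket check failed: {err.response['Error']['Code']}")
        return False
```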

Does that work for you?

jamesbornholt commented 1 year ago

Yes, this is great! However, I believe you can squeeze out more performance by having an s3-mountpoint client. It is very hard to get exactly the right tuning for all use cases, and I think providing a versatile client would go a long way. Similar to Rclone RC but with a smarter API and finer control.

A use case would be a custom DataLoader using the s3-mountpoint client to request pre-fetched files, better async I/O transfer from the mountpoint, etc. Reference: https://github.com/Lightning-AI/lightning/blob/master/src/lightning/data/datasets/mapping.py#L11. This is the PyTorch Lightning Dataset; we are using fsspec to fetch the files from S3.

Another application would be to have more granular control over which files should be refreshed.


Ah, I see, that makes sense — thanks! We're definitely thinking about what the right thing to do here is, and whether the smarts should live in Mountpoint itself versus somewhere closer to the application.

We've started working on some (very synthetic) PyTorch benchmarks here: https://github.com/awslabs/mountpoint-s3/pull/440, but there's probably a bunch more we could do there.

tchaton commented 1 year ago

Hey @tchaton,

Regarding the HeadBucket request, we replaced it with a ListObjectsV2 request in df4087b about two weeks ago to help support customers who didn't want to or could not grant the HeadBucket permission. v1.0.0 contains this change.

Does that work for you?

Hey @dannycjones. Our challenge was: our EC2 IAM role has restricted access to sub-keys on S3, like s3://{BUCKET_NAME}/{SOME_PATH}, and the HeadBucket request blocked us from even trying, as it would fail due to lack of permission. We were providing the region as an environment variable, but somehow it wasn't picked up.

Great to hear! We will give it another try and come back to you ;)

tchaton commented 1 year ago

Dear @jamesbornholt,

Thanks for using PyTorch Lightning in your benchmark example. I am one of the core developers of the PyTorch Lightning framework, so there is probably room for collaboration there, as we have heavily benchmarked all the other mounting solutions.

Here are some metrics we look at closely when benchmarking, as a function of (num_workers, batch_size), on ImageNet resized to (256, 256):

Concerning 1), this can be optimised by providing an index file to avoid listing S3 on start, especially when the dataset is fixed and very large (millions of files). This is supported in the LightningDataset: https://github.com/Lightning-AI/lightning/blob/master/src/lightning/data/datasets/mapping.py#L29. It would be fantastic if mountpoint-s3 added 4 new arguments to help with this:

Using multiple nodes, this would enable users to provide a sharded index to further reduce the in-memory index kept by mountpoint-s3.
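To make the index-file idea concrete, here is a rough sketch (not an existing mountpoint-s3 feature; the bucket, prefix, and paths are placeholders): the key list is built once offline with boto3 and persisted, and the Dataset then resolves items through the mount without ever listing S3 at start-up.

```python
import json
from pathlib import Path

import boto3
from torch.utils.data import Dataset


def build_index(bucket, prefix, index_path="index.json"):
    """List the bucket once, offline, and persist the keys to a local file."""
    s3 = boto3.client("s3")
    keys = []
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix)
    for page in pages:
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    Path(index_path).write_text(json.dumps(keys))
    return keys


class IndexedMountDataset(Dataset):
    """Resolve items through a prebuilt index so training never lists S3."""

    def __init__(self, mount_root, index_path="index.json"):
        self.root = Path(mount_root)  # e.g. the directory Mountpoint is mounted on
        self.keys = json.loads(Path(index_path).read_text())

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        return (self.root / self.keys[idx]).read_bytes()
```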

Ah, I see, that makes sense — thanks! We're definitely thinking about what the right thing to do here is, and whether the smarts should live in Mountpoint itself versus somewhere closer to the application.

That's a great consideration. Other solutions tend to be black boxes and very hard to customise. I think a solution that can be customised after start-up would go a long way.

I can imagine other examples we'd like to build and stick in this new examples directory: ffmpeg (like the blog post), genomics stuff

I would like to suggest an example using our optimised LLM repo, https://github.com/Lightning-AI/lit-gpt, with the RedPajama dataset: https://github.com/togethercomputer/RedPajama-Data. Those are large files, and multi-node training on them is currently a challenging problem for the community training LLMs.

Finally, I found it was possible to get a 40% speed-up with fsspec using asyncio by relying on the getitems function of a PyTorch Dataset. Very likely, it is possible to implement an asynchronous version of Dataset using https://github.com/Tinche/aiofiles to make mountpoint-s3 shine even more. Even more speed-up can be gained by adding FFCV: https://github.com/libffcv/ffcv.
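A rough sketch of that asynchronous-Dataset idea (assuming a PyTorch version whose DataLoader honours batched `__getitems__` fetching; the 40% figure above comes from our fsspec experiments, not from this code, and the file glob is a placeholder):

```python
import asyncio
from pathlib import Path

import aiofiles
from torch.utils.data import Dataset


class AsyncReadDataset(Dataset):
    """Reads batches of files concurrently with asyncio + aiofiles."""

    def __init__(self, root):
        self.files = sorted(Path(root).rglob("*.bin"))

    def __len__(self):
        return len(self.files)

    async def _read(self, idx):
        async with aiofiles.open(self.files[idx], "rb") as f:
            return await f.read()

    def __getitem__(self, idx):
        return asyncio.run(self._read(idx))

    def __getitems__(self, indices):
        # Recent DataLoaders call this with a whole batch of indices,
        # which lets the file reads overlap instead of running serially.
        async def gather_batch():
            return await asyncio.gather(*(self._read(i) for i in indices))

        return asyncio.run(gather_batch())
```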

Best regards, Thomas Chaton.

dannycjones commented 1 year ago

Hey @tchaton, Regarding the HeadBucket request, we replaced it with a ListObjectsV2 request in df4087b about two weeks ago to help support customers who didn't want to or could not grant the HeadBucket permission. v1.0.0 contains this change. Does that work for you?

Hey @dannycjones. Our challenge was: our EC2 IAM role has restricted access to sub-keys on S3, like s3://{BUCKET_NAME}/{SOME_PATH}, and the HeadBucket request blocked us from even trying, as it would fail due to lack of permission. We were providing the region as an environment variable, but somehow it wasn't picked up.

Great to hear! We will give it another try and come back to you ;)

Absolutely, this is the kind of feedback we got and why we moved from the HeadBucket check to the ListObjectsV2 check. We also added support for the AWS_REGION environment variable in ae18473cf5668df8fd6e09b49b7a665464786b8b, which is part of v1.0.

tchaton commented 1 year ago

@dannycjones @jamesbornholt

I would like to make one extra recommendation as an open source developer ;)

You should adopt a CHANGELOG.md for this repo, as we do in PyTorch Lightning: https://github.com/Lightning-AI/lightning/blob/03ca31c3d3a2c4cb5633f14c8275767cdaf0795a/src/lightning/fabric/CHANGELOG.md#L4, with PR references. You can then include it within your release notes: https://github.com/Lightning-AI/lightning/releases/tag/2.0.0.

Without it, it is very hard for users to understand exactly what went into a release and the trade-offs of some changes.

Best, Thomas Chaton