argonne-lcf / dlio_benchmark

An I/O benchmark for deep learning applications
https://dlio-benchmark.readthedocs.io
Apache License 2.0

Adding S3 support when PyTorch framework is selected. #138

Open zhenghh04 opened 9 months ago

zhenghh04 commented 9 months ago

Check whether we can adopt the PyTorch S3 support: https://pytorch.org/data/main/generated/torchdata.datapipes.iter.S3FileLoader.html

zhenghh04 commented 9 months ago

@hariharan-devarajan could you take a look and see whether this is good to include?

hariharan-devarajan commented 9 months ago

I think it is good, but my only concern is that in PyTorch data loaders, iterable input pipelines are less parallelizable than indexed ones. We can probably convert this into an indexed pipeline.

We can use get_object and put_object to build our own pipeline and compare it against an iterable version using our native data loader implementations.
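To make the indexed-versus-iterable distinction concrete, here is a minimal sketch, not DLIO code. The names `IndexedS3Samples`, `iterable_s3_samples`, and `s3_get_object` are hypothetical; the only real API assumed is boto3's `get_object`. The fetch function is injectable so the sketch can be exercised without S3 access.

```python
from typing import Callable, Iterator, Sequence


def s3_get_object(bucket: str, key: str) -> bytes:
    """Fetch one object via boto3's get_object (needs boto3 and credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    client = boto3.client("s3")
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()


class IndexedS3Samples:
    """Indexed (map-style) access: sample i can be fetched independently,
    so a sampler can hand out disjoint indices to parallel workers."""

    def __init__(self, bucket: str, keys: Sequence[str],
                 fetch: Callable[[str, str], bytes] = s3_get_object):
        self.bucket = bucket
        self.keys = list(keys)
        self.fetch = fetch

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, index: int) -> bytes:
        return self.fetch(self.bucket, self.keys[index])


def iterable_s3_samples(bucket: str, keys: Sequence[str],
                        fetch: Callable[[str, str], bytes] = s3_get_object
                        ) -> Iterator[bytes]:
    """Iterable access: samples come out strictly in sequence, so sharding
    across workers has to be arranged by the caller."""
    for key in keys:
        yield fetch(bucket, key)
```

Because `IndexedS3Samples` implements `__len__` and `__getitem__`, it follows the protocol PyTorch expects of a map-style dataset, which is what would allow a comparison against the iterable version.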

krehm commented 9 months ago

FWIW, I have had S3 working for a while now with torch in my test setup, but I used a different method. My task was to get DAOS working with fsspec so that DAOS pathnames can be used with DLIO without requiring the dfuse layer. I modified the readers and generators to open files with fsspec rather than with (kernel-only) pathnames. Everything else after that works the same, but now I can provide paths like s3://my-bucket/my-file and daos://my-pool/my-cont/my-file and use them with DLIO. There are lots of other backends available with fsspec besides S3 and DAOS (and POSIX files) that would automatically work with DLIO.
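As a rough illustration of the idea (not the code from the branch), a reader could route every open call through `fsspec.open`, which accepts both plain local paths and protocol-prefixed URLs. The helper names `open_path` and `read_sample` are hypothetical. S3 URLs additionally need the `s3fs` package installed, and other backends need their own fsspec implementation.

```python
def open_path(path: str, mode: str = "rb"):
    """Return a context manager that yields a file-like object for `path`.

    With fsspec installed, `path` may be a local path or a URL such as
    s3://my-bucket/my-file. Without fsspec, only local paths are handled.
    """
    try:
        import fsspec  # third-party; imported lazily for this sketch
    except ImportError:
        if "://" in path:
            raise  # a remote URL cannot be opened without fsspec
        return open(path, mode)
    return fsspec.open(path, mode)


def read_sample(path: str) -> bytes:
    """Read a whole file or object; the caller never sees which backend ran."""
    with open_path(path, "rb") as f:
        return f.read()
```

The point of the design is that the reader code after the open call stays identical for every backend.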

zhenghh04 commented 9 months ago

@krehm would you mind sending a PR?

krehm commented 9 months ago

See the following. It is a bit out of date, but it should give you the idea.

https://github.com/argonne-lcf/dlio_benchmark/compare/main...krehm:dlio_benchmark:feature/fsspec-storage

hariharan-devarajan commented 9 months ago

@zhenghh04 I have seen this PR. I think we should create a PR, and then I can take a look at it.

From my memory, the main thing is that our storage interface is a little weird right now. Ideally, all I/O should happen through the storage interface, and the storage interface should support fsspec for different backend options.
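One way to picture that layering is the sketch below. It is purely hypothetical and does not reflect DLIO's actual storage classes: `Storage`, `FsspecStorage`, `InMemoryStorage`, and `generate_and_read` are illustrative names. The only library calls assumed are `fsspec.open` in read and write modes.

```python
from abc import ABC, abstractmethod
from typing import Dict


class Storage(ABC):
    """Hypothetical interface: generators and readers only call get/put."""

    @abstractmethod
    def get(self, path: str) -> bytes: ...

    @abstractmethod
    def put(self, path: str, data: bytes) -> None: ...


class FsspecStorage(Storage):
    """Delegates to fsspec, so the backend is chosen by the path's protocol."""

    def get(self, path: str) -> bytes:
        import fsspec  # third-party; imported lazily for this sketch

        with fsspec.open(path, "rb") as f:
            return f.read()

    def put(self, path: str, data: bytes) -> None:
        import fsspec

        with fsspec.open(path, "wb") as f:
            f.write(data)


class InMemoryStorage(Storage):
    """Dictionary-backed stand-in, handy for tests without any file system."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        return self._objects[path]

    def put(self, path: str, data: bytes) -> None:
        self._objects[path] = bytes(data)


def generate_and_read(storage: Storage, path: str, payload: bytes) -> bytes:
    """Write a sample and read it back through the same interface."""
    storage.put(path, payload)
    return storage.get(path)
```

With this shape, swapping `InMemoryStorage` for `FsspecStorage` changes the backend without touching the code that generates or reads samples.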

But I am on-board with the fsspec approach for sure.

krehm commented 9 months ago

I will work on cleaning up the code and making a PR. It seems to me that there were a couple of loose ends when I last tested with it, so I need to dust off my notes. Note also that I will be in a car Wednesday through Friday, so I will be unresponsive until early next week.