Open zhenghh04 opened 9 months ago
@hariharan-devarajan could you take a look whether this is good to include?
I think it is good, but the only concern I have is that in PyTorch data loaders, iterable input pipelines are less parallelizable than indexed ones. We can probably convert this into an indexed pipeline.
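For context, the indexed-vs-iterable distinction can be sketched in plain Python without a torch dependency (class and sample names here are illustrative, not DLIO or PyTorch code): a map-style (indexed) dataset exposes `__getitem__`/`__len__`, so parallel workers can each be handed a disjoint index shard, while an iterable dataset only exposes `__iter__` and the stream must be split by other means.

```python
# Illustrative sketch of map-style (indexed) vs iterable datasets.
# Torch is deliberately not imported; these mirror its dataset protocols.

class IndexedDataset:
    """Map-style: random access lets workers shard indices independently."""
    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


class StreamDataset:
    """Iterable-style: samples arrive in stream order only, so parallel
    workers must coordinate (or duplicate work) to split the stream."""
    def __init__(self, samples):
        self.samples = samples

    def __iter__(self):
        yield from self.samples


data = list(range(8))
indexed = IndexedDataset(data)

# Two hypothetical workers each take a disjoint index shard:
worker0 = [indexed[i] for i in range(0, len(indexed), 2)]
worker1 = [indexed[i] for i in range(1, len(indexed), 2)]
assert sorted(worker0 + worker1) == data

# The iterable version yields the same samples, but only sequentially:
assert list(StreamDataset(data)) == data
```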
We can use `get_object` and `put_object` to build our own pipeline and compare it against an iterable version using our native data loader implementations.
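A minimal sketch of what such an indexed pipeline over `get_object`/`put_object` might look like. The in-memory store below is a stand-in whose method names mirror the S3-style calls mentioned above; it is not DLIO's actual API:

```python
# Hypothetical object store with S3-style get_object/put_object calls,
# wrapped in a map-style (indexed) dataset. Stand-in code, not DLIO's API.

class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put_object(self, key: str, data: bytes):
        self._objects[key] = data

    def get_object(self, key: str) -> bytes:
        return self._objects[key]


class ObjectDataset:
    """Indexed pipeline: item i is fetched by key, so any worker can
    read any sample directly instead of consuming a shared stream."""
    def __init__(self, store, keys):
        self.store = store
        self.keys = list(keys)

    def __len__(self):
        return len(self.keys)

    def __getitem__(self, idx):
        return self.store.get_object(self.keys[idx])


store = ObjectStore()
for i in range(4):
    store.put_object(f"sample-{i}", bytes([i]))

ds = ObjectDataset(store, [f"sample-{i}" for i in range(4)])
assert len(ds) == 4 and ds[2] == b"\x02"
```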
FWIW, I have had S3 working for a while now with torch in my test setup, but I used a different method. My task was to get DAOS working with fsspec so that DAOS pathnames can be used with DLIO without requiring the dfuse layer. I modified the readers and generators to open files with fsspec rather than with (kernel-only) pathnames; everything else after that works the same, but now I can provide paths like `s3://my-bucket/my-file` and `daos://my-pool/my-cont/my-file` and use them with DLIO. There are lots of other backends available with fsspec besides S3 and DAOS (and POSIX files) that would automatically work with DLIO.
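The approach described above boils down to dispatching on the URL scheme instead of assuming kernel pathnames, which is exactly what fsspec does internally via its protocol registry when you call `fsspec.open(path)`. A stdlib-only sketch of that dispatch (the backend registry and class names here are illustrative, not fsspec's real registry):

```python
from urllib.parse import urlsplit

# Illustrative scheme-to-backend registry; fsspec maintains a real one
# keyed by protocol, populated by installed filesystem implementations.
BACKENDS = {
    "s3": "S3FileSystem",
    "daos": "DAOSFileSystem",
    "": "LocalFileSystem",  # plain POSIX paths have no scheme
}

def pick_backend(path: str) -> str:
    """Return the filesystem implementation name for a path or URL."""
    scheme = urlsplit(path).scheme
    if scheme not in BACKENDS:
        raise ValueError(f"no backend registered for scheme {scheme!r}")
    return BACKENDS[scheme]

assert pick_backend("s3://my-bucket/my-file") == "S3FileSystem"
assert pick_backend("daos://my-pool/my-cont/my-file") == "DAOSFileSystem"
assert pick_backend("/tmp/data.bin") == "LocalFileSystem"
```

Because the readers only hand fsspec a path string, adding another backend is a registry entry rather than a code change in DLIO itself.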
@krehm would you mind sending a PR?
See the following, it is a bit out of date, but should give you the idea.
@zhenghh04 I have seen this PR. I think we should create a PR, and then I can take a look at it.
From memory, the main thing is that our storage interface is a little weird right now. Ideally, all I/O should happen through the storage interface, and the storage interface should support fsspec for the different backend options.
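One way to read the "all I/O through the storage interface" idea: readers depend only on an abstract open method, and an fsspec-backed implementation can be swapped in behind it. A stdlib-only sketch under that assumption (class names are hypothetical, not DLIO's current interface):

```python
import io
from abc import ABC, abstractmethod

class StorageInterface(ABC):
    """Hypothetical single choke point for all reader/generator I/O."""
    @abstractmethod
    def open(self, path: str, mode: str = "rb"):
        ...

class InMemoryStorage(StorageInterface):
    """Stand-in backend; an fsspec-backed one would call fsspec.open()."""
    def __init__(self):
        self._files = {}

    def open(self, path, mode="rb"):
        if "w" in mode:
            buf = io.BytesIO()
            self._files[path] = buf
            return buf
        # Return a fresh reader over the stored bytes.
        return io.BytesIO(self._files[path].getvalue())

def read_sample(storage: StorageInterface, path: str) -> bytes:
    # A reader only ever touches the interface, never raw pathnames,
    # so s3://, daos://, and POSIX paths all take the same code path.
    with storage.open(path, "rb") as f:
        return f.read()

storage = InMemoryStorage()
storage.open("s3://bucket/sample", "wb").write(b"payload")
assert read_sample(storage, "s3://bucket/sample") == b"payload"
```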
But I am on-board with the fsspec approach for sure.
I will work on cleaning up the code and making a PR. It seems to me that there were a couple of loose ends when I last tested with it; I need to dust off my notes. Note also that I will be in a car Wednesday through Friday, so I will be unresponsive until early next week.
Check whether we can adopt PyTorch's S3 support: https://pytorch.org/data/main/generated/torchdata.datapipes.iter.S3FileLoader.html