Closed by awsankur 8 months ago
Hi @awsankur, thanks for your report!
The issue you are running into is caused by the webdataset.tariterators.tar_file_expander function returning a plain iterable (a generator) rather than the Dataset that the DataLoader constructor requires. The external torchdata package we use in the example provides an IterableWrapper, which can be used to fix the issue:
...
s3_dataset = s3torchconnector.S3IterableDataset.from_prefix(IMAGES_URI, region=REGION, transform=shard_to_dict)
tar_dataset = webdataset.tariterators.tar_file_expander(s3_dataset)
dataset = torchdata.datapipes.iter.IterableWrapper(tar_dataset)
loader = torch.utils.data.DataLoader(dataset, batch_size=4)
...
Note that this starts from an S3IterableDataset rather than an S3MapDataset. Both will work, but converting with IterableWrapper loses all the advantages that S3MapDataset provides.
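If you would rather not depend on torchdata, a similar effect can probably be achieved with a small torch.utils.data.IterableDataset subclass. The following is only a minimal sketch, not code from the connector's examples: GeneratorDataset and make_samples are hypothetical names, and make_samples stands in for the tar_file_expander call so the snippet runs without S3 access.

```python
from torch.utils.data import DataLoader, IterableDataset


class GeneratorDataset(IterableDataset):
    """Hypothetical helper: exposes a generator factory as an IterableDataset."""

    def __init__(self, make_iterable):
        # Keep a zero-argument callable so every pass gets a fresh generator.
        self.make_iterable = make_iterable

    def __iter__(self):
        return iter(self.make_iterable())


def make_samples():
    # Stand-in (assumption) for tar_file_expander(s3_dataset); yields integers.
    yield from range(10)


# DataLoader sees an IterableDataset, so it never asks for len() or indexing.
loader = DataLoader(GeneratorDataset(make_samples), batch_size=4)
batches = [batch.tolist() for batch in loader]
```

Passing a factory instead of the generator itself lets the loader be iterated more than once. With num_workers greater than zero, each worker would replay the full stream unless the data is split per worker.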
Please let us know if this resolves your issue.
Thanks. It works
We've updated the example file to include a more concrete demonstration of how to use the tar file expander with dataloaders.
s3torchconnector version
1.1.0
s3torchconnectorclient version
1.1.0
AWS Region
us-west-2
Describe the running environment
Running in EC2 m5.8xlarge Amazon Linux 2
What happened?
I am following the example notebook here: https://github.com/awslabs/s3-connector-for-pytorch/blob/main/examples/Getting%20s[…]ed%20with%20the%20Amazon%20S3%20Connector%20for%20PyTorch.ipynb
I need to modify it to load a dataset where multiple images are sharded in hundreds of .tar files. An example dataset is the laion-art dataset. I am using the following code:
But I get the error:
TypeError: object of type 'generator' has no len()
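The same message can be reproduced without S3 or PyTorch, which suggests the failure comes from something calling len() on a generator. This is a minimal sketch: expand is a hypothetical stand-in for a generator function such as a tar expander.

```python
def expand(shards):
    # Hypothetical stand-in for a generator function that flattens shards.
    for shard in shards:
        yield from shard


samples = expand([[1, 2], [3]])

try:
    # Generators can be iterated but do not define __len__.
    len(samples)
except TypeError as err:
    message = str(err)
    print(message)
```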