Open Woodr7 opened 3 weeks ago
Hi! Thanks for your contribution, and great first issue!
Hey @Woodr7
You could do something like this:

```python
from torchvision.datasets import ImageFolder
from litdata import optimize

# ImageFolder infers each image's class label from its parent folder name
dataset = ImageFolder("/teamspace/s3_connections/imagenet-tiny/train")

def fn(index):
    # Returns the (PIL image, class index) tuple for the given sample
    return dataset[index]

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(len(dataset))),
        output_dir="./optimized_imagenet_tiny/train",
        chunk_bytes="64MB",
    )
```
Yes, we will add more examples.
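For the training-side half of the question, here is a minimal sketch of reading the optimized output back. It assumes litdata's `StreamingDataset` / `StreamingDataLoader` API as shown in its README; `make_streaming_loader` is a hypothetical helper name, not part of litdata:

```python
def make_streaming_loader(data_dir: str, batch_size: int = 64):
    """Build a loader over a litdata-optimized directory.

    Hedged sketch: assumes litdata exposes StreamingDataset and
    StreamingDataLoader; adjust to the version you have installed.
    """
    from litdata import StreamingDataset, StreamingDataLoader

    # StreamingDataset reads the chunked output written by optimize()
    dataset = StreamingDataset(data_dir)
    # StreamingDataLoader is used in place of torch's DataLoader
    return StreamingDataLoader(dataset, batch_size=batch_size)
```

Usage would then replace the old `DataLoader` loop, e.g. `loader = make_streaming_loader("./optimized_imagenet_tiny/train")` followed by `for image, label in loader: ...` with the existing training step unchanged.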
🚀 Feature
Within the README there should be examples, or links to examples, of how to reformat a dataset, starting with imagenet-tiny, so that it works well with LitData. How can I take a file structure where each image sits in a folder named after its class and convert it so that, once processed with LitData, all of the relevant information is preserved in the new structure? Then, how do I need to change the code I used to train before in order to use the newly optimized LitData dataset?
Motivation
This is needed to make LitData self-serve. There is no good plain-English example of going from a simple, understandable dataset type and codebase to an optimized LitData dataset and the new codebase needed to use that dataset and train the same model 20x faster. We will see more adoption if such an example exists for as many dataset types as possible.
Pitch
Starting with the existing imagenet-tiny, it should show how you go from the current file structure to the file structure necessary to run ld.optimize while preserving all of the necessary information. Then show an example of how you need to change the training code in order to take advantage of the optimized cloud dataset.
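To make the class-per-folder convention concrete, here is a small stdlib-only sketch (a hypothetical helper, not part of litdata or torchvision) of the index such a layout implies; an `optimize()` call then only needs a `fn` that maps each index to its (image, label) sample:

```python
import os

def index_class_folders(root: str):
    """List (image_path, class_index) pairs from an ImageFolder-style tree.

    Assumed layout (the same convention torchvision's ImageFolder uses):
        root/<class_name>/<image_file>
    Class indices are assigned by sorting the class folder names.
    """
    classes = sorted(
        d for d in os.listdir(root)
        if os.path.isdir(os.path.join(root, d))
    )
    class_to_idx = {name: i for i, name in enumerate(classes)}

    samples = []
    for name in classes:
        folder = os.path.join(root, name)
        for fname in sorted(os.listdir(folder)):
            samples.append((os.path.join(folder, fname), class_to_idx[name]))
    return samples
```

This is essentially what `ImageFolder` computes internally, which is why the dataset needs no restructuring beyond the class-named folders before running `ld.optimize`.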