Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0

Clear Examples of use with different dataset types and code changes. #409

Open Woodr7 opened 3 weeks ago

Woodr7 commented 3 weeks ago

🚀 Feature

Within the README there should be examples, or links to examples, of how to reformat a dataset (starting with imagenet-tiny) so it works well with LitData. How can I take a file structure where each image sits in a folder named after its class, and change it so that, once processed with LitData, all of the relevant information is contained in the new structure? Then, how do I need to change the code I used to train before in order to use the newly optimized LitData dataset?

Motivation

This is needed to make LitData self-serve. There is no good plain-English example of going from one simple, understandable dataset type and codebase to an optimized LitData dataset and the new codebase needed to use that dataset and train the same model 20x faster. We will see more adoption if there is an example of this for as many dataset types as possible.

Pitch

Starting with the existing imagenet-tiny: show how you go from the current file structure to the file structure necessary to run ld.optimize while maintaining all of the necessary info. Then show an example of how you need to change the training code in order to take advantage of the optimized cloud dataset.
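To make the "maintain all of the necessary info" part concrete: in the class-per-folder layout, the label lives only in the directory name, so that mapping is exactly what has to be captured before optimization. A minimal stdlib sketch of that mapping (the function name `index_image_folder` is hypothetical; it mirrors how torchvision's `ImageFolder` derives class indices from sorted folder names):

```python
from pathlib import Path

def index_image_folder(root):
    """Map an ImageFolder-style tree (root/<class_name>/<image>) to
    (path, class_index) pairs -- the information that must survive
    the conversion to an optimized dataset."""
    root = Path(root)
    # Class indices follow sorted folder names, as ImageFolder does.
    classes = sorted(d.name for d in root.iterdir() if d.is_dir())
    class_to_idx = {name: i for i, name in enumerate(classes)}
    samples = [
        (str(path), class_to_idx[path.parent.name])
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]
    return samples, class_to_idx
```

Each `(path, class_index)` pair is then what a conversion step would serialize, so the optimized output no longer depends on the directory layout.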

github-actions[bot] commented 3 weeks ago

Hi! Thanks for your contribution, great first issue!

tchaton commented 3 weeks ago

Hey @Woodr7

You could do something like this:

from torchvision.datasets import ImageFolder
from litdata import optimize

# ImageFolder reads the class-per-folder layout as-is; each item is an
# (image, label) pair, so the class information is preserved.
dataset = ImageFolder("/teamspace/s3_connections/imagenet-tiny/train")

def fn(index):
    # Return the sample to store in the optimized dataset.
    return dataset[index]

if __name__ == "__main__":
    optimize(
        fn=fn,
        inputs=list(range(len(dataset))),
        output_dir="./optimized_imagenet_tiny/train",
        chunk_bytes="64MB",
    )

Yes, we will add more examples.
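For the training-side change the issue asks about, a hedged sketch of the reading code: `StreamingDataset` and `StreamingDataLoader` are litdata's reading APIs, but check the README for current parameters, and the `batch_size`/`num_workers` values here are illustrative only. It assumes the `optimized_imagenet_tiny` output produced above:

```python
from litdata import StreamingDataset, StreamingDataLoader

# Point the streaming dataset at the optimized output (a local path or
# a cloud URI such as s3://...). Each item is whatever fn returned
# during optimize() -- here, an (image, label) pair.
dataset = StreamingDataset("./optimized_imagenet_tiny/train")
dataloader = StreamingDataLoader(dataset, batch_size=64, num_workers=4)

for images, labels in dataloader:
    ...  # this loop replaces the old ImageFolder + DataLoader loop
```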