Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

Add utility to merge datasets together #190

Closed tchaton closed 3 days ago

tchaton commented 3 days ago
Before submitting

- [ ] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

This PR enables merging optimized datasets together.

Create 2 different datasets

from litdata import optimize, StreamingDataset

def compress(index):
    return index, index**2

if __name__ == "__main__":
    # Add some data
    optimize(
        fn=compress,
        inputs=list(range(100)),
        output_dir="/teamspace/s3_connections/laoin-400m/folder_1",
        chunk_bytes="64MB",
    )

from litdata import optimize, StreamingDataset

def compress(index):
    return index, index**2

if __name__ == "__main__":
    # Add some data
    optimize(
        fn=compress,
        inputs=list(range(100)),
        output_dir="/teamspace/s3_connections/laoin-400m/folder_2",
        chunk_bytes="64MB",
    )
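
To see what each dataset will hold before running `optimize`, the function can be applied to the inputs in plain Python. This is only a local preview, assuming `optimize` stores the value returned by `fn` for each input; it writes no chunks and touches no storage.

```python
def compress(index):
    # Same function as in the two scripts: one (index, square) pair per input.
    return index, index**2


inputs = list(range(100))

# Local preview only: apply the function to every input, as the scripts'
# `fn=compress, inputs=list(range(100))` arguments suggest.
samples = [compress(i) for i in inputs]

print(len(samples), samples[:3], samples[-1])
```

Both scripts use the same function and inputs, so `folder_1` and `folder_2` should end up with the same 100 samples.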

Merge them into a third one

from litdata import merge_datasets

merge_datasets(
    input_dirs=[
        "/teamspace/s3_connections/laoin-400m/folder_1",
        "/teamspace/s3_connections/laoin-400m/folder_2"
    ],
    output_dir="/teamspace/s3_connections/laoin-400m/folder_3"
)
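
As a rough mental model of what a merge like this involves, here is a standard-library sketch that combines chunked directories on local disk. It is not the code added in this PR. The `index.json` layout (`config` plus a list of `chunks`), the `chunk-N.bin` naming, and the helper names `merge_chunk_dirs`, `write_toy_dataset`, and `demo` are all assumptions made for illustration.

```python
import json
import shutil
import tempfile
from pathlib import Path


def merge_chunk_dirs(input_dirs, output_dir):
    """Copy every chunk into output_dir and write one combined index.

    Assumed layout (hypothetical, for this sketch only): each directory has
    an index.json shaped like {"config": {...}, "chunks": [{"filename": ...}]}.
    """
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    if (output / "index.json").exists():
        raise ValueError("output_dir already contains an index")

    indexes = [json.loads((Path(d) / "index.json").read_text()) for d in input_dirs]
    config = indexes[0]["config"]
    if any(index["config"] != config for index in indexes[1:]):
        raise ValueError("datasets were written with different configs")

    merged_chunks = []
    for directory, index in zip(input_dirs, indexes):
        for chunk in index["chunks"]:
            # Rename on copy so chunks from different inputs cannot collide.
            new_name = f"chunk-{len(merged_chunks)}.bin"
            shutil.copyfile(Path(directory) / chunk["filename"], output / new_name)
            merged_chunks.append({**chunk, "filename": new_name})

    merged_index = {"config": config, "chunks": merged_chunks}
    (output / "index.json").write_text(json.dumps(merged_index))
    return merged_chunks


def write_toy_dataset(directory, n_chunks):
    """Create a tiny fake dataset in the assumed layout (one byte per chunk)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    chunks = []
    for i in range(n_chunks):
        name = f"chunk-{i}.bin"
        (directory / name).write_bytes(bytes([i]))
        chunks.append({"filename": name, "chunk_size": 1})
    index = {"config": {"compression": None}, "chunks": chunks}
    (directory / "index.json").write_text(json.dumps(index))


def demo():
    """Merge a 2-chunk and a 3-chunk toy dataset; return the merged filenames."""
    with tempfile.TemporaryDirectory() as root:
        write_toy_dataset(f"{root}/folder_1", 2)
        write_toy_dataset(f"{root}/folder_2", 3)
        merged = merge_chunk_dirs(
            [f"{root}/folder_1", f"{root}/folder_2"], f"{root}/folder_3"
        )
        return [chunk["filename"] for chunk in merged]


print(demo())
```

The sketch leaves the inputs untouched and refuses to write into a directory that already has an index, which seems like the safer default for a merge utility. Whether `merge_datasets` behaves the same way should be checked against the PR diff.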

What does this PR do?

Fixes # (issue).

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃