Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0
249 stars 24 forks

Feat: Append data to pre-optimize dataset #180

Closed deependujha closed 5 days ago

deependujha commented 1 week ago
Before submitting

- [x] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [x] Did you write any new necessary tests?

What does this PR do?

Fixes #23

A test to understand this feature best is:

# tests/processing/test_functions.py

from typing import Tuple

import pytest

from litdata import StreamingDataset, optimize


def test_optimize_function_modes(tmpdir):
    output_dir = str(tmpdir.mkdir("output"))

    def compress(index: int) -> Tuple[int, int]:
        return (index, index**2)

    def different_compress(index: int) -> Tuple[int, int, int]:
        return (index, index**2, index**3)

    # default mode (None): write a fresh dataset
    optimize(
        fn=compress,
        inputs=list(range(1, 101)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 100
    assert my_dataset[:] == [(i, i**2) for i in range(1, 101)]

    # append mode: new samples are added after the existing ones
    optimize(
        fn=compress,
        mode="append",
        inputs=list(range(101, 201)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 200
    assert my_dataset[:] == [(i, i**2) for i in range(1, 201)]

    # overwrite mode: the existing dataset is replaced
    optimize(
        fn=compress,
        mode="overwrite",
        inputs=list(range(201, 351)),
        output_dir=output_dir,
        chunk_bytes="64MB",
    )

    my_dataset = StreamingDataset(output_dir)
    assert len(my_dataset) == 150
    assert my_dataset[:] == [(i, i**2) for i in range(201, 351)]

    # failing case: appending/overwriting with an incompatible item format
    with pytest.raises(ValueError, match="The config of the optimized dataset is different from the original one."):
        optimize(
            fn=different_compress,
            mode="overwrite",
            inputs=list(range(201, 351)),
            output_dir=output_dir,
            chunk_bytes="64MB",
        )

PR review

Anyone in the community is free to review the PR once the tests have passed. If your PR wasn't discussed in a GitHub issue, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov[bot] commented 1 week ago

Codecov Report

Attention: Patch coverage is 90.76923% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@d5eff39). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##             main     #180   +/-   ##
=======================================
  Coverage        ?      78%
=======================================
  Files           ?       33
  Lines           ?     4380
  Branches        ?        0
=======================================
  Hits            ?     3410
  Misses          ?      970
  Partials        ?        0
```
deependujha commented 1 week ago

This still needs to be tested against S3; setting up an S3 connection with a Lightning AI Studio requires a pro account.

But if it works, one improvement can still be made:

When incompatible (different-config) datasets are appended or overwritten, the current behavior is to run all the processing first and only attempt the merge/overwrite at the end, where it fails.

A better approach would be to check for compatibility up front, before any work is done.
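As a rough sketch of that up-front check (the helper name and config keys here are illustrative assumptions, not litdata's actual API or index schema):

```python
# Hypothetical sketch of an up-front compatibility check: compare the
# existing dataset's stored config with the one the new run would produce
# *before* processing any inputs, so an incompatible append/overwrite
# fails fast instead of after all chunks are written.
# The key names below are illustrative, not litdata's real index.json schema.

def check_config_compatibility(existing_config: dict, new_config: dict) -> None:
    """Raise early if the new run would produce an incompatible dataset."""
    if existing_config != new_config:
        raise ValueError(
            "The config of the optimized dataset is different from the original one."
        )


# Example: the same item format passes, a different arity is rejected early.
old = {"data_format": ["int", "int"], "compression": None}
check_config_compatibility(old, {"data_format": ["int", "int"], "compression": None})

try:
    check_config_compatibility(
        old, {"data_format": ["int", "int", "int"], "compression": None}
    )
except ValueError as e:
    print(e)  # raised before any chunk would be written
```

The same error message as the late check could be reused, so existing tests matching on it would still pass.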


Also, the tests pass in the Lightning Studio; I don't know why they fail here on macOS and Windows.