Lightning-AI / litdata

Transform datasets at scale. Optimize datasets for fast AI model training.
Apache License 2.0
334 stars, 39 forks

compatibility with downloading data from gcp #154

Closed dangthatsright closed 3 months ago

dangthatsright commented 3 months ago

Similar to the existing S3 data downloading, this adds support for downloading data from GCP. I've tested it with my own data, but I'm not sure how I would set up tests for this repo.

The link to the contributor guideline is dead btw.

If there are specific places you would like me to update, please let me know.

Before submitting

- [ ] Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
- [ ] Did you read the [contributor guideline](https://github.com/Lightning-AI/lit-data/blob/main/.github/CONTRIBUTING.md), Pull Request section?
- [ ] Did you make sure to update the docs?
- [ ] Did you write any new necessary tests?

What does this PR do?

Fixes #94 #101

PR review

Anyone in the community is free to review the PR once the tests have passed. If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

codecov[bot] commented 3 months ago

Codecov Report

Attention: Patch coverage is 89.28571% with 3 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@6519a98). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main    #154   +/-   ##
=====================================
  Coverage        ?     77%
=====================================
  Files           ?      30
  Lines           ?    4124
  Branches        ?       0
=====================================
  Hits            ?    3184
  Misses          ?     940
  Partials        ?       0
```

tchaton commented 3 months ago

Hey @dangthatsright. I have fixed the failure in the test. When you import an optional package, you need to make sure the logic is protected by a boolean to avoid failures when it is not available.

I wondered if you would be keen on contributing the same feature for the optimize method, so users can generate their data and push it directly to GCS.

Something like this:


```python
optimize(
    ....
    output_dir="gcs://....",
)
```
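
As a rough illustration of what supporting such an `output_dir` could involve, the sketch below splits a cloud URL into scheme, bucket, and key prefix using only the standard library. The helper name `split_cloud_url`, the accepted scheme set, and the example URL are assumptions made for illustration, not litdata's actual implementation.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class CloudPath(NamedTuple):
    scheme: str  # e.g. "s3" or "gcs"
    bucket: str  # bucket name, taken from the URL's network location
    prefix: str  # key prefix inside the bucket, without a leading slash


# Hypothetical set of schemes; the thread mentions S3 and "gcs://".
SUPPORTED_SCHEMES = {"s3", "gs", "gcs"}


def split_cloud_url(url: str) -> CloudPath:
    """Split a URL like `scheme://bucket/some/prefix` into its parts."""
    parsed = urlparse(url)
    if parsed.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"Unsupported scheme {parsed.scheme!r} in {url!r}")
    if not parsed.netloc:
        raise ValueError(f"Missing bucket name in {url!r}")
    return CloudPath(parsed.scheme, parsed.netloc, parsed.path.lstrip("/"))


# Example with a made-up bucket and prefix.
path = split_cloud_url("gcs://my-bucket/datasets/optimized")
```

A writer for each backend could then be selected from `path.scheme`, keeping the upload logic for S3 and GCS separate.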

Also, feel free to join our Discord: https://discord.gg/fVrkqu5g. We have a channel called #litdata to talk about the future of the library.