Lightning-AI / litdata

Streamline data pipelines for AI. Process datasets across 1000s of machines, and optimize data for blazing fast model training.
Apache License 2.0

train_test_split doesn't support split of 0.0 #182

Closed robmarkcole closed 4 days ago

robmarkcole commented 5 days ago

🐛 Bug

The check 0 < _f <= 1 fails for a value of 0.0.

To Reproduce

Pass a value of 0.0 as a split to train_test_split

train_test_split(dataset, splits=[0.0, 0.0, 1.0])
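The rejection can be seen without litdata by evaluating the same condition on these splits. This is only a minimal sketch: `splits_are_valid` is a hypothetical helper that mirrors the check named in this report, not a function from the library.

```python
def splits_are_valid(splits):
    # Hypothetical helper: mirrors the strict condition reported in this issue.
    # 0 < 0.0 is False, so any split of exactly 0.0 fails the check.
    return all(0 < _f <= 1 for _f in splits)


print(splits_are_valid([0.0, 0.0, 1.0]))     # False
print(splits_are_valid([0.01, 0.01, 0.98]))  # True
```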

Expected behavior

I can have a split of 0.0

Environment

Master

deependujha commented 5 days ago

Hi @robmarkcole, thanks for pointing out the issue.

I'm working on another issue. When the PR is ready to be merged, I'll try to fix this issue too. I don't think fixing this will require much work to be done.

deependujha commented 4 days ago

Btw, why would someone even want a split of 0.0?

Even [0.01, 0.01, 0.98] makes sense, and it works fine.

If I remember correctly, Luca added the condition for each split to be greater than 0, while reviewing the PR.

if not all(0 < _f <= 1 for _f in splits):
    raise ValueError("Each Split should be a float with each value in [0,1].")

robmarkcole commented 4 days ago

I have a single dataset and typically split it randomly. However, I also sometimes want to just test on it, so the test weighting is 100%.

deependujha commented 4 days ago

Okay, I was thinking that just updating the check to all(0 <= _f <= 1 for _f in splits) would do the job, but I also need to make some changes internally.
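As a rough illustration of the relaxed check, here is a minimal sketch. The thread does not say what the internal changes are, so `split_lengths`, its rounding rule, and the sum check are all assumptions for illustration rather than litdata's implementation. It only shows that a fraction of 0.0 then produces an empty subset, which downstream code would have to tolerate.

```python
def split_lengths(n_items, splits):
    # Hypothetical helper, not part of litdata.
    # Relaxed validation: 0.0 is now an allowed fraction.
    if not all(0 <= _f <= 1 for _f in splits):
        raise ValueError("Each split should be a float in [0, 1].")
    # Assumed extra constraint: the fractions must not exceed the whole dataset.
    if sum(splits) > 1 + 1e-9:
        raise ValueError("Splits should sum to at most 1.")
    # Truncate to whole items; a fraction of 0.0 gives a length of 0.
    return [int(n_items * _f) for _f in splits]


print(split_lengths(100, [0.0, 0.0, 1.0]))  # [0, 0, 100]
```

With lengths like these, the first two subsets are empty and the third holds the whole dataset, which matches the test-only use case described above.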

I'll try to fix it as soon as possible. By the way, if you've used train_test_split and encountered any other issues, please mention them here in the same thread. It'll be easier to group them and fix them all at once.