datamol-io / splito

Machine Learning dataset splitting for life sciences.
https://splito-docs.datamol.io/
Apache License 2.0
23 stars 2 forks source link

Add Lo splitter #9

Closed SteshinSS closed 5 months ago

SteshinSS commented 7 months ago

Hi, all. @cwognum suggested adding Lo-Hi splitter into Splito. Here is the first part of it.

Changelogs

If the code seems ok, I'll add documentation and update this PR.


Checklist:


SteshinSS commented 7 months ago

Also, comment please on the code style. I didn't use any formatter yet.

SteshinSS commented 7 months ago

Thank you for the thorough review. The code is very complex, so any way to simplify it would be helpful. I'll return with the revision.

SteshinSS commented 7 months ago

Changes in the new commit:

Please review this version, and if it looks fine, I will include documentation in this PR.

SteshinSS commented 6 months ago

Thanks, guys, for the review! It seems that the PR has mostly converged. In this case, I'm going to add documentation to this PR.

Please check my old tutorial on the Lo splitter. I'm going to add more details, remove the k-fold examples, and rewrite the code using Splito and datamol datasets. What do you think about that? Is there anything that you would like to see in the documentation? Is there anything that could be clearer?

cwognum commented 6 months ago

Hi @SteshinSS, thanks for the nice work so far! As to the documentation, I would focus on the why rather than on the what. For a large part I find the code self-explanatory and won't need comments that explain me what's happening. Instead, I would like to understand why it has been implemented in a certain way and - zooming out - what the reasoning is behind this splitting method to begin with. You can of course always refer to your paper, but I would assume people have not read it.

SteshinSS commented 6 months ago

Done :)

cwognum commented 6 months ago

@SteshinSS Maybe worth saying that I didn't test the suggested changes above. It's from the top of my head.

SteshinSS commented 5 months ago

I just checked, and I'm not sure we want to add LoSplitter into train_test_split().

There are many splitters with a similar interface, accepting test_size and seed as parameters and returning an iterator. This function allows them to work in the same manner, which might be beneficial. However, LoSplitter:

If we add LoSplitter to the train_test_split(), we will need:

As an alternative, train_test_split() could take **kwargs and pass them into splitters as _ = train_test_split(X, y, method, **kwargs). However, this approach kinda make the train_test_split() pointless because users could simply use splitters directly.

Have you considered a functional approach? For example, splito could offer not only classes but also pure functions for splitting, similar to torch.nn.functional

cwognum commented 5 months ago

I like the idea of a functional module! That's indeed maybe the better way to go. Thanks for the elaborate suggestion and considerations!

However, I don't want to scope creep this PR too much, so could you maybe open an issue with your idea and then we can merge this PR! Thanks again 🙏

cwognum commented 5 months ago

Please just make sure the code is properly formatted with Ruff!

SteshinSS commented 5 months ago

I believe it could be merged :)