Hi @julianesiebourg, this relates to your group shuffle use case and #51.
I wanted to capture the directions in which we could improve the library to accommodate this case.
Do you think this is a common problem? Intuitively, I would expect that keeping replicates distributed across batches is usually the better strategy.
We could introduce a simple scoring function that penalizes separated samples and accept only strict improvements. The downside is that it would make shuffling very difficult. For example, if you want samples 1 and 2 moved from batch X to batch Y, then a) we would need to shuffle at least two samples in one iteration (`n_shuffle >= 2`), b) samples 1 and 2 would have to be chosen together at random (quite unlikely), and c) both would have to land in the same destination batch (probability `1/n_batches`). So shuffling would probably be very slow.
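To make the difficulty concrete, here is a minimal sketch of such a penalty in Python. The function name `separation_penalty` and the list-based data layout are assumptions for illustration only, not part of the library.

```python
from collections import defaultdict

def separation_penalty(batches, groups):
    """Penalty for groups whose samples are spread over several batches.

    batches[i] is the batch of sample i and groups[i] is its group label.
    Each group contributes (number of distinct batches it occupies - 1),
    so a layout in which every group sits in one batch scores 0.
    """
    batches_per_group = defaultdict(set)
    for batch, group in zip(batches, groups):
        batches_per_group[group].add(batch)
    return sum(len(b) - 1 for b in batches_per_group.values())

groups = ["a", "a", "b", "b"]
start = ["X", "X", "Y", "Y"]      # every group is kept together
one_moved = ["Y", "X", "Y", "Y"]  # only sample 0 moved from X to Y

print(separation_penalty(start, groups))      # 0
print(separation_penalty(one_moved, groups))  # 1
```

Moving a single sample of a group raises the penalty, so under strict improvement that move is rejected. A group can only change batch if all of its samples are picked and sent to the same destination in one iteration.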
A constrained shuffle procedure could be a solution. We could try to generalize the example I shared with you, basically by specifying which column holds the sample group. The difficulty is that we could run into a pathological configuration from which there is no way out without breaking the constraint.
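As a rough sketch of what a group-aware move could look like (this is not the example referred to above, and `swap_groups` is a hypothetical name, not a library function), one option is to swap two whole groups of equal size. It assumes every group currently sits in a single batch.

```python
import random
from collections import defaultdict

def swap_groups(batches, groups, rng=random):
    """Propose one move that swaps two whole, equally sized groups.

    Assumes each group currently occupies exactly one batch. Returns a
    new batch assignment, or None when no valid swap exists.
    """
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    labels = sorted(members)
    # Candidates: same size (batch sizes stay fixed) and different batches.
    pairs = [
        (g, h)
        for k, g in enumerate(labels)
        for h in labels[k + 1:]
        if len(members[g]) == len(members[h])
        and batches[members[g][0]] != batches[members[h][0]]
    ]
    if not pairs:
        return None
    g, h = rng.choice(pairs)
    batch_g, batch_h = batches[members[g][0]], batches[members[h][0]]
    new = list(batches)
    for i in members[g]:
        new[i] = batch_h
    for i in members[h]:
        new[i] = batch_g
    return new

# Groups "a" and "b" have two samples each, so they can trade batches.
print(swap_groups(["X", "X", "Y", "Y", "Y"], ["a", "a", "b", "b", "c"]))
# A group of two and a group of one cannot be swapped: no move is left.
print(swap_groups(["X", "X", "Y"], ["a", "a", "c"]))
```

The second call returns `None`, which is one simple way such a procedure can get stuck: with fixed batch sizes and unequal group sizes, there may be no move that keeps the groups intact.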
At this stage I think we should just capture this. If you already have an idea of how frequent this is and what other types of group shuffle we might need, that would be great.