Closed: kat-leen closed this issue 2 years ago
Investigation needed to see if this is related to the weird data distribution in reference scenario 1, where partner 1 gets a few samples of the 8th class.
OK, so I think I can explain both weird observations. They actually seem to be expected, although I confess this is not ideal:

The observation in the description of this issue: with the `AdvancedSplitter`, it tries to apply the desired split while at the same time enforcing the `amounts_per_partner` ratios. This makes things a bit complicated. At some point, the precision of the rounding depends on the absolute number of samples available. Example: 0.0888 * 6000 = 533.28, which gives 533 samples.

The first partner getting some samples of the 8th class in the 1st ref scenario: the `StratifiedSplitter` is dumb, it just splits the data according to the `amounts_per_partner` values (0.7 and 0.3 in our example) using `train_test_split` (`from sklearn.model_selection import train_test_split`). With `train_test_split`, we don't have a guarantee that there is going to be exactly the same number of samples per class (i.e. 5400). So we might have slightly fewer in the first 7 classes, and have to pick some from the 8th one.

One improvement would be to rewrite the `StratifiedSplitter` as a pre-configured `FlexibleSplitter`, which would handle all this much more cleanly.

Does this clarify things @Meylina ? cc @JustineBoulant @RomainGoussault @arthurPignet @SaboniAmine @HeytemBou @Thomas-Galtier @celinejacques @jeromechambost
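To make the `train_test_split` point concrete, here is a small standalone sketch (not the mplc code; the dataset sizes are made up) showing that a stratified split does not give every class the exact same number of train samples when the ratio doesn't divide the class sizes evenly:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 8 balanced classes of 771 samples each; 0.7 * 771 = 539.7, which cannot
# be honored exactly for every class at once.
y = np.repeat(np.arange(8), 771)
X = np.zeros((len(y), 1))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0
)
counts = np.bincount(y_tr)
print(counts)  # some classes end up with 539 train samples, others with 540
```

The stratification distributes the rounding remainder across classes, so which classes get the extra sample varies with the random state.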
Thank you @bowni for both explanations.
Should we not merge the `StratifiedSplitter` and the `AdvancedSplitter`, using only the `samples_split_configuration` of the `AdvancedSplitter` for inferring the `amounts_per_partner`?
Yes, we could; this would already be better 😃 However, `AdvancedSplitter` is quite complex, and is itself a specific case of the most generic `FlexibleSplitter`. Thus I believe that if we engage in rewriting some splitters, we should rewrite them using `FlexibleSplitter`.
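For reference, here is a rough standalone sketch (hypothetical, not the actual mplc `FlexibleSplitter` API) of what the fully generic case boils down to: a matrix of per-partner, per-class fractions, with the per-class rounding made explicit:

```python
import numpy as np

def flexible_split(y, fractions, seed=0):
    """Distribute sample indices among partners.

    `fractions` is a (n_partners, n_classes) matrix: fractions[p, c] is the
    share of class c that partner p receives. Hypothetical sketch, not the
    actual mplc implementation.
    """
    rng = np.random.default_rng(seed)
    n_partners, n_classes = fractions.shape
    partners = [[] for _ in range(n_partners)]
    for c in range(n_classes):
        idx = rng.permutation(np.flatnonzero(y == c))
        # Rounding happens here: each partner gets floor(fraction * class size),
        # so a few samples per class can be left unassigned.
        counts = np.floor(fractions[:, c] * len(idx)).astype(int)
        start = 0
        for p, n in enumerate(counts):
            partners[p].extend(idx[start:start + n])
            start += n
    return [np.array(p) for p in partners]

# A StratifiedSplitter-like 0.7/0.3 split is then just a pre-configured matrix:
y = np.repeat(np.arange(3), 100)
parts = flexible_split(y, np.array([[0.7] * 3, [0.3] * 3]))
print([len(p) for p in parts])  # [210, 90]
```

In this framing, both the stratified and the advanced cases are just particular fraction matrices, which is presumably why rewriting them on top of the generic splitter looks attractive.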
When passing a fraction with a non-terminating decimal expansion (e.g. 0.8/9) in the `amounts_per_partner` parameter, the displayed rounded values are not very close to the true values, and their sum is not always equal to 1.
For example:
`amounts_per_partner=[0.8/9.0]*9 + [0.1]*2`

returns:

```
Partners' relative number of samples : [0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.1, 0.1] (versus initially configured: [0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.1, 0.1])
```

The rounded fractions sum up to 1.01.
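Assuming the displayed values are simply the configured fractions rounded to two decimals (a guess from the output above, not verified against the code), the behaviour reproduces like this:

```python
amounts = [0.8 / 9.0] * 9 + [0.1] * 2
# The configured fractions themselves do sum to 1 (up to float error):
print(abs(sum(amounts) - 1.0) < 1e-12)  # True

# But rounding each fraction independently to 2 decimals drifts the sum upward:
rounded = [round(a, 2) for a in amounts]
print(rounded)                   # nine 0.09s and two 0.1s
print(round(sum(rounded), 10))   # 1.01
```

So the "sum is not 1" symptom comes from the display rounding, not from the configured amounts being invalid; rounding to more decimals (or printing exact fractions) would avoid the confusion.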