LabeliaLabs / distributed-learning-contributivity

Simulate collaborative ML scenarios, experiment multi-partner learning approaches and measure respective contributions of different datasets to model performance.
https://www.labelia.org
Apache License 2.0

Approximation of amounts_per_partner #341

Closed kat-leen closed 2 years ago

kat-leen commented 3 years ago

When the amounts_per_partner parameter is given fractions with a non-terminating decimal expansion, the displayed rounding is not very close to the true value and the sum is not always equal to 1.

For example: `amounts_per_partner=[0.8/9.0]*9 + [0.1]*2` returns: Partners' relative number of samples: [0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.09, 0.1, 0.1] (versus the initially configured [0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.08888888888888889, 0.1, 0.1]). The displayed fractions sum up to 1.01.
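
To illustrate, here is a minimal standalone sketch (not the library's code) reproducing the displayed values: rounding each configured fraction to 2 decimals gives the list above, whose sum is 1.01, while the configured fractions themselves still sum to 1 up to float precision.

```python
# Minimal sketch (not the library's code): reproduce the displayed rounding.
configured = [0.8 / 9.0] * 9 + [0.1] * 2

# Round each fraction to 2 decimals, as in the reported output
displayed = [round(f, 2) for f in configured]

print(displayed)                   # [0.09, 0.09, ..., 0.09, 0.1, 0.1]
print(round(sum(displayed), 4))    # 1.01 -> the "sum > 1" comes from the display rounding
print(round(sum(configured), 10))  # 1.0  -> the configured fractions do sum to 1
```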

bowni commented 3 years ago

Investigation needed to see if this is related to the weird data distribution in reference scenario 1, where partner 1 gets a few samples of the 8th class.

bowni commented 3 years ago

OK, so I think I can explain both weird observations. They actually seem to be expected behaviour, although I confess this is not ideal:

  1. The observation in the description of this issue:

    • When using the AdvancedSplitter, it tries to apply the desired split while at the same time enforcing the amounts_per_partner ratios. This makes things a bit complicated. At some point, the precision of the rounding depends on the absolute number of samples available. Example: 0.08888 * 6000 = 533.28, which gives 533 samples.
    • One thing that could be made clearer is the final value reported once the split is done: it is rounded to 2 decimals, which is not ideal.
    • So in the end, there doesn't seem to be a rounding error affecting the split itself, only an insufficiently precise rounding when displaying the final split obtained.
  2. The first partner getting some samples of the 8th class in the 1st ref scenario:

    • The StratifiedSplitter is dumb, it just:
      1. takes the amounts_per_partner values (0.7 and 0.3 in our example)
      2. computes a cumulated sum (as we only have 2 partners, this only gives [0.7])
      3. multiplies it by the number of samples in the train dataset (54000 with MNIST after 10% is removed for the validation dataset, which gives `0.7 * 54000 = 37800`)
      4. orders the samples per class
      5. picks the 37800 first ones
    • As we train_test_split the dataset to keep a validation set aside, and this is done with `from sklearn.model_selection import train_test_split`, we don't have a guarantee that there will be exactly the same number of samples per class (i.e. 5400). So we might have slightly fewer samples in the first 7 classes, and have to pick some from the 8th one (see the sketch after this list)
    • This could be improved by re-writing the StratifiedSplitter as a pre-configured FlexibleSplitter, which would handle all this much more cleanly
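
For reference, here is a simplified sketch of the logic described in point 2. The function name and details are illustrative only, not the actual StratifiedSplitter code:

```python
import numpy as np

def stratified_split(y_train, amounts_per_partner):
    """Toy version of the stratified split described above (illustrative only)."""
    n_samples = len(y_train)                         # e.g. 54000 for MNIST once 10% is held out
    # Cumulative sum of the ratios, dropping the last value (which is 1)
    cumsum = np.cumsum(amounts_per_partner)[:-1]     # e.g. [0.7] for ratios [0.7, 0.3]
    # The truncation here is where precision depends on the absolute number of samples
    split_points = (cumsum * n_samples).astype(int)  # e.g. [37800]
    ordered_idx = np.argsort(y_train)                # samples ordered per class
    return np.split(ordered_idx, split_points)       # partner 1 gets the 37800 first ones

# Because train_test_split does not guarantee exactly 5400 samples per class in the
# train set, the 37800th ordered sample can already belong to the 8th class, which is
# why partner 1 ends up with a few samples of that class.
```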

Does this clarify things @Meylina ? cc @JustineBoulant @RomainGoussault @arthurPignet @SaboniAmine @HeytemBou @Thomas-Galtier @celinejacques @jeromechambost

kat-leen commented 3 years ago

Thank you @bowni for both explanations.

Shouldn't we merge the StratifiedSplitter and the AdvancedSplitter, using only the samples_split_configuration of the AdvancedSplitter to infer the amounts_per_partner?

bowni commented 3 years ago

Yes we could, that would already be better 😃 However the AdvancedSplitter is quite complex, and is itself a specific case of the more generic FlexibleSplitter. Thus I believe that if we engage in rewriting some splitters, we should rewrite them as pre-configured uses of FlexibleSplitter.