LabeliaLabs / distributed-learning-contributivity

Simulate collaborative ML scenarios, experiment multi-partner learning approaches and measure respective contributions of different datasets to model performance.
https://www.labelia.org
Apache License 2.0

Titanic example fails if minibatch contains only one class #257

Closed RomainGoussault closed 3 years ago

RomainGoussault commented 4 years ago

When running the titanic example with this config file: https://github.com/SubstraFoundation/distributed-learning-contributivity/blob/3ab411b6e79fcb34e71294b8c6a2ae98bdf1f8c7/tests/config_end_to_end_test_titanic.yml, it fails (see stacktrace below).

The issue is that the data we fit sometimes contains only one class, and the sklearn solver requires samples from at least 2 classes: https://stackoverflow.com/questions/40524790/valueerror-this-solver-needs-samples-of-at-least-2-classes-in-the-data-but-the

Note that in this specific example (dataset_proportion=0.2 and minibatch_count=10), the batch size is 4, which is very small.
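To give a rough sense of why such a small batch is risky, here is a minimal sketch that estimates how often a binary-label minibatch ends up with a single class. It assumes samples are drawn independently and that about 38% of rows belong to the positive class; both the `single_class_probability` helper and the 0.38 rate are illustrative assumptions, not values taken from this run.

```python
def single_class_probability(batch_size: int, positive_rate: float) -> float:
    """Probability that an i.i.d. minibatch of binary labels contains only one class."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not 0.0 <= positive_rate <= 1.0:
        raise ValueError("positive_rate must be in [0, 1]")
    negative_rate = 1.0 - positive_rate
    # Either every sample is positive, or every sample is negative.
    return positive_rate ** batch_size + negative_rate ** batch_size


# Assumed positive rate of 0.38 (hypothetical, for illustration only).
for size in (4, 16, 64):
    print(size, round(single_class_probability(size, 0.38), 4))
```

Under these assumptions a batch of 4 is single-class roughly 17% of the time, so with 10 minibatches per epoch and several partners, hitting one is likely.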


2020-10-19 23:39:19.988 | INFO     | mplc.multi_partner_learning:compute_collaborative_round_fedavg:304 - (fedavg) Minibatch n°6 of epoch n°1, init aggregated model for each partner with models from previous round
2020-10-19 23:39:19.989 | ERROR    | __main__:<module>:115 - An error has been caught in function '<module>', process 'MainProcess' (73989), thread 'MainThread' (4422049216):
Traceback (most recent call last):

> File "main.py", line 115, in <module>
    main()
    └ <function main at 0x1156ee170>

  File "main.py", line 78, in main
    current_scenario.run()
    │                └ <function Scenario.run at 0x14afb37a0>
    └ <mplc.scenario.Scenario object at 0x149a5ba50>

  File "/Users/rgoussault/substra/distributed-learning-contributivity/mplc/scenario.py", line 866, in run
    self.mpl.compute_test_score()
    │    │   └ <function MultiPartnerLearning.compute_test_score at 0x125fa0290>
    │    └ <mplc.multi_partner_learning.MultiPartnerLearning object at 0x14fbff810>
    └ <mplc.scenario.Scenario object at 0x149a5ba50>

  File "/Users/rgoussault/substra/distributed-learning-contributivity/mplc/multi_partner_learning.py", line 165, in compute_test_score
    self.compute_collaborative_round_fedavg()
    │    └ <function MultiPartnerLearning.compute_collaborative_round_fedavg at 0x14673df80>
    └ <mplc.multi_partner_learning.MultiPartnerLearning object at 0x14fbff810>

  File "/Users/rgoussault/substra/distributed-learning-contributivity/mplc/multi_partner_learning.py", line 324, in compute_collaborative_round_fedavg
    partner_model, train_data_for_fit_iteration, self.val_data, partner.batch_size)
    │              │                             │    │         │       └ 1
    │              │                             │    │         └ <mplc.partner.Partner object at 0x14b116a50>
    │              │                             │    └ (array([[436, False, 31.0, 10.5, 0, 37, True, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    │              │                             │              0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0],
    │              │                             │             [60...
    │              │                             └ <mplc.multi_partner_learning.MultiPartnerLearning object at 0x14fbff810>
    │              └ (array([[92, False, 26.0, 20.575, 3, 22, False, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    │                        0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
    │                       [...
    └ LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                         intercept_scaling=1, l1_ratio...

  File "/Users/rgoussault/substra/distributed-learning-contributivity/mplc/multi_partner_learning.py", line 578, in fit_model
    history = model_to_fit.fit(x_train, y_train)
              │            │   │        └ array([0, 0, 0, 0])
              │            │   └ array([[92, False, 26.0, 20.575, 3, 22, False, 0, 0, 0, 0, 0, 0, 0, 0, 0,
              │            │             0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
              │            │            [9...
              │            └ <function LogisticRegression.fit at 0x125f849e0>
              └ LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                                   intercept_scaling=1, l1_ratio...

  File "/Users/rgoussault/env/contrib/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py", line 1558, in fit
    " class: %r" % classes_[0])
                   └ array([0])

ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0

@arthurPignet @bowni

arthurPignet commented 4 years ago

This means that we cannot create scenarios where each partner owns specific data. It would be great to add a warning, an error, and/or a sentence about that in the doc.
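As a sketch of what such a warning could look like, the hypothetical helper below checks the labels of a minibatch before fitting and warns when fewer than two classes are present. The name `check_minibatch_classes` and its signature are assumptions for illustration; nothing like it is known to exist in mplc.

```python
import warnings
from collections import Counter
from typing import Hashable, Iterable


def check_minibatch_classes(labels: Iterable[Hashable], min_classes: int = 2) -> bool:
    """Return True if `labels` cover at least `min_classes` distinct classes, else warn and return False."""
    counts = Counter(labels)
    if len(counts) < min_classes:
        warnings.warn(
            f"Minibatch contains only {len(counts)} class(es) {list(counts)}; "
            f"the solver needs at least {min_classes}.",
            RuntimeWarning,
        )
        return False
    return True


# Labels from the failing minibatch in the trace: array([0, 0, 0, 0]).
if not check_minibatch_classes([0, 0, 0, 0]):
    print("skipping fit for this minibatch")
```

Whether to skip the minibatch, raise an error, or only log the warning is a design choice left open here.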

bowni commented 3 years ago

I'll add a note in the documentation @RomainGoussault. I don't think it should be "fixed".