adapt-python / adapt

Awesome Domain Adaptation Python Toolbox
https://adapt-python.github.io/adapt/
BSD 2-Clause "Simplified" License

TrAdaBoost random data selection issue #96

Closed · WeGlove closed this issue 1 year ago

WeGlove commented 1 year ago

Hello everyone,

I have run into the following problem. I have labeled source data X, y and labeled target data Xt, yt, and I would like to use TrAdaBoost in combination with a scikit-learn RandomForest.
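
Roughly, my setup looks like this (a simplified sketch; the actual data loading is omitted and placeholder arrays are used here):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from adapt.instance_based import TrAdaBoost

# Simplified sketch of my setup; the real X, y, Xt, yt come from my own
# 5-class dataset (placeholder random arrays here).
X, y = np.random.randn(400, 10), np.random.randint(0, 5, size=400)
Xt, yt = np.random.randn(80, 10), np.random.randint(0, 5, size=80)

model = TrAdaBoost(RandomForestClassifier(), n_estimators=10, Xt=Xt, yt=yt)
model.fit(X, y)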

This works most of the time; however, TrAdaBoost sometimes crashes with the following stack trace:

Traceback (most recent call last):
  ...
  File "...\adapt\instance_based\_tradaboost.py", line 256, in fit
    sample_weight_src, sample_weight_tgt = self._boost(
  File "...\adapt\instance_based\_tradaboost.py", line 331, in _boost
    error_vect_src = np.abs(ys_pred - ys).sum(tuple(range(1, ys.ndim))) / 2.
ValueError: operands could not be broadcast together with shapes (80,4) (80,5)

I have done some digging and was able to pin it down to the base.py module, specifically the fit_estimator function of BaseAdaptEstimator. This section seems to be the problem:

sample_weight = check_sample_weight(sample_weight, X)
sample_weight /= sample_weight.sum()
bootstrap_index = np.random.choice(
    len(X), size=len(X), replace=True,
    p=sample_weight)
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    self.estimator_.fit(X[bootstrap_index],
                        y[bootstrap_index],
                        **fit_params)

What seems to happen is that np.random.choice sometimes draws a bootstrap sample that contains no data for one of the classes. The estimator is then fitted on only 4 of my 5 classes, so its predictions have one column too few, which leads to the shape mismatch above.
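
Here is a toy illustration of what I think is going on (a standalone hypothetical example, not adapt's actual code path):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(80, 3)
y = np.r_[np.zeros(76, dtype=int), np.arange(1, 5)]  # classes 1-4 are very rare

# Weighted bootstrap draw, mimicking the snippet from fit_estimator above:
# rare classes can be missed entirely.
sample_weight = np.ones(len(X)) / len(X)
bootstrap_index = rng.choice(len(X), size=len(X), replace=True, p=sample_weight)

clf = RandomForestClassifier(random_state=0)
clf.fit(X[bootstrap_index], y[bootstrap_index])

# If a rare class was not drawn, predict_proba has fewer than 5 columns,
# which then breaks the (n_samples, n_classes) arithmetic later on.
print(clf.classes_)                # possibly missing one of the rare classes
print(clf.predict_proba(X).shape)  # e.g. (80, 4) instead of (80, 5)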

The data isn't completely balanced, if that makes a difference; however, there is definitely data for each class.

Am I doing something wrong or is this something that can just happen?

antoinedemathelin commented 1 year ago

Hi @WeGlove, Thank you for your interest in the Adapt library and thank you for reporting this strange bug. Your "debugging" explanation seems interesting, but I am not sure that the problem is there, because the RandomForestClassifier object of sklearn has a sample_weight argument in its fit method, which avoids the use of bootstrapping.
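
To illustrate (a rough sketch, not the exact adapt internals): when the estimator's fit method accepts sample_weight, the normalized weights can be passed directly, so no bootstrap indices are drawn and every class present in y stays in the training set.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(80, 3)
y = rng.randint(0, 5, size=80)
sample_weight = np.ones(len(X)) / len(X)

# The weights go straight to fit; no resampling, so no class can disappear.
estimator = RandomForestClassifier(random_state=0)
estimator.fit(X, y, sample_weight=sample_weight)
print(estimator.classes_)  # all 5 classes are kept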

I tried to reproduce the bug with an unbalanced-class dataset, but I did not manage to reproduce it.

Can you please tell me which versions of adapt and sklearn you are using? How are your classes encoded, with integers or strings? Do you use numpy arrays or pandas DataFrames for the X, y inputs?

Can you please share a little example where the bug happens (with simulated data and a fixed random seed)?
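
For example, something along these lines would help (a hypothetical template, please adapt it to your setting):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from adapt.instance_based import TrAdaBoost

np.random.seed(0)

# Simulated unbalanced 5-class source and target data (hypothetical template).
X, y = make_classification(n_samples=200, n_features=10, n_informative=6,
                           n_classes=5, weights=[0.6, 0.2, 0.1, 0.05, 0.05],
                           random_state=0)
Xt, yt = make_classification(n_samples=80, n_features=10, n_informative=6,
                             n_classes=5, weights=[0.6, 0.2, 0.1, 0.05, 0.05],
                             random_state=1)

model = TrAdaBoost(RandomForestClassifier(random_state=0),
                   n_estimators=10, Xt=Xt, yt=yt, random_state=0)
model.fit(X, y)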

Best,