Handling too few classes in landmarker cross validation

bjschoenfeld commented 5 years ago

Our landmarkers perform cross validation with 2 folds. Some datasets have only 1 instance of a particular target class. In that case, the validation in sklearn's cross validation throws an error, because it requires at least n_folds (2 in our case) instances of each class. Letting that raw error reach the user is not ideal. How should we handle this?

emrysshevek commented 5 years ago

Running on the LL0_488_colleges_aaup dataset:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 113, in compute
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 234, in _validate_compute_arguments
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 348, in _validate_n_folds
    f"{group.shape[0]}."
ValueError: The minimum number of instances in each class of Y is n_folds=2. Class VIIB has 1.
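
One way to see which classes trigger this check before calling compute is to count instances per class up front. The following is a minimal sketch, not metalearn's own code: the helper name find_rare_classes and the example labels other than VIIB are hypothetical, and it only mirrors the condition stated in the error message above.

```python
from collections import Counter


def find_rare_classes(y, n_folds=2):
    """Return {class_label: count} for classes with fewer than n_folds instances.

    Hypothetical helper for illustration; it is not part of metalearn.
    """
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n < n_folds}


# Made-up target column in which only class VIIB has a single instance.
y = ["I", "I", "IIA", "IIA", "IIA", "VIIB"]
rare = find_rare_classes(y, n_folds=2)
print(rare)
```

A caller could use the returned mapping to decide whether to skip the landmarking metafeatures, drop the rare classes, or report a clearer message; which of those is appropriate is exactly the open question in this issue.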

bjschoenfeld commented 5 years ago

Can we compare with OpenML on this?

emrysshevek commented 5 years ago

Similar to this, datasets with fewer than 4 instances per class fail. Should we handle something like this?

import pandas as pd
import numpy as np
from metalearn import Metafeatures

x = pd.DataFrame(np.random.rand(8,2))
y = pd.Series(['a','a','a','b','b','b'])
Metafeatures().compute(x,y)

Traceback (most recent call last):
  File "", line 1, in <module>
  File "metalearn/metafeatures/metafeatures.py", line 158, in compute
    value, compute_time = self._get_resource(metafeature_id)
  File "metalearn/metafeatures/metafeatures.py", line 390, in _get_resource
    computed_resources = f(args)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 72, in get_lda
    return run_pipeline(X, Y, pipeline, n_folds, cv_seed)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 34, in run_pipeline
    'accuracy': accuracy_scorer, 'kappa': kappa_scorer
  File "sklearn/model_selection/_validation.py", line 240, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "sklearn/externals/joblib/parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "sklearn/externals/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "sklearn/pipeline.py", line 267, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "sklearn/discriminant_analysis.py", line 435, in fit
    raise ValueError("The number of samples must be more "
ValueError: The number of samples must be more than the number of classes.

bjschoenfeld commented 5 years ago

> datasets with fewer than 4 instances per class fail

I believe you, but why is it 4, not 2? We only do 2-fold cv.

emrysshevek commented 5 years ago

I think it's because with 2-fold cv each training set has only half of the instances, so it needs at least 4 instances per class.

bjschoenfeld commented 5 years ago

I would think that if there were only two instances and two folds, one instance would go to each fold. The folds would take turns being the train and test sets...
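
The fold reasoning above can be checked with a small pure-Python simulation. This is only a sketch under stated assumptions: the round-robin assignment in stratified_fold_labels is a simplified stand-in for stratified splitting, not sklearn's actual algorithm, and both function names are hypothetical. It compares each training set's size against its number of classes, which is the condition named in the LDA error message quoted earlier in the thread.

```python
from collections import defaultdict


def stratified_fold_labels(y, n_folds=2):
    """Deal each class's instances round-robin across the folds.

    Simplified stand-in for stratified splitting (an assumption, not sklearn's code).
    """
    folds = [[] for _ in range(n_folds)]
    seen = defaultdict(int)  # how many instances of each class were placed so far
    for label in y:
        folds[seen[label] % n_folds].append(label)
        seen[label] += 1
    return folds


def training_sets(folds):
    """For each round, train on every fold except the held-out one."""
    return [
        [label for j, fold in enumerate(folds) if j != i for label in fold]
        for i in range(len(folds))
    ]


# Two classes with two instances each: one instance per class lands in each fold.
folds = stratified_fold_labels(["a", "a", "b", "b"])
for train in training_sets(folds):
    n_samples, n_classes = len(train), len(set(train))
    # The LDA message asks for n_samples > n_classes.
    print(n_samples, n_classes, n_samples > n_classes)
```

Under this simplified split, each fold does receive one instance of each class and the folds take turns as the test set, but every training set then holds 2 samples for 2 classes, so a "more samples than classes" requirement is not met. With four instances per class, the same sketch yields training sets of 4 samples for 2 classes.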

byu-dml / metalearn

Handling too few classes in landmarker cross validation #170