byu-dml / metalearn

BYU's python library of useable tools for metalearning
MIT License

Handling too few classes in landmarker cross validation #170

Open bjschoenfeld opened 5 years ago

bjschoenfeld commented 5 years ago

Our landmarkers perform cross-validation with 2 folds. Some datasets have only 1 instance of a particular target class. In that case, the validation in sklearn's cross-validation raises an error, because it requires at least n_folds (2 in our case) instances of each class. Surfacing that raw error to the user is not ideal. How should we handle this?
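
One option is to detect the problem before cross-validation runs. The following is a minimal sketch of such a check, not metalearn's actual implementation: `classes_below_n_folds` is a hypothetical helper name, and the decision about what to do with its result (skip the landmarkers, warn, or raise a clearer error) is left open.

```python
from collections import Counter


def classes_below_n_folds(y, n_folds=2):
    """Return {class_label: count} for classes with fewer than n_folds instances.

    `y` can be any iterable of labels (a list, or a pandas Series of targets).
    """
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n < n_folds}


# Class 'c' has a single instance, so 2-fold stratified CV cannot place it in both folds.
too_small = classes_below_n_folds(["a", "a", "b", "b", "c"], n_folds=2)
print(too_small)  # {'c': 1}
```

An empty result means every class has at least n_folds instances, so the landmarkers could proceed.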

emrysshevek commented 5 years ago

Running on the LL0_488_colleges_aaup dataset:

Traceback (most recent call last):
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 113, in compute
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 234, in _validate_compute_arguments
    n_folds, verbose
  File "venv/lib/python3.6/site-packages/metalearn/metafeatures/metafeatures.py", line 348, in _validate_n_folds
    f"{group.shape[0]}."
ValueError: The minimum number of instances in each class of Y is n_folds=2. Class VIIB has 1.

bjschoenfeld commented 5 years ago

Can we compare with OpenML on this?

emrysshevek commented 5 years ago

Similar to this, datasets with fewer than 4 instances per class fail. Should we handle something like this?

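import numpy as np
import pandas as pd
from metalearn import Metafeatures

# 6 rows to match the 6 labels below (the snippet as posted created 8 rows)
x = pd.DataFrame(np.random.rand(6, 2))
# Two classes with 3 instances each
y = pd.Series(['a', 'a', 'a', 'b', 'b', 'b'])
Metafeatures().compute(x, y)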

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "metalearn/metafeatures/metafeatures.py", line 158, in compute
    value, compute_time = self._get_resource(metafeature_id)
  File "metalearn/metafeatures/metafeatures.py", line 390, in _get_resource
    computed_resources = f(args)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 72, in get_lda
    return run_pipeline(X, Y, pipeline, n_folds, cv_seed)
  File "metalearn/metafeatures/landmarking_metafeatures.py", line 34, in run_pipeline
    'accuracy': accuracy_scorer, 'kappa': kappa_scorer
  File "sklearn/model_selection/_validation.py", line 240, in cross_validate
    for train, test in cv.split(X, y, groups))
  File "sklearn/externals/joblib/parallel.py", line 917, in __call__
    if self.dispatch_one_batch(iterator):
  File "sklearn/externals/joblib/parallel.py", line 759, in dispatch_one_batch
    self._dispatch(tasks)
  File "sklearn/externals/joblib/parallel.py", line 716, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "sklearn/externals/joblib/_parallel_backends.py", line 549, in __init__
    self.results = batch()
  File "sklearn/externals/joblib/parallel.py", line 225, in __call__
    for func, args, kwargs in self.items]
  File "sklearn/externals/joblib/parallel.py", line 225, in <listcomp>
    for func, args, kwargs in self.items]
  File "sklearn/model_selection/_validation.py", line 528, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "sklearn/pipeline.py", line 267, in fit
    self._final_estimator.fit(Xt, y, **fit_params)
  File "sklearn/discriminant_analysis.py", line 435, in fit
    raise ValueError("The number of samples must be more "
ValueError: The number of samples must be more than the number of classes.

bjschoenfeld commented 5 years ago

datasets with fewer than 4 instances per class fail

I believe you, but why is it 4, not 2? We only do 2-fold cv.

emrysshevek commented 5 years ago

I think it's because with 2-fold CV the training set has only half as many instances, so it needs at least 4.

bjschoenfeld commented 5 years ago

I would think that if there were only two instances and two folds, one instance would go to each fold. The folds would take turns being the train and test sets...
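
The arithmetic behind the 4-versus-2 question can be sketched as follows. This is a rough illustration under stated assumptions, not a description of what sklearn does in every version: it assumes every class has the same number of instances and that, in the worst case, each class sends its larger share to the test fold at the same time. The only fact taken from the traceback above is that LDA requires strictly more training samples than classes. `smallest_train_size` and `lda_can_fit` are hypothetical helper names.

```python
def smallest_train_size(per_class_count, n_classes, n_folds=2):
    """Worst-case training-set size for one fold of stratified CV.

    Assumes each class puts ceil(per_class_count / n_folds) instances
    into the test fold, leaving the rest for training.
    """
    largest_test_share = -(-per_class_count // n_folds)  # ceiling division
    return n_classes * (per_class_count - largest_test_share)


def lda_can_fit(n_samples, n_classes):
    """LDA's check from the traceback: samples must exceed the number of classes."""
    return n_samples > n_classes


n_classes = 2
for per_class_count in (2, 3, 4):
    n_train = smallest_train_size(per_class_count, n_classes)
    print(per_class_count, n_train, lda_can_fit(n_train, n_classes))
```

Under these assumptions, 2 or 3 instances per class leave a training fold with only 2 samples for 2 classes, which fails the check, while 4 per class leave 4 samples, which passes. So with two instances and two folds, one instance does go to each fold, but the training fold then holds exactly as many samples as classes, which LDA rejects.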