davidberenstein1957 / classy-classification

This repository contains an easy and intuitive approach to few-shot classification using sentence-transformers or spaCy models, or zero-shot classification with Huggingface.
MIT License
209 stars 15 forks

Allow for single class predictions. #12

Closed: koaning closed this issue 1 year ago

koaning commented 2 years ago

You're not allowed to look for a single topic using this tool. Is there a reason why binary classification wouldn't work?

import spacy
import classy_classification

data = {
    "stategy": ["I really prefer strategic games.",
                "I like it when a boardgame makes you think."],
}

nlp = spacy.load("en_core_web_md")
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "spacy"
    }
)

Got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Input In [12], in <cell line: 10>()
      4 data = {
      5     "stategy": ["I really prefer strategic games.",
      6                 "I like it when a boardgame makes you think."],
      7 }
      9 nlp = spacy.load("en_core_web_md")
---> 10 nlp.add_pipe(
     11     "text_categorizer",
     12     config={
     13         "data": data,
     14         "model": "spacy"
     15     }
     16 )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:795, in Language.add_pipe(self, factory_name, name, before, after, first, last, source, config, raw_config, validate)
    787     if not self.has_factory(factory_name):
    788         err = Errors.E002.format(
    789             name=factory_name,
    790             opts=", ".join(self.factory_names),
   (...)
    793             lang_code=self.lang,
    794         )
--> 795     pipe_component = self.create_pipe(
    796         factory_name,
    797         name=name,
    798         config=config,
    799         raw_config=raw_config,
    800         validate=validate,
    801     )
    802 pipe_index = self._get_pipe_index(before, after, first, last)
    803 self._pipe_meta[name] = self.get_factory_meta(factory_name)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/spacy/language.py:674, in Language.create_pipe(self, factory_name, name, config, raw_config, validate)
    671 cfg = {factory_name: config}
    672 # We're calling the internal _fill here to avoid constructing the
    673 # registered functions twice
--> 674 resolved = registry.resolve(cfg, validate=validate)
    675 filled = registry.fill({"cfg": cfg[factory_name]}, validate=validate)["cfg"]
    676 filled = Config(filled)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:747, in registry.resolve(cls, config, schema, overrides, validate)
    738 @classmethod
    739 def resolve(
    740     cls,
   (...)
    745     validate: bool = True,
    746 ) -> Dict[str, Any]:
--> 747     resolved, _ = cls._make(
    748         config, schema=schema, overrides=overrides, validate=validate, resolve=True
    749     )
    750     return resolved

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:796, in registry._make(cls, config, schema, overrides, resolve, validate)
    794 if not is_interpolated:
    795     config = Config(orig_config).interpolate()
--> 796 filled, _, resolved = cls._fill(
    797     config, schema, validate=validate, overrides=overrides, resolve=resolve
    798 )
    799 filled = Config(filled, section_order=section_order)
    800 # Check that overrides didn't include invalid properties not in config

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/thinc/config.py:868, in registry._fill(cls, config, schema, validate, resolve, parent, overrides)
    865     getter = cls.get(reg_name, func_name)
    866     # We don't want to try/except this and raise our own error
    867     # here, because we want the traceback if the function fails.
--> 868     getter_result = getter(*args, **kwargs)
    869 else:
    870     # We're not resolving and calling the function, so replace
    871     # the getter_result with a Promise class
    872     getter_result = Promise(
    873         registry=reg_name, name=func_name, args=args, kwargs=kwargs
    874     )

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/classy_classification/__init__.py:41, in make_text_categorizer(nlp, name, data, device, config, model, cat_type, include_doc, include_sent)
     39     if cat_type == "zero":
     40         raise NotImplementedError("cannot use spacy internal embeddings with zero-shot classification")
---> 41     return classySpacyInternal(
     42         nlp=nlp, name=name, data=data, config=config, include_doc=include_doc, include_sent=include_sent
     43     )
     44 else:
     45     if cat_type == "zero":

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/classy_classification/classifiers/spacy_internal.py:23, in classySpacyInternal.__init__(self, nlp, name, data, config, include_doc, include_sent)
     21 self.nlp = nlp
     22 self.set_training_data()
---> 23 self.set_svc()

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/classy_classification/classifiers/classy_skeleton.py:144, in classySkeleton.set_svc(self, config)
    135 cv_splits = max(2, min(folds, np.min(np.bincount(self.y)) // 5))
    136 self.clf = GridSearchCV(
    137     SVC(C=1, probability=True, class_weight="balanced"),
    138     param_grid=tuned_parameters,
   (...)
    142     verbose=0,
    143 )
--> 144 self.clf.fit(self.X, self.y)

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_search.py:875, in BaseSearchCV.fit(self, X, y, groups, **fit_params)
    869     results = self._format_results(
    870         all_candidate_params, n_splits, all_out, all_more_results
    871     )
    873     return results
--> 875 self._run_search(evaluate_candidates)
    877 # multimetric is determined here because in the case of a callable
    878 # self.scoring the return type is only known after calling
    879 first_test_score = all_out[0]["test_scores"]

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_search.py:1375, in GridSearchCV._run_search(self, evaluate_candidates)
   1373 def _run_search(self, evaluate_candidates):
   1374     """Search all candidates in param_grid"""
-> 1375     evaluate_candidates(ParameterGrid(self.param_grid))

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_search.py:852, in BaseSearchCV.fit.<locals>.evaluate_candidates(candidate_params, cv, more_results)
    845 elif len(out) != n_candidates * n_splits:
    846     raise ValueError(
    847         "cv.split and cv.get_n_splits returned "
    848         "inconsistent results. Expected {} "
    849         "splits, got {}".format(n_splits, len(out) // n_candidates)
    850     )
--> 852 _warn_or_raise_about_fit_failures(out, self.error_score)
    854 # For callable self.scoring, the return type is only know after
    855 # calling. If the return type is a dictionary, the error scores
    856 # can now be inserted with the correct key. The type checking
    857 # of out will be done in `_insert_error_scores`.
    858 if callable(self.scoring):

File ~/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:367, in _warn_or_raise_about_fit_failures(results, error_score)
    360 if num_failed_fits == num_fits:
    361     all_fits_failed_message = (
    362         f"\nAll the {num_fits} fits failed.\n"
    363         "It is very likely that your model is misconfigured.\n"
    364         "You can try to debug the error by setting error_score='raise'.\n\n"
    365         f"Below are more details about the failures:\n{fit_errors_summary}"
    366     )
--> 367     raise ValueError(all_fits_failed_message)
    369 else:
    370     some_fits_failed_message = (
    371         f"\n{num_failed_fits} fits failed out of a total of {num_fits}.\n"
    372         "The score on these train-test partitions for these parameters"
   (...)
    376         f"Below are more details about the failures:\n{fit_errors_summary}"
    377     )

ValueError: 
All the 12 fits failed.
It is very likely that your model is misconfigured.
You can try to debug the error by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
12 fits failed with the following error:
Traceback (most recent call last):
  File "/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 182, in fit
    y = self._validate_targets(y)
  File "/home/vincent/Development/prodigy-demos/venv/lib/python3.8/site-packages/sklearn/svm/_base.py", line 739, in _validate_targets
    raise ValueError(
ValueError: The number of classes has to be greater than one; got 1 class

davidberenstein1957 commented 2 years ago

That is because I chose to simply copy Rasa's implementation of intent classification, which uses an SVM. In that setup, the positive examples have to be separable from their counterparts, so at least two classes are needed.
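
For reference, the underlying failure can be reproduced with scikit-learn alone. This is a minimal sketch, assuming two made-up 2-dimensional vectors as stand-ins for the embeddings; the `SVC` arguments are the ones visible in the traceback above, and everything else is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Two toy vectors standing in for embeddings (hypothetical values).
# Both carry the same label, so the training set has only one class.
X = np.array([[0.1, 0.9], [0.2, 0.8]])
y = np.array([0, 0])

try:
    # Same estimator settings as shown in the traceback.
    SVC(C=1, probability=True, class_weight="balanced").fit(X, y)
    message = None
except ValueError as err:
    # scikit-learn refuses to fit a classifier on a single class.
    message = str(err)

print(message)
```

Inside `GridSearchCV` every fold hits this same error, which is why the report says that all fits failed.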

There are probably ways to solve this, but I have not gotten around to looking into it yet.
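
One possible direction, purely as a sketch and not something this library implements, is a one-class model such as scikit-learn's `OneClassSVM`, which learns a boundary around the positive examples only. The random vectors below are hypothetical stand-ins for embeddings, and the `nu` and `gamma` values are arbitrary assumptions rather than tuned settings.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Hypothetical stand-ins for embeddings of a single class:
# 20 points clustered tightly around 1.0 in 5 dimensions.
X_train = rng.normal(loc=1.0, scale=0.1, size=(20, 5))

# Fit on positive examples only; no second class is required.
clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_train)

# One point at the centre of the cluster and one far away from it.
inlier = np.full((1, 5), 1.0)
outlier = np.full((1, 5), -5.0)

# predict returns +1 for points judged to belong to the class, -1 otherwise.
pred = clf.predict(np.vstack([inlier, outlier]))
print(pred)
```

Unlike `SVC(probability=True)`, this gives a hard in-or-out decision (plus a raw `decision_function` score) rather than class probabilities, so it would not be a drop-in replacement for the current output format.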

davidberenstein1957 commented 2 years ago

This will be added to a new release during the coming weekend.

*edit: I will take another look at this later, since I have already spent a bit too much time refactoring the code. @koaning, if you want to contribute or have suggestions, they are always welcome 🤓