TeamHG-Memex / Formasaurus

Formasaurus tells you the type of an HTML form and its fields using machine learning

Formasaurus init fails with scikit-learn 1.2.0 #31

Open mlec1 opened 1 year ago

mlec1 commented 1 year ago

It seems that scikit-learn v1.2.0, released in December 2022, breaks the formasaurus init command. See the following output:

Training form type detector on 1423 example(s)...
#9 4.760 Traceback (most recent call last):
#9 4.760   File "/usr/local/bin/formasaurus", line 33, in <module>
#9 4.761     sys.exit(load_entry_point('formasaurus==0.9.0', 'console_scripts', 'formasaurus')())
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/__main__.py", line 72, in main
#9 4.761     formasaurus.FormFieldClassifier.load()
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 101, in load
#9 4.761     ex = cls.trained_on(DEFAULT_DATA_PATH)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 119, in trained_on
#9 4.761     ex.train(annotations)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 131, in train
#9 4.761     self.form_classifier.train(annotations)
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/classifiers.py", line 266, in train
#9 4.761     self.model = formtype_model.train(
#9 4.761   File "/usr/local/lib/python3.9/site-packages/formasaurus-0.9.0-py3.9.egg/formasaurus/formtype_model.py", line 128, in train
#9 4.762     return model.fit(X, y)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 402, in fit
#9 4.762     Xt = self._fit(X, y, **fit_params_steps)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 360, in _fit
#9 4.762     X, fitted_transformer = fit_transform_one_cached(
#9 4.762   File "/usr/local/lib/python3.9/site-packages/joblib/memory.py", line 349, in __call__
#9 4.762     return self.func(*args, **kwargs)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 894, in _fit_transform_one
#9 4.762     res = transformer.fit_transform(X, y, **fit_params)
#9 4.762   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
#9 4.763     data_to_wrap = f(self, X, *args, **kwargs)
#9 4.763   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 1193, in fit_transform
#9 4.763     results = self._parallel_func(X, y, fit_params, _fit_transform_one)
#9 4.763   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 1215, in _parallel_func
#9 4.763     return Parallel(n_jobs=self.n_jobs)(
#9 4.763   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 1088, in __call__
#9 4.764     while self.dispatch_one_batch(iterator):
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
#9 4.764     self._dispatch(tasks)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
#9 4.764     job = self._backend.apply_async(batch, callback=cb)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
#9 4.764     result = ImmediateResult(func)
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
#9 4.764     self.results = batch()
#9 4.764   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
#9 4.765     return [func(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
#9 4.765     return [func(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/fixes.py", line 117, in __call__
#9 4.765     return self.function(*args, **kwargs)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 894, in _fit_transform_one
#9 4.765     res = transformer.fit_transform(X, y, **fit_params)
#9 4.765   File "/usr/local/lib/python3.9/site-packages/sklearn/pipeline.py", line 446, in fit_transform
#9 4.766     return last_step.fit_transform(Xt, y, **fit_params_last_step)
#9 4.766   File "/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 2121, in fit_transform
#9 4.766     X = super().fit_transform(raw_documents)
#9 4.766   File "/usr/local/lib/python3.9/site-packages/sklearn/feature_extraction/text.py", line 1358, in fit_transform
#9 4.768     self._validate_params()
#9 4.768   File "/usr/local/lib/python3.9/site-packages/sklearn/base.py", line 570, in _validate_params
#9 4.768     validate_parameter_constraints(
#9 4.768   File "/usr/local/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 97, in validate_parameter_constraints
#9 4.768     raise InvalidParameterError(
#9 4.768 sklearn.utils._param_validation.InvalidParameterError: The 'stop_words' parameter of TfidfVectorizer must be a str among {'english'}, an instance of 'list' or None. Got {'and', 'of', 'or'} instead.

The command works fine with the previous scikit-learn release, v1.1.3.
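
Judging by the error message, the vectorizer receives its stop words as a set ({'and', 'of', 'or'}), while scikit-learn 1.2.0 validates the parameter and accepts only 'english', a list, or None. Below is a minimal sketch of a possible workaround, not the actual Formasaurus fix: as_stop_word_list and build_vectorizer are hypothetical helper names, and where exactly Formasaurus constructs its TfidfVectorizer is an assumption.

```python
def as_stop_word_list(stop_words):
    """Convert stop words to a form that scikit-learn >= 1.2 accepts.

    Hypothetical helper, not part of Formasaurus or scikit-learn.
    None and strings such as 'english' pass through unchanged;
    any other collection (set, frozenset, tuple) becomes a sorted list.
    """
    if stop_words is None or isinstance(stop_words, str):
        return stop_words
    return sorted(stop_words)


def build_vectorizer(stop_words):
    """Build a TfidfVectorizer with validated stop words (illustrative only)."""
    # Imported lazily so the helper above works without scikit-learn installed.
    from sklearn.feature_extraction.text import TfidfVectorizer

    return TfidfVectorizer(stop_words=as_stop_word_list(stop_words))


# The set from the error message becomes a plain list.
print(as_stop_word_list({"and", "of", "or"}))
```

Alternatively, pinning scikit-learn below 1.2 (for example to v1.1.3, which is known to work here) avoids the parameter check entirely.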

kmike commented 1 month ago

This should be fixed in https://github.com/scrapinghub/Formasaurus (released as 0.9.0). Unfortunately, we lost access to this repo, so development has moved to that location.