Closed kyleclo closed 1 year ago
I branched off of this locally to work on timo-ifying. There seem to be some unspecified dependencies:

```
pip install .[heuristic_predictors]
```

still leaves you with `joblib` uninstalled; not sure what else. I suspect that library is coming in transitively via some other `extras_require` that we don't want to depend on.
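For reference, a minimal sketch of how each extra could declare its own dependencies explicitly in `setup.py`, so nothing is relied on transitively (the package and dependency names here are assumptions, not the actual manifest):

```python
# Hypothetical setup.py fragment. Each extra lists every package it needs
# directly, so `pip install .[heuristic_predictors]` works even if no other
# extra happens to pull the same packages in transitively.
from setuptools import find_packages, setup

setup(
    name="mmda",
    packages=find_packages(),
    extras_require={
        # Assumed contents; the real extra may need more or fewer packages.
        "heuristic_predictors": [
            "joblib",        # loading the pickled SVM classifier
            "scikit-learn",  # scaler + classifier at predict time
        ],
    },
)
```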
thanks for catching this! I realized an issue with our GitHub CI: the build basically assumes everything is installed. I made a new workflow that's specific to the word predictor; I guess we'll need to slowly build this up for all the other predictors.
May have found a showstopping bug. This seems to blow up on the very first paper I tried, see:
```
>>> from mmda.parsers.pdfplumber_parser import PDFPlumberParser
>>> from mmda.predictors.heuristic_predictors.svm_word_predictor import SVMWordPredictor
>>> parser = PDFPlumberParser(split_at_punctuation=True)
>>> predictor = SVMWordPredictor.from_path("/tests/fixtures/svm_word_predictor/svm_word_predictor.tar.gz")
>>> doc = parser.parse("/output/no_bibs.pdf")
>>> predictor.predict(doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/mmda/predictors/heuristic_predictors/svm_word_predictor.py", line 225, in predict
    results = self.classifier.batch_predict(
  File "/usr/local/lib/python3.8/site-packages/mmda/predictors/heuristic_predictors/svm_word_predictor.py", line 107, in batch_predict
    all_features, word_id_to_feature_ids = self._get_features(words)
  File "/usr/local/lib/python3.8/site-packages/mmda/predictors/heuristic_predictors/svm_word_predictor.py", line 177, in _get_features
    dense_transformed = self.scaler.transform(dense_all)
  File "/usr/local/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/sklearn/preprocessing/_data.py", line 1233, in transform
    X = self._validate_data(
  File "/usr/local/lib/python3.8/site-packages/sklearn/base.py", line 565, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/usr/local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 902, in check_array
    raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```
The paper in question here is in this repo: src/ai2_internal/bib_entry_detection_predictor/data/no_bibs.pdf
How many full documents was this predictor verified on?
[EDIT] If you dig a bit deeper with the above, it claims there are no candidate words with hyphens, but there are definitely hyphenated words in the PDF.
gah! nice catch, sorry, I only tested on two-column papers. One-column papers rarely have hyphens that need correcting (the test example doesn't have one). I added handling for this + a test.
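For anyone hitting the same traceback: the crash happens because sklearn's `StandardScaler.transform` rejects an empty array when zero hyphenated candidates are found. A minimal sketch of the kind of guard that avoids it (the function and argument names here are assumptions based on the traceback, not the actual patch):

```python
import numpy as np


def batch_predict_safe(classifier, scaler, candidate_features):
    """Sketch: run a scaler + classifier over candidate feature rows,
    returning no predictions when the candidate set is empty (e.g. a
    one-column PDF with no hyphenated words to correct)."""
    if len(candidate_features) == 0:
        # Early return: scaler.transform(np.array([])) would raise
        # "Expected 2D array, got 1D array instead".
        return []
    # Ensure a 2D (n_samples, n_features) shape for sklearn.
    dense = np.asarray(candidate_features, dtype=float).reshape(
        len(candidate_features), -1
    )
    scaled = scaler.transform(dense)
    return classifier.predict(scaled).tolist()
```

The key design choice is checking for the empty case before any array construction, rather than trying to reshape an empty array into something sklearn will accept.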