allenai / mmda

multimodal document analysis
Apache License 2.0

SVM for predicting hyphenated words #247

Closed kyleclo closed 1 year ago

cmwilhelm commented 1 year ago

I branched off of this locally to work on timo-ifying. There seem to be some unspecified dependencies.

pip install .[heuristic_predictors] still leaves joblib uninstalled, and I'm not sure what else is missing. I suspect that library is coming in transitively via some other extras_require that we don't want to depend on.

kyleclo commented 1 year ago

> I branched off of this locally to work on timo-ifying. There seem to be some unspecified dependencies.
>
> pip install .[heuristic_predictors] still leaves joblib uninstalled, and I'm not sure what else is missing. I suspect that library is coming in transitively via some other extras_require that we don't want to depend on.

Thanks for catching this. I realized an issue with our GitHub CI: the build basically assumes everything gets installed. I made a new workflow that's specific to the word predictor; I guess we need to slowly build this up for all the other predictors.
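For reference, one way to fix the missing joblib would be to declare it explicitly under the heuristic_predictors extra in setup.py. This is only a sketch; beyond joblib, the package list is an assumption about what the SVM predictor actually needs:

```python
# setup.py fragment (sketch): declare joblib directly instead of
# relying on it arriving transitively through another extra.
extras_require = {
    "heuristic_predictors": [
        "joblib",        # loads the pickled SVM + scaler
        "scikit-learn",  # SVM / StandardScaler used at predict time
    ],
}
```

With that in place, pip install .[heuristic_predictors] would pull joblib in directly.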

May have found a showstopping bug. This seems to blow up on the very first paper I tried, see:

>>> from mmda.parsers.pdfplumber_parser import PDFPlumberParser
>>> from mmda.predictors.heuristic_predictors.svm_word_predictor import SVMWordPredictor
>>> parser = PDFPlumberParser(split_at_punctuation=True)
>>> predictor = SVMWordPredictor.from_path("/tests/fixtures/svm_word_predictor/svm_word_predictor.tar.gz")
>>> doc = parser.parse("/output/no_bibs.pdf")
>>> predictor.predict(doc)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/site-packages/mmda/predictors/heuristic_predictors/svm_word_predictor.py", line 225, in predict
    results = self.classifier.batch_predict(
  File "/usr/local/lib/python3.8/site-packages/mmda/predictors/heuristic_predictors/svm_word_predictor.py", line 107, in batch_predict
    all_features, word_id_to_feature_ids = self._get_features(words)
  File "/usr/local/lib/python3.8/site-packages/mmda/predictors/heuristic_predictors/svm_word_predictor.py", line 177, in _get_features
    dense_transformed = self.scaler.transform(dense_all)
  File "/usr/local/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 140, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/sklearn/preprocessing/_data.py", line 1233, in transform
    X = self._validate_data(
  File "/usr/local/lib/python3.8/site-packages/sklearn/base.py", line 565, in _validate_data
    X = check_array(X, input_name="X", **check_params)
  File "/usr/local/lib/python3.8/site-packages/sklearn/utils/validation.py", line 902, in check_array
    raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
>>>

The paper in question here is in this repo: src/ai2_internal/bib_entry_detection_predictor/data/no_bibs.pdf

How many full documents was this predictor verified on?

[EDIT] If you dig a bit deeper with the above, the predictor claims there are no candidate words with hyphens, but there are definitely hyphenated words in the PDF.
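The traceback above is consistent with an empty candidate list reaching the scaler: an empty feature array is 1D with shape (0,), which sklearn's input validation rejects. A minimal sketch of the early-return guard, assuming hypothetical function and argument names (this is not mmda's actual API):

```python
import numpy as np

def batch_predict_guarded(words, featurize, scaler_transform):
    # Only words containing a hyphen are candidates for splitting.
    candidates = [w for w in words if "-" in w]
    # Guard: with no candidates, the scaler would receive an empty
    # 1D array and raise "Expected 2D array, got 1D array instead".
    if not candidates:
        return []
    features = np.vstack([featurize(w) for w in candidates])  # shape (n, d)
    return scaler_transform(features)
```

With a one-column paper that yields no hyphenated candidates, this returns an empty result instead of crashing.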

Gah! Nice catch, sorry, I had only tested on two-column papers. One-column papers rarely have hyphens that need correcting (the test example doesn't have one). I added handling for this plus a test.