allenai / mmda

multimodal document analysis
Apache License 2.0

bugfix; not sure why but skipping token IDs in some pdfs #268

Closed kyleclo closed 11 months ago

kyleclo commented 12 months ago

This fixes the following error, raised while processing PDF 585bcfc650f744efa7942900b742c1b64863350e:

Traceback (most recent call last):
  File "/opt/ml/code/server/api.py", line 329, in perform_invocations
    prediction_batch = maybe_predictor.predict_batch(batch)
  File "/usr/local/lib/python3.10/site-packages/ai2_internal/svm_word_predictor/interface.py", line 73, in predict_batch
    return [self.predict_one(instance) for instance in instances]
  File "/usr/local/lib/python3.10/site-packages/ai2_internal/svm_word_predictor/interface.py", line 73, in <listcomp>
    return [self.predict_one(instance) for instance in instances]
  File "/usr/local/lib/python3.10/site-packages/ai2_internal/svm_word_predictor/interface.py", line 56, in predict_one
    words = self._predictor.predict(doc)
  File "/usr/local/lib/python3.10/site-packages/mmda/predictors/sklearn_predictors/svm_word_predictor.py", line 257, in predict
    hyphen_word_candidates = self._find_hyphen_word_candidates(
  File "/usr/local/lib/python3.10/site-packages/mmda/predictors/sklearn_predictors/svm_word_predictor.py", line 504, in _find_hyphen_word_candidates
    prefix_word_id = token_id_to_word_id[hyphen_token_id]
KeyError: 11941
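
The traceback shows a plain dictionary lookup failing because token ID 11941 has no entry in `token_id_to_word_id`. A minimal sketch of one way to tolerate such gaps follows. It is not the actual change in this PR, and `lookup_prefix_word_ids` is a hypothetical helper, not part of mmda. The sample mapping and its word IDs are made up for illustration.

```python
from typing import Dict, List, Tuple


def lookup_prefix_word_ids(
    hyphen_token_ids: List[int],
    token_id_to_word_id: Dict[int, int],
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Map each hyphen token ID to its word ID, tolerating missing IDs.

    Hypothetical helper, not part of mmda. Returns (found, skipped):
    found holds (token_id, word_id) pairs, and skipped holds the token
    IDs that had no entry in the mapping.
    """
    found: List[Tuple[int, int]] = []
    skipped: List[int] = []
    for token_id in hyphen_token_ids:
        # dict.get returns None instead of raising KeyError on a gap.
        word_id = token_id_to_word_id.get(token_id)
        if word_id is None:
            skipped.append(token_id)
            continue
        found.append((token_id, word_id))
    return found, skipped


# Toy mapping with a gap at token ID 11941 (all values are invented).
token_id_to_word_id = {11939: 5200, 11940: 5201, 11942: 5202}
found, skipped = lookup_prefix_word_ids([11940, 11941], token_id_to_word_id)
print("found:", found)
print("skipped:", skipped)
```

Returning the skipped IDs rather than dropping them silently keeps the gap visible, so a caller can log them while the underlying cause of the missing token IDs is still unknown.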