dod-advana / gamechanger-data

GAMECHANGER aspires to be the Department’s trusted solution for evidence-based, data-driven decision-making across the universe of DoD requirements
MIT License

Corrupted PDFs cause mupdf to runtime error in Document Parser #97

Closed nawagner closed 2 years ago

nawagner commented 3 years ago

I have many folders of PDFs downloaded from a legacy database, some of which are corrupted and cannot be opened with Adobe Acrobat. I cannot attach an example here, sorry. mupdf struggles with these files and raises an error. Is there some way this could be caught and noted in the output JSON's metadata instead of stopping the entire run?

Stack Trace:

2021-07-01 14:53:54,764 - [INFO] - Document Parser has started
Memory Hard Limit: -1 Soft Limit: -1 Maximum of percentage of memory use: 0.8 ____ 12826868____
2021-07-01 14:54:01.331068: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-01 14:54:01.331137: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-07-01 14:54:19,395 - [INFO] - Parsing Multiple Documents: 3
Current Time = 14:54:19
2021-07-01 14:54:19,396 - [INFO] - Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
running policy_analyics.parse on /mnt/c/Users/nwagner/Downloads/test/Document.pdf
2021-07-01 14:54:19,717 - [INFO] - Finished Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
2021-07-01 14:54:19,717 - [INFO] - Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
running policy_analyics.parse on /mnt/c/Users/nwagner/Downloads/test/Document.pdf
mupdf: cannot recognize version marker
mupdf: no objects found
Traceback (most recent call last):
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nwagner/gamechanger-data/common/document_parser/main.py", line 17, in <module>
    cli()
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/nwagner/gamechanger-data/common/document_parser/cli.py", line 181, in pdf_to_json_cmd_wrapper
    num_ocr_threads=num_ocr_threads
  File "/home/nwagner/gamechanger-data/common/document_parser/cli.py", line 71, in pdf_to_json
    num_ocr_threads=num_ocr_threads
  File "/home/nwagner/gamechanger-data/common/document_parser/process.py", line 189, in process_dir
    single_process(item)
  File "/home/nwagner/gamechanger-data/common/document_parser/process.py", line 112, in single_process
    num_ocr_threads=num_ocr_threads, force_ocr=force_ocr, out_dir=out_dir)
  File "/home/nwagner/gamechanger-data/common/document_parser/parsers/policy_analytics/parse.py", line 37, in parse
    doc_obj = pdf_reader.get_fitz_doc_obj(f_name)
  File "/home/nwagner/gamechanger-data/common/document_parser/lib/pdf_reader.py", line 8, in get_fitz_doc_obj
    doc = fitz.open(f_name)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/fitz/fitz.py", line 2494, in __init__
    _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
RuntimeError: no objects found

takao8 commented 3 years ago

Hi Nicholas, a little while ago we built a framework that captures just the metadata for PDFs that failed to download, but this error suggests something else is going on. I know you said you can't attach an example, but is there any other way you could get us a corrupted PDF like the ones you're working with? It's hard for us to reproduce this error otherwise.
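For reference, the behavior requested in the original report (catch the failure and note it in the output JSON instead of stopping the run) could look roughly like the sketch below. Only `fitz.open` and the `RuntimeError` come from the traceback. The function `parse_or_record_failure`, its `parse_fn` and `open_fn` parameters, and the `parse_failed` and `error` field names are hypothetical and are not taken from the gamechanger-data code or its JSON schema.

```python
import json
from pathlib import Path


def _open_with_fitz(path):
    # Imported lazily so the sketch can be exercised without PyMuPDF installed.
    import fitz
    return fitz.open(path)


def parse_or_record_failure(pdf_path, out_dir, parse_fn, open_fn=_open_with_fitz):
    """Parse one PDF; if it cannot be opened, write a stub JSON instead of raising.

    parse_fn is a hypothetical callable that takes the opened document and
    returns a dict of parsed fields.
    """
    pdf_path, out_dir = Path(pdf_path), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / (pdf_path.stem + ".json")
    try:
        doc = open_fn(str(pdf_path))
    except RuntimeError as err:
        # RuntimeError is the exception type shown in the traceback above.
        # The field names here are illustrative, not the project's schema.
        record = {"filename": pdf_path.name, "parse_failed": True, "error": str(err)}
    else:
        record = {"filename": pdf_path.name, "parse_failed": False, **parse_fn(doc)}
    out_file.write_text(json.dumps(record))
    return record
```

Called once per file inside the directory loop, this would let the remaining documents continue after a corrupted one, with the failure still visible in that file's JSON output.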

nawagner commented 3 years ago

Unfortunately, I tried recreating a similarly corrupted file from an arxiv PDF by scrambling contents randomly and removing the header, but it doesn't seem to cause gamechanger (and by gamechanger I assume mupdf) any major issues. Let me share my conda list to see if there is anything outdated you know about.

# packages in environment at /home/nwagner/miniconda3/envs/gc:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
absl-py 0.11.0 pypi_0 pypi
alembic 1.4.1 pypi_0 pypi
aniso8601 8.0.0 pypi_0 pypi
annoy 1.16.3 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
apscheduler 3.6.3 pypi_0 pypi
arabic-reshaper 2.1.0 pypi_0 pypi
asgiref 3.2.10 pypi_0 pypi
astroid 2.6.2 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
attrs 19.3.0 pypi_0 pypi
automat 20.2.0 pypi_0 pypi
azure-core 1.8.2 pypi_0 pypi
azure-storage-blob 12.5.0 pypi_0 pypi
backcall 0.2.0 pypi_0 pypi
beautifulsoup4 4.9.1 pypi_0 pypi
bert-extractive-summarizer 0.5.1 pypi_0 pypi
blis 0.4.1 pypi_0 pypi
boto 2.49.0 pypi_0 pypi
boto3 1.13.2 pypi_0 pypi
botocore 1.16.2 pypi_0 pypi
ca-certificates 2021.5.25 h06a4308_1
cachetools 4.1.0 pypi_0 pypi
catalogue 2.0.4 pypi_0 pypi
certifi 2020.12.5 pypi_0 pypi
cffi 1.14.0 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
ci-info 0.2.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cloudpickle 1.6.0 pypi_0 pypi
colorama 0.4.3 pypi_0 pypi
coloredlogs 14.0 pypi_0 pypi
configobj 5.0.6 pypi_0 pypi
configparser 5.0.0 pypi_0 pypi
constantly 15.1.0 pypi_0 pypi
contextvars 2.4 pypi_0 pypi
corenlp 0.0.14 pypi_0 pypi
corenlp-protobuf 3.8.0 pypi_0 pypi
coverage 5.3 pypi_0 pypi
cryptography 2.9.2 pypi_0 pypi
cssselect 1.1.0 pypi_0 pypi
cycler 0.10.0 pypi_0 pypi
cymem 2.0.3 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
cytoolz 0.10.1 pypi_0 pypi
databricks-cli 0.13.0 pypi_0 pypi
dataclasses 0.7 pypi_0 pypi
decorator 4.4.2 pypi_0 pypi
devtools 0.6.1 pypi_0 pypi
dill 0.3.3 pypi_0 pypi
distlib 0.3.1 pypi_0 pypi
django 3.0.7 pypi_0 pypi
docker 4.3.1 pypi_0 pypi
docutils 0.15.2 pypi_0 pypi
dotmap 1.3.0 pypi_0 pypi
elastic-apm 5.9.0 pypi_0 pypi
elasticsearch 7.9.1 pypi_0 pypi
eli5 0.10.1 pypi_0 pypi
en-core-web-lg 3.0.0 pypi_0 pypi
en-core-web-md 3.0.0 pypi_0 pypi
en-core-web-sm 3.0.0 pypi_0 pypi
english 2020.7.0 pypi_0 pypi
entrypoints 0.3 pypi_0 pypi
etelemetry 0.2.1 pypi_0 pypi
faiss-cpu 1.6.3 pypi_0 pypi
faiss-gpu 1.6.3 pypi_0 pypi
farm 0.6.2 pypi_0 pypi
farm-haystack 0.7.0 pypi_0 pypi
fastapi 0.61.1 pypi_0 pypi
fastapi-utils 0.2.1 pypi_0 pypi
fasteners 0.16 pypi_0 pypi
fasttext 0.9.2 pypi_0 pypi
fasttext-wheel 0.9.2 pypi_0 pypi
filelock 3.0.12 pypi_0 pypi
filetype 1.0.7 pypi_0 pypi
flake8 3.9.2 pypi_0 pypi
flask 1.1.2 pypi_0 pypi
flask-cors 3.0.9 pypi_0 pypi
flask-restplus 0.13.0 pypi_0 pypi
flatbuffers 1.12 pypi_0 pypi
future 0.18.2 pypi_0 pypi
gamechanger evergreen pypi_0 pypi
gamechangerml 0.1.0 dev_0
gast 0.3.3 pypi_0 pypi
gensim 3.8.3 pypi_0 pypi
gitdb 4.0.5 pypi_0 pypi
gitpython 3.1.11 pypi_0 pypi
google-auth 1.16.1 pypi_0 pypi
google-auth-oauthlib 0.4.1 pypi_0 pypi
google-pasta 0.2.0 pypi_0 pypi
gorilla 0.3.0 pypi_0 pypi
grpcio 1.32.0 pypi_0 pypi
gunicorn 20.0.4 pypi_0 pypi
h11 0.9.0 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
hnswlib 0.5.1 pypi_0 pypi
html5lib 1.1 pypi_0 pypi
httplib2 0.18.1 pypi_0 pypi
httptools 0.1.1 pypi_0 pypi
humanfriendly 8.2 pypi_0 pypi
hyperlink 19.0.0 pypi_0 pypi
hypothesis 6.14.0 pypi_0 pypi
idna 2.9 pypi_0 pypi
image 1.5.32 pypi_0 pypi
img2pdf 0.4.0 pypi_0 pypi
immutables 0.15 pypi_0 pypi
importlib-metadata 1.6.0 pypi_0 pypi
importlib-resources 3.3.0 pypi_0 pypi
incremental 17.5.0 pypi_0 pypi
iniconfig 1.1.1 pypi_0 pypi
ipython 7.16.1 pypi_0 pypi
ipython-genutils 0.2.0 pypi_0 pypi
isodate 0.6.0 pypi_0 pypi
isort 5.9.1 pypi_0 pypi
itsdangerous 1.1.0 pypi_0 pypi
jedi 0.17.2 pypi_0 pypi
jellyfish 0.8.2 pypi_0 pypi
jinja2 2.11.2 pypi_0 pypi
jmespath 0.9.5 pypi_0 pypi
joblib 0.15.1 pypi_0 pypi
jsonschema 3.2.0 pypi_0 pypi
keras 2.3.1 pypi_0 pypi
keras-applications 1.0.8 pypi_0 pypi
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.3.1 pypi_0 pypi
langdetect 1.0.8 pypi_0 pypi
lazy-object-proxy 1.6.0 pypi_0 pypi
ld_impl_linux-64 2.35.1 h7274673_9
libffi 3.3 he6710b0_2
libgcc-ng 9.3.0 h5101ec6_17
libgomp 9.3.0 h5101ec6_17
libstdcxx-ng 9.3.0 hd4cf53a_17
lxml 4.5.1 pypi_0 pypi
lz4 3.1.3 pypi_0 pypi
mako 1.1.3 pypi_0 pypi
markdown 3.2.2 pypi_0 pypi
markupsafe 1.1.1 pypi_0 pypi
matplotlib 3.3.4 pypi_0 pypi
mccabe 0.6.1 pypi_0 pypi
mlflow 1.0.0 pypi_0 pypi
monotonic 1.5 pypi_0 pypi
more-itertools 8.6.0 pypi_0 pypi
msrest 0.6.19 pypi_0 pypi
murmurhash 1.0.2 pypi_0 pypi
mypy 0.910 pypi_0 pypi
mypy-extensions 0.4.3 pypi_0 pypi
ncurses 6.2 he6710b0_1
neo4j 4.1.1 pypi_0 pypi
neobolt 1.7.17 pypi_0 pypi
neotime 1.7.4 pypi_0 pypi
networkx 2.4 pypi_0 pypi
neuralcoref 4.0 pypi_0 pypi
neurdflib 5.0.1 pypi_0 pypi
nibabel 3.1.0 pypi_0 pypi
nipype 1.5.0 pypi_0 pypi
nltk 3.5 pypi_0 pypi
nose 1.3.7 pypi_0 pypi
numpy 1.19.5 pypi_0 pypi
oauthlib 3.1.0 pypi_0 pypi
ocrmypdf 11.3.2 pypi_0 pypi
openssl 1.1.1k h27cfd23_0
opt-einsum 3.3.0 pypi_0 pypi
packaging 20.4 pypi_0 pypi
pandas 1.0.4 pypi_0 pypi
pansi 2020.7.3 pypi_0 pypi
parsel 1.6.0 pypi_0 pypi
parso 0.7.1 pypi_0 pypi
pathy 0.5.2 pypi_0 pypi
pdfminer-six 20201018 pypi_0 pypi
pexpect 4.8.0 pypi_0 pypi
pickleshare 0.7.5 pypi_0 pypi
pikepdf 2.0.0 pypi_0 pypi
pillow 8.0.1 pypi_0 pypi
pip 21.1.3 py36h06a4308_0
plac 1.1.3 pypi_0 pypi
pluggy 0.13.1 pypi_0 pypi
preshed 3.0.2 pypi_0 pypi
prometheus-client 0.8.0 pypi_0 pypi
prometheus-flask-exporter 0.18.1 pypi_0 pypi
prompt-toolkit 2.0.10 pypi_0 pypi
protego 0.1.16 pypi_0 pypi
protobuf 3.12.2 pypi_0 pypi
prov 1.5.3 pypi_0 pypi
psutil 5.7.3 pypi_0 pypi
psycopg2-binary 2.8.6 pypi_0 pypi
ptyprocess 0.7.0 pypi_0 pypi
py 1.9.0 pypi_0 pypi
py2neo 2021.0.0 pypi_0 pypi
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pybind11 2.6.2 pypi_0 pypi
pycodestyle 2.7.0 pypi_0 pypi
pycparser 2.20 pypi_0 pypi
pydantic 1.7.4 pypi_0 pypi
pydispatcher 2.0.5 pypi_0 pypi
pydot 1.4.1 pypi_0 pypi
pydotplus 2.0.2 pypi_0 pypi
pyflakes 2.3.1 pypi_0 pypi
pygments 2.3.1 pypi_0 pypi
pyhamcrest 2.0.2 pypi_0 pypi
pylint 2.9.3 pypi_0 pypi
pymagnitude-lite 0.1.143 pypi_0 pypi
pymupdf 1.17.2 pypi_0 pypi
pyopenssl 19.1.0 pypi_0 pypi
pyparsing 2.4.7 pypi_0 pypi
pypdf2 1.26.0 pypi_0 pypi
pyrsistent 0.16.0 pypi_0 pypi
pytesseract 0.3.4 pypi_0 pypi
pytest 6.2.4 pypi_0 pypi
python 3.6.13 h12debd9_1
python-bidi 0.4.2 pypi_0 pypi
python-dateutil 2.8.1 pypi_0 pypi
python-docx 0.8.10 pypi_0 pypi
python-editor 1.0.4 pypi_0 pypi
python-graphviz 0.14 pypi_0 pypi
python-multipart 0.0.5 pypi_0 pypi
pytz 2020.1 pypi_0 pypi
pyxnat 1.3 pypi_0 pypi
pyyaml 5.4.1 pypi_0 pypi
querystring-parser 1.2.4 pypi_0 pypi
queuelib 1.5.0 pypi_0 pypi
rdflib 5.0.0 pypi_0 pypi
readline 8.1 h27cfd23_0
redis 3.5.3 pypi_0 pypi
regex 2020.5.14 pypi_0 pypi
reportlab 3.5.55 pypi_0 pypi
requests 2.23.0 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rsa 4.0 pypi_0 pypi
s3transfer 0.3.3 pypi_0 pypi
sacremoses 0.0.43 pypi_0 pypi
scikit-learn 0.23.1 pypi_0 pypi
scipy 1.4.1 pypi_0 pypi
scrapy 2.1.0 pypi_0 pypi
seaborn 0.11.1 pypi_0 pypi
selenium 3.141.0 pypi_0 pypi
sentence-transformers 0.4.1.2 pypi_0 pypi
sentencepiece 0.1.94 pypi_0 pypi
seqeval 1.2.2 pypi_0 pypi
service-identity 18.1.0 pypi_0 pypi
setuptools 57.0.0 pypi_0 pypi
setuptools-scm 6.0.1 pypi_0 pypi
simplejson 3.17.0 pypi_0 pypi
six 1.15.0 pypi_0 pypi
sklearn 0.0 pypi_0 pypi
smart-open 3.0.0 pypi_0 pypi
smmap 3.0.4 pypi_0 pypi
sortedcontainers 2.2.2 pypi_0 pypi
soupsieve 2.0.1 pypi_0 pypi
spacy 3.0.6 pypi_0 pypi
spacy-legacy 3.0.5 pypi_0 pypi
spicy 0.16.0 pypi_0 pypi
sqlalchemy 1.3.13 pypi_0 pypi
sqlalchemy-utils 0.36.8 pypi_0 pypi
sqlite 3.36.0 hc218d9a_0
sqlparse 0.4.1 pypi_0 pypi
srsly 2.4.1 pypi_0 pypi
starlette 0.13.6 pypi_0 pypi
syntok 1.3.1 pypi_0 pypi
tabulate 0.8.7 pypi_0 pypi
tensorboard 2.4.1 pypi_0 pypi
tensorboard-plugin-wit 1.7.0 pypi_0 pypi
tensorflow 2.4.1 pypi_0 pypi
tensorflow-estimator 2.4.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
textblob 0.15.3 pypi_0 pypi
thinc 8.0.4 pypi_0 pypi
threadpoolctl 2.1.0 pypi_0 pypi
tika 1.24 pypi_0 pypi
tk 8.6.10 hbc83047_0
tokenizers 0.9.4 pypi_0 pypi
toml 0.10.2 pypi_0 pypi
toolz 0.10.0 pypi_0 pypi
torch 1.7.1 pypi_0 pypi
torchvision 0.8.2 pypi_0 pypi
tox 3.20.1 pypi_0 pypi
tqdm 4.46.1 pypi_0 pypi
traitlets 4.3.3 pypi_0 pypi
traits 6.1.0 pypi_0 pypi
transformers 4.1.1 pypi_0 pypi
twisted 20.3.0 pypi_0 pypi
txtai 2.0.0 pypi_0 pypi
typed-ast 1.4.1 pypi_0 pypi
typer 0.3.2 pypi_0 pypi
typing-extensions 3.7.4.3 pypi_0 pypi
tzlocal 2.1 pypi_0 pypi
urllib3 1.24.3 pypi_0 pypi
uvicorn 0.13.3 pypi_0 pypi
uvloop 0.14.0 pypi_0 pypi
virtualenv 20.1.0 pypi_0 pypi
w3lib 1.22.0 pypi_0 pypi
wasabi 0.8.2 pypi_0 pypi
wcwidth 0.2.3 pypi_0 pypi
webencodings 0.5.1 pypi_0 pypi
websocket-client 0.57.0 pypi_0 pypi
websockets 8.1 pypi_0 pypi
werkzeug 0.16.1 pypi_0 pypi
wget 3.2 pypi_0 pypi
wheel 0.36.2 pyhd3eb1b0_0
wrapt 1.12.1 pypi_0 pypi
xgboost 1.1.0 pypi_0 pypi
xhtml2pdf 0.2.5 pypi_0 pypi
xxhash 2.0.0 pypi_0 pypi
xz 5.2.5 h7b6447c_0
zipp 3.1.0 pypi_0 pypi
zlib 1.2.11 h7b6447c_3
zope-interface 5.1.0 pypi_0 pypi
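The reproduction attempt described above (damaging a known-good PDF by hand) could be made repeatable with a small helper like the one below. This is a sketch only. The function names and the four variants are assumptions made for illustration, and it is not confirmed that any of these files triggers the same "no objects found" error in mupdf; the manual attempt reported in this comment did not.

```python
from pathlib import Path


def make_corrupt_variants(pdf_bytes):
    """Return a dict mapping a variant name to a differently damaged copy."""
    # The first line of a PDF is normally the "%PDF-x.y" version marker.
    header_end = pdf_bytes.find(b"\n") + 1
    return {
        "no_header": pdf_bytes[header_end:],             # version marker removed
        "truncated": pdf_bytes[: len(pdf_bytes) // 2],   # second half cut off
        "empty": b"",                                    # zero-byte file
        "not_pdf": b"this is not a pdf\n",               # arbitrary non-PDF content
    }


def write_variants(pdf_bytes, out_dir):
    """Write each damaged variant to out_dir as <name>.pdf and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, data in make_corrupt_variants(pdf_bytes).items():
        path = out_dir / f"{name}.pdf"
        path.write_bytes(data)
        paths.append(path)
    return paths
```

Running the document parser over the resulting folder would show which kinds of damage, if any, the current environment fails on.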

takao8 commented 2 years ago

Hey Nicholas, we've made a lot of changes and bug fixes throughout the repo over the last couple of weeks (some of them in our parser, too). I'm not convinced that these updates will resolve this issue as it stands, but could you git pull and verify that you're still getting the same error? If you are, I'll get back to you shortly about potentially testing fixes for this on a separate branch.