Closed nawagner closed 2 years ago
Hi Nicholas, we actually built a framework a little while ago to capture metadata for PDFs that failed to download, but this error tells us the cause might be something else. I know you said you can't attach an example, but is there any other way we could get some form of corrupted PDF like the ones you're working with to test against? It's hard for us to recreate this error otherwise.
Unfortunately, I tried recreating a similarly corrupted file from an arXiv PDF by randomly scrambling its contents and removing the header, but that doesn't seem to cause gamechanger (and by gamechanger I assume mupdf) any major issues. Let me share my conda list so you can see whether anything you know of is outdated.
# Name Version Build Channel
_libgcc_mutex 0.1 main
_openmp_mutex 4.5 1_gnu
absl-py 0.11.0 pypi_0 pypi
alembic 1.4.1 pypi_0 pypi
aniso8601 8.0.0 pypi_0 pypi
annoy 1.16.3 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
apscheduler 3.6.3 pypi_0 pypi
arabic-reshaper 2.1.0 pypi_0 pypi
asgiref 3.2.10 pypi_0 pypi
astroid 2.6.2 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
attrs 19.3.0 pypi_0 pypi
automat 20.2.0 pypi_0 pypi
azure-core 1.8.2 pypi_0 pypi
azure-storage-blob 12.5.0 pypi_0 pypi
backcall 0.2.0 pypi_0 pypi
beautifulsoup4 4.9.1 pypi_0 pypi
bert-extractive-summarizer 0.5.1 pypi_0 pypi
blis 0.4.1 pypi_0 pypi
boto 2.49.0 pypi_0 pypi
boto3 1.13.2 pypi_0 pypi
botocore 1.16.2 pypi_0 pypi
ca-certificates 2021.5.25 h06a4308_1
cachetools 4.1.0 pypi_0 pypi
catalogue 2.0.4 pypi_0 pypi
certifi 2020.12.5 pypi_0 pypi
cffi 1.14.0 pypi_0 pypi
chardet 3.0.4 pypi_0 pypi
ci-info 0.2.0 pypi_0 pypi
click 7.1.2 pypi_0 pypi
cloudpickle 1.6.0 pypi_0 pypi
colorama 0.4.3 pypi_0 pypi
coloredlogs 14.0 pypi_0 pypi
configobj 5.0.6 pypi_0 pypi
configparser 5.0.0 pypi_0 pypi
constantly 15.1.0 pypi_0 pypi
contextvars 2.4 pypi_0 pypi
corenlp 0.0.14 pypi_0 pypi
corenlp-protobuf 3.8.0 pypi_0 pypi
coverage 5.3 pypi_0 pypi
cryptography 2.9.2 pypi_0 pypi
cssselect 1.1.0 pypi_0 pypi
cycler 0.10.0 pypi_0 pypi
cymem 2.0.3 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
cytoolz 0.10.1 pypi_0 pypi
databricks-cli 0.13.0 pypi_0 pypi
dataclasses 0.7 pypi_0 pypi
decorator 4.4.2 pypi_0 pypi
devtools 0.6.1 pypi_0 pypi
dill 0.3.3 pypi_0 pypi
distlib 0.3.1 pypi_0 pypi
django 3.0.7 pypi_0 pypi
docker 4.3.1 pypi_0 pypi
docutils 0.15.2 pypi_0 pypi
dotmap 1.3.0 pypi_0 pypi
elastic-apm 5.9.0 pypi_0 pypi
elasticsearch 7.9.1 pypi_0 pypi
eli5 0.10.1 pypi_0 pypi
en-core-web-lg 3.0.0 pypi_0 pypi
en-core-web-md 3.0.0 pypi_0 pypi
en-core-web-sm 3.0.0 pypi_0 pypi
english 2020.7.0 pypi_0 pypi
entrypoints 0.3 pypi_0 pypi
etelemetry 0.2.1 pypi_0 pypi
faiss-cpu 1.6.3 pypi_0 pypi
faiss-gpu 1.6.3 pypi_0 pypi
farm 0.6.2 pypi_0 pypi
farm-haystack 0.7.0 pypi_0 pypi
fastapi 0.61.1 pypi_0 pypi
fastapi-utils 0.2.1 pypi_0 pypi
fasteners 0.16 pypi_0 pypi
fasttext 0.9.2 pypi_0 pypi
fasttext-wheel 0.9.2 pypi_0 pypi
filelock 3.0.12 pypi_0 pypi
filetype 1.0.7 pypi_0 pypi
flake8 3.9.2 pypi_0 pypi
flask 1.1.2 pypi_0 pypi
flask-cors 3.0.9 pypi_0 pypi
flask-restplus 0.13.0 pypi_0 pypi
flatbuffers 1.12 pypi_0 pypi
future 0.18.2 pypi_0 pypi
gamechanger evergreen pypi_0 pypi
gamechangerml 0.1.0 dev_0
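For anyone who wants to repeat the corruption attempt described above (removing the header and scrambling the contents), here is a minimal sketch of one way to do it. The function name, the chunk size, and the fixed seed are assumptions made for illustration, not something taken from this thread, and as noted above this kind of damage did not reproduce the error.

```python
import random


def corrupt_pdf_bytes(data: bytes, seed: int = 0, chunk: int = 64) -> bytes:
    """Return a damaged copy of `data`: header line dropped, chunks shuffled.

    Hypothetical helper for building test fixtures; not part of gamechanger.
    """
    # Drop the first line if it is the "%PDF-x.y" version marker.
    if data.startswith(b"%PDF-"):
        newline = data.find(b"\n")
        data = data[newline + 1:] if newline != -1 else b""

    # Split the remainder into fixed-size chunks and shuffle them
    # deterministically so the same fixture can be regenerated.
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    random.Random(seed).shuffle(chunks)
    return b"".join(chunks)
```

To build a fixture, read a known-good PDF with `Path(...).read_bytes()`, pass the bytes through `corrupt_pdf_bytes`, and write the result to a new file rather than overwriting the original.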
Hey Nicholas, we've been making a lot of changes and bugfixes throughout the repo over the last couple of weeks (some of them in our parser, too). I'm not convinced these updates will resolve this issue as it stands, but could you git pull and verify that you're still getting the same error? If you are, I'll get back to you shortly about potentially testing fixes for this on a separate branch.
I have many folders of PDFs downloaded from a legacy database, some of which are corrupted and cannot be opened with Adobe Acrobat. I cannot attach an example here, sorry. mupdf struggles with these files and raises an error. Is there some way this could be caught and noted in the output JSON's metadata instead of stopping the entire run?
Stack Trace:
2021-07-01 14:53:54,764 - [INFO] - Document Parser has started Memory Hard Limit: -1 Soft Limit: -1 Maximum of percentage of memory use: 0.8 ____ 12826868____
2021-07-01 14:54:01.331068: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-07-01 14:54:01.331137: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-07-01 14:54:19,395 - [INFO] - Parsing Multiple Documents: 3 Current Time = 14:54:19
2021-07-01 14:54:19,396 - [INFO] - Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
running policy_analyics.parse on /mnt/c/Users/nwagner/Downloads/test/Document.pdf
2021-07-01 14:54:19,717 - [INFO] - Finished Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
2021-07-01 14:54:19,717 - [INFO] - Processing: <_MainProcess(MainProcess, started)> - Filename: Document.pdf
running policy_analyics.parse on /mnt/c/Users/nwagner/Downloads/test/Document.pdf
mupdf: cannot recognize version marker
mupdf: no objects found
Traceback (most recent call last):
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/nwagner/gamechanger-data/common/document_parser/main.py", line 17, in <module>
    cli()
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/nwagner/gamechanger-data/common/document_parser/cli.py", line 181, in pdf_to_json_cmd_wrapper
    num_ocr_threads=num_ocr_threads
  File "/home/nwagner/gamechanger-data/common/document_parser/cli.py", line 71, in pdf_to_json
    num_ocr_threads=num_ocr_threads
  File "/home/nwagner/gamechanger-data/common/document_parser/process.py", line 189, in process_dir
    single_process(item)
  File "/home/nwagner/gamechanger-data/common/document_parser/process.py", line 112, in single_process
    num_ocr_threads=num_ocr_threads, force_ocr=force_ocr, out_dir=out_dir)
  File "/home/nwagner/gamechanger-data/common/document_parser/parsers/policy_analytics/parse.py", line 37, in parse
    doc_obj = pdf_reader.get_fitz_doc_obj(f_name)
  File "/home/nwagner/gamechanger-data/common/document_parser/lib/pdf_reader.py", line 8, in get_fitz_doc_obj
    doc = fitz.open(f_name)
  File "/home/nwagner/miniconda3/envs/gc/lib/python3.6/site-packages/fitz/fitz.py", line 2494, in __init__
    _fitz.Document_swiginit(self, _fitz.new_Document(filename, stream, filetype, rect, width, height, fontsize))
RuntimeError: no objects found
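As a rough illustration of the behavior requested above (catch the failure and note it in the output JSON instead of stopping the run), here is a minimal sketch. It is not the gamechanger implementation: the function name `open_or_record_failure`, the `parse_failed` and `error` fields, and the stub-file layout are all assumptions. The opener is passed in as a callable (for example `fitz.open`) so the sketch stays independent of any particular PyMuPDF version. The trace shows `RuntimeError` for this install; other versions may raise a different exception type.

```python
import json
from pathlib import Path


def open_or_record_failure(pdf_path, out_dir, open_doc):
    """Open a PDF; if that fails, write a stub JSON noting the error.

    `open_doc` is any callable that opens the file, e.g. `fitz.open`.
    Returns the opened document, or None when opening failed.
    """
    pdf_path, out_dir = Path(pdf_path), Path(out_dir)
    try:
        return open_doc(str(pdf_path))
    except RuntimeError as err:  # the type raised in the trace above
        stub = {
            "filename": pdf_path.name,
            "parse_failed": True,  # hypothetical field name
            "error": str(err),     # e.g. "no objects found"
        }
        out_dir.mkdir(parents=True, exist_ok=True)
        out_file = out_dir / (pdf_path.stem + ".json")
        out_file.write_text(json.dumps(stub, indent=2))
        return None
```

A caller looping over a directory could then skip any file for which this returns `None` and carry on with the rest, leaving the stub JSON as a record of what went wrong.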