chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

Tika parser with TesseractOCR #363

Closed tarunsharma2015 closed 1 year ago

tarunsharma2015 commented 2 years ago

I am facing a problem while extracting content from a PDF: the returned content is None for PDFs that consist of images. The same code works on my local setup but fails on AWS Lambda.

I followed the steps to install Tesseract on AWS Lambda as a layer, but I am still not able to find the error.

Output from tika parser on AWS lambda -

{'metadata': {'Content-Type': 'application/pdf', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.pdf.PDFParser'], 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '563', 'access_permission:assemble_document': 'true', 'access_permission:can_modify': 'true', 'access_permission:can_print': 'true', 'access_permission:can_print_degraded': 'true', 'access_permission:extract_content': 'true', 'access_permission:extract_for_accessibility': 'true', 'access_permission:fill_in_form': 'true', 'access_permission:modify_annotations': 'true', 'dc:format': 'application/pdf; version=1.4', 'pdf:PDFVersion': '1.4', 'pdf:charsPerPage': '0', 'pdf:encrypted': 'false', 'pdf:hasMarkedContent': 'false', 'pdf:hasXFA': 'false', 'pdf:hasXMP': 'false', 'pdf:unmappedUnicodeCharsPerPage': '0', 'resourceName': "b'test-img-tika-1.pdf'", 'xmpTPg:NPages': '1'}, 'content': None, 'status': 200}
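The output above shows two signs that the PDF has no text layer and that OCR did not run: 'content' is None and 'pdf:charsPerPage' is '0'. A minimal sketch of a check for that situation follows. The helper name looks_image_only is hypothetical, and the handling of a list value for 'pdf:charsPerPage' on multi-page files is an assumption.

```python
def looks_image_only(parsed):
    """Return True when a Tika result has no text and no page reports any characters."""
    content = parsed.get("content")
    if content and content.strip():
        return False  # some text was extracted
    # Assumption: the value is a string for one page and a list for several pages.
    chars = parsed.get("metadata", {}).get("pdf:charsPerPage", [])
    if isinstance(chars, str):
        chars = [chars]
    # If the key is missing, the empty content alone decides the result.
    return all(int(c) == 0 for c in chars)


# Trimmed copy of the result reported in this issue.
sample = {
    "metadata": {"pdf:charsPerPage": "0", "xmpTPg:NPages": "1"},
    "content": None,
    "status": 200,
}
print(looks_image_only(sample))
```

A result that passes this check most likely needs OCR to yield any text, so it can help to log it before looking at the Tesseract setup.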

Tesseract path - all libraries are placed under the lib folder of the zip that is uploaded to AWS Lambda.

Libraries - liblept.so.5, libtesseract.so.4, libwebp.so.4

Code -

import boto3
import tika
from tika import parser
import tempfile, os
import requests
from urllib.request import urlretrieve
from tika import detector

SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')
print(f"library path : {LIB_DIR}")
os.system(f'export LD_LIBRARY_PATH={LIB_DIR}')

def lambda_handler(event, context):
    tika.initVM()
    headers = {"X-Tika-OCRLanguage": "eng"}
    parsed = parser.from_file(filepath, headers=headers)
    print(parsed)

PS: The same code works fine when the PDF has no image content.
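One detail in the code above may be worth checking: os.system() runs its command in a separate shell, so an export made there does not change the environment of the Python process or of anything it starts afterwards. The thread does not establish that this causes the None content. The sketch below sets the variable through os.environ instead. The helper name prepend_env_path is hypothetical, and it is an assumption that the Tika server is started later as a child process that inherits this environment.

```python
import os


def prepend_env_path(name, directory, env=os.environ):
    """Put directory at the front of a path-list variable in env and return the new value."""
    current = env.get(name, "")
    # Keep existing entries, dropping empty ones and any duplicate of directory.
    kept = [p for p in current.split(os.pathsep) if p and p != directory]
    env[name] = os.pathsep.join([directory] + kept)
    return env[name]


# Demonstration on a plain dict, so the real environment is left untouched.
demo_env = {"LD_LIBRARY_PATH": "/usr/lib"}
print(prepend_env_path("LD_LIBRARY_PATH", "/var/task/lib", demo_env))
```

In the Lambda module this would be called as prepend_env_path("LD_LIBRARY_PATH", LIB_DIR) before the first call to parser.from_file().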

ln-P commented 2 years ago

Hey, have a look at the fix I proposed in https://github.com/chrismattmann/tika-python/pull/364, and then set the headers argument of from_file() to headers={'X-Tika-PDFextractInlineImages': 'true'}.
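A minimal sketch of that suggestion, combined with the OCR language header from the original code, could look like this. The helper names build_ocr_headers and parse_pdf are hypothetical, and the header presumably only takes effect with a tika-python version that includes the fix from the pull request above.

```python
def build_ocr_headers(language="eng", extract_inline_images=True):
    """Build the request headers passed to the Tika server for an OCR parse."""
    headers = {"X-Tika-OCRLanguage": language}
    if extract_inline_images:
        # Header suggested in this thread; asks the PDF parser to extract inline images.
        headers["X-Tika-PDFextractInlineImages"] = "true"
    return headers


def parse_pdf(path):
    """Parse a PDF with the OCR headers; needs tika-python and a reachable Tika server."""
    from tika import parser  # imported here so the header helper works without tika installed
    return parser.from_file(path, headers=build_ocr_headers())


print(build_ocr_headers())
```

Keeping the header construction separate from the parse call makes it easy to log exactly what is sent when comparing the local run with the Lambda run.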

chrismattmann commented 1 year ago

Yes, I merged #364, so hopefully this addresses it.