chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Apache License 2.0

How to access Apache Tika's Recursive JSON object using python-tika? #362

Closed NLPOR closed 1 year ago

NLPOR commented 2 years ago

I'm using Apache Tika to OCR a batch of PDFs. In the GUI (launched with java -jar tika-app-1.22.jar) everything works fine: when I choose "Recursive JSON" on the "View" menu, the text is all there, even though nothing appears under "Main Content". With the Python wrapper, however, I don't see any option to extract the "Recursive JSON" output, and print(parsed['content']) returns an empty string, although print(parsed['metadata']) returns the metadata correctly. I need the content. What am I missing?

chrismattmann commented 1 year ago

Without the file you were testing, I can't really comment on this. The error seems to originate upstream in the Tika library itself, so I recommend asking about it on dev@tika.apache.org. @tballison
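
For readers with the same symptom, here is a minimal sketch of pulling text out of Tika's recursive JSON output. It may help to see where the text sits, but it is not a confirmed fix for this issue. It assumes the output is a JSON array with one metadata object per document (the container first, then each embedded document) and that the extracted text sits under the "X-TIKA:content" key. The sample data and the helper name collect_content are hypothetical. How you obtain the raw JSON (for example from the Tika server's /rmeta endpoint) depends on your setup and is not shown.

```python
import json

# Hypothetical sample shaped like Tika's recursive JSON output:
# one metadata object per document, container first, embedded parts after.
SAMPLE_RMETA = json.dumps([
    {"Content-Type": "application/pdf", "X-TIKA:content": ""},
    {"Content-Type": "image/png", "X-TIKA:content": "text recognized by OCR\n"},
])


def collect_content(rmeta_json, content_key="X-TIKA:content"):
    """Join the non-empty text of every document in a recursive JSON array."""
    documents = json.loads(rmeta_json)
    parts = []
    for doc in documents:
        # The key may be missing or null when a part yielded no text.
        text = doc.get(content_key) or ""
        if text.strip():
            parts.append(text.strip())
    return "\n".join(parts)


print(collect_content(SAMPLE_RMETA))
```

If the container entry is empty but an embedded entry holds the text, as in the sample, that would match what the GUI shows: nothing under "Main Content" but text under "Recursive JSON".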