I'm using Apache Tika to OCR a bunch of PDFs. When I use the GUI (launched with java -jar tika-app-1.22.jar), everything works fine: I go to "Recursive JSON" in the "View" menu and the text is all there (even though nothing appears under "Main Content"). But with the Python wrapper I don't see any option to extract "Recursive JSON" objects, and print(parsed['content']) prints an empty string, even though print(parsed['metadata']) prints the metadata correctly. I need the content. What am I missing?
Without the file you were testing, I can't really comment on this. It seems like the error originates upstream in the Tika library, though, so I recommend asking on dev@tika.apache.org. @tballison
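In the meantime, one thing that may be worth trying is retrying the parse with an explicit PDF OCR strategy when the content comes back blank. The sketch below is only a guess at a workaround, not a confirmed fix: the helper names has_text and parse_with_ocr_fallback are made up for illustration, and it assumes that your tika-python version accepts a headers argument on parser.from_file and that the Tika server honors the X-Tika-PDFOcrStrategy header (which also requires Tesseract to be installed).

```python
# Hypothetical helpers for illustration; not part of tika-python.

# Assumption: the Tika server maps this header onto its PDF parser config.
OCR_HEADERS = {"X-Tika-PDFOcrStrategy": "ocr_only"}


def has_text(parsed):
    """Return True if a tika-python result dict carries non-blank content."""
    content = parsed.get("content")
    return bool(content and content.strip())


def parse_with_ocr_fallback(path):
    """Parse normally, then retry with OCR-only headers if no text came back."""
    # Imported lazily so the helpers above work without tika-python or Java.
    from tika import parser

    parsed = parser.from_file(path)
    if not has_text(parsed):
        # Assumption: from_file accepts a `headers` keyword in your version.
        parsed = parser.from_file(path, headers=OCR_HEADERS)
    return parsed
```

Usage would look like parsed = parse_with_ocr_fallback('scan.pdf') followed by print(parsed['content']), where scan.pdf stands in for one of your files. If the content is still empty after the retry, that points more strongly at an upstream issue.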