abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Update Apache Tika to v2.2.0 #23

Closed by abrom 2 years ago

abrom commented 2 years ago

Apache Tika v2.x brings some changes with it. One key change is that the Tika client and server applications have been split up. To keep the gem size down, Henkei will only include the client app. This means that each call to Henkei starts a new Java process, which runs your command and then terminates.
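
To illustrate the process-per-call model described above, here is a minimal sketch in Ruby. It is not Henkei's actual implementation: `run_once` and `STAND_IN` are hypothetical names, and a small Ruby one-liner stands in for the `java -jar ...` invocation so the sketch runs without Java installed.

```ruby
require 'open3'
require 'rbconfig'

# Hypothetical helper (not part of Henkei): starts one short-lived child
# process per call, feeds `input` to its stdin, and returns its stdout.
def run_once(command, input)
  # Passing the command as separate arguments avoids going through a shell.
  stdout, status = Open3.capture2(*command, stdin_data: input, binmode: true)
  raise "command failed with status #{status.exitstatus}" unless status.success?

  stdout
end

# Assumption: a stand-in for the Tika client command, used only so that
# this sketch is runnable without a JVM.
STAND_IN = [RbConfig.ruby, '-e', 'print $stdin.read.upcase'].freeze

# Each call below pays the full start-up cost of a new process.
first  = run_once(STAND_IN, 'hello')
second = run_once(STAND_IN, 'world')
```

Because nothing stays running between calls, every invocation pays the process start-up cost, which for a real JVM is noticeably higher than for this stand-in.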

Another change is to the metadata keys. Many duplicate keys have been removed in favour of a more standards-based approach. A list of the old vs new key names can be found here

Note 1: For anyone concerned about log4j's CVE-2021-44228, Tika 2.2.0 includes log4j 2.15.0 (which disables message lookups by default)

Note 2: By default, the updated Tika logs an INFO message about the performance impact of the TesseractOCR library. I have made Henkei v2.x behave the same as v1.x by making loading of the OCR library opt-in.

I've tried to disable the INFO message by specifying a Log4j configuration file (see below). However, my knowledge of Log4j is limited, and specifying the config file appears to disable logging of all messages. I don't think that is a good option, as it would also mute any "real" errors. I tried enabling Log4j debugging, which showed that the config was loading successfully, but there was still no output. Any input on how to do this "properly" would be appreciated!

java -Dlog4j.configurationFile=path/to/log4j-config.xml -Dlog4j2.debug=true -jar path/to/tika-app.jar .... etc etc ....
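
For anyone wanting to experiment with the command above from Ruby, here is a minimal sketch of how a wrapper might assemble it. `tika_command` and its parameters are hypothetical names for illustration, not Henkei's API. The two `-D` system properties are the ones shown in the command above, and `--text` is used as an example Tika argument.

```ruby
# Hypothetical helper (not part of Henkei): builds the argument list for a
# Tika client call, adding the Log4j system properties only when requested.
def tika_command(jar_path, tika_args, log4j_config: nil, log4j_debug: false)
  command = ['java']
  # Point Log4j 2 at a custom configuration file, if one was given.
  command << "-Dlog4j.configurationFile=#{log4j_config}" if log4j_config
  # Ask Log4j 2 to print its own internal debug output.
  command << '-Dlog4j2.debug=true' if log4j_debug
  command + ['-jar', jar_path, *tika_args]
end

command = tika_command(
  'path/to/tika-app.jar',
  ['--text'],
  log4j_config: 'path/to/log4j-config.xml',
  log4j_debug: true
)
```

Returning an array rather than a single string means the result can be passed to `Open3` or `system` without shell quoting, so paths containing spaces are handled safely.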