ICIJ / node-tika

Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
MIT License

Upgrade to Tika v1.20 #23

Open pratheekrebala opened 6 years ago

pratheekrebala commented 6 years ago

All tests are passing.

chriszs commented 6 years ago

Just noticed some added warnings:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Jan 19, 2018 3:36:07 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jan 19, 2018 3:36:07 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Jan 19, 2018 3:36:07 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

See this discussion.

Looks like we could turn these warnings off in tika-config.xml using Tim's patch, bundle some of the missing dependencies such as J2KImageReader, or both. Node users probably won't want to deal with Java warnings from a package they install, so we may want to make some sane choices for them and turn the warnings off.
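As a rough sketch of the tika-config.xml route: the snippet below generates a minimal config that asks Tika to ignore these initialization problems. It assumes Tika's `<service-loader>` element accepts an `initializableProblemHandler` attribute with the values `ignore`, `info`, `warn` and `throw`; that should be checked against the Tika version being bundled. `buildTikaConfig` and `writeTikaConfig` are hypothetical helper names, and how (or whether) node-tika would hand the file to the Java side is left open here.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumption: these are the handler names Tika accepts on <service-loader>.
const ALLOWED_HANDLERS = ['ignore', 'info', 'warn', 'throw'];

// Hypothetical helper: returns the XML for a minimal tika-config.xml.
function buildTikaConfig(problemHandler = 'ignore') {
  if (!ALLOWED_HANDLERS.includes(problemHandler)) {
    throw new Error(`Unsupported problem handler: ${problemHandler}`);
  }
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<properties>',
    `  <service-loader initializableProblemHandler="${problemHandler}"/>`,
    '</properties>',
    ''
  ].join('\n');
}

// Hypothetical helper: writes the config into `dir` and returns its path.
function writeTikaConfig(dir, problemHandler) {
  const file = path.join(dir, 'tika-config.xml');
  fs.writeFileSync(file, buildTikaConfig(problemHandler), 'utf8');
  return file;
}

// Write into a throwaway temp directory so the sketch has no side effects
// on the project tree.
const configPath = writeTikaConfig(fs.mkdtempSync(path.join(os.tmpdir(), 'tika-')));
console.log(configPath);
```

This would only cover the messages logged through InitializableProblemHandler (J2KImageReader, Tesseract, sqlite-jdbc). The SLF4J lines are printed by SLF4J itself and would need a logger binding on the classpath instead.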

Looks like JBIG2 ImageIO was made optional due to license incompatibility, but has since been re-licensed to be compatible.

I'm a little unclear on the need for StaticLoggerBinder. Does it get used purely for development logging, or does it communicate potentially important errors?

We could almost certainly turn the Tesseract warning off in this version, though in some future Tika version it will instead indicate that Tesseract is turned off by default.

I have no idea if bundling SQLite compatibility by default is a good idea or not. The stated reason for making it optional is "potential conflicts of native libraries in web servers."

pratheekrebala commented 5 years ago

@mattcg Would it be possible to merge this PR? The failing test is a timeout that occurs when Travis tries to fetch the FTP file. It seems to be unrelated to this package.

Thank you! 🙏 🙏 🙏