abrom / henkei

Read text and metadata from files and documents (.doc, .docx, .pages, .odt, .rtf, .pdf)
http://github.com/abrom/henkei
MIT License

Suppress INFO messages? #3

Open FinnWoelm opened 5 years ago

FinnWoelm commented 5 years ago

Hi there,

First of all: Thank you for forking Yomu and bringing it back alive. Absolutely amazing work.

Second: Any idea on how I might suppress INFO messages from showing up? These occur when I'm parsing a PDF document. My Rails logger is set to warning, but I'm guessing these show because they're coming directly from Apache Tika.

INFO  To get higher rendering speed on JDK8 or later,
INFO    use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
INFO    or call System.setProperty("sun.java2d.cmm", "sun.java2d.cmm.kcms.KcmsServiceProvider")

Cheers, Finn

abrom commented 5 years ago

Yeah, unfortunately that's coming from PDFBox, which Tika uses. It stems from a change in Java 8, where the default is LittleCMS instead of KCMS. According to their own documentation:

KCMS is the unmaintained, legacy provider and is far faster than the newer replacement.
However, there are stability and security risks with using the unmaintained legacy provider.

So why they feel it necessary to spout all of that 'information' about it is beyond me.

The info itself is coming from: https://github.com/apache/pdfbox/blob/f83bcc1fe60502759024a3b51983b29c7de66327/pdfbox/src/main/java/org/apache/pdfbox/rendering/PDFRenderer.java#L394

I did look into it a while back, and as far as I could tell there wasn't really a nice way to suppress this info (and not end up suppressing ALL info). I've just been putting up with it.

If you feel the need, you can override the config for the pdfbox logger and pipe its output somewhere else.

FinnWoelm commented 5 years ago

Thanks for the fast reply! Much appreciated!

Since you have used Apache Tika much longer than I have: do you think I would lose anything important if I filtered all 'INFO' messages out of the return value of io.read?

I would imagine any issues of crucial concern would have an ERROR or WARNING status. It could even become a Henkei setting, e.g. Henkei.log_info = true/false.

abrom commented 5 years ago

Hmm, that sounds a bit dangerous (i.e. you could filter out non-INFO things you didn't mean to). I would think the more reliable solution would be to override the config for the pdfbox logger and simply change the logger level.
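
To illustrate the risk, here is a minimal sketch of line-based filtering. It assumes the output arrives as a plain string in which log lines begin with `INFO`. The helper name `strip_info_lines` and the sample text are hypothetical and not part of Henkei.

```ruby
# Hypothetical helper, not part of Henkei: drops every line that starts with "INFO".
def strip_info_lines(output)
  output.each_line.reject { |line| line.start_with?('INFO') }.join
end

# Made-up sample: two PDFBox log lines followed by text extracted from a document.
raw = <<~OUT
  INFO  To get higher rendering speed on JDK8 or later,
  INFO    use the option -Dsun.java2d.cmm=sun.java2d.cmm.kcms.KcmsServiceProvider
  Chapter 1
  INFO desks are located on every floor.
OUT

filtered = strip_info_lines(raw)
puts filtered
```

The filter cannot tell a log line from document text that happens to start with the same word, so the last line of the sample is dropped along with the two log lines. Changing the logger level avoids that ambiguity because the messages are never written in the first place.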

FinnWoelm commented 5 years ago

Hmmm, fair enough.

I'm pretty unfamiliar with Java, which is why I tried to avoid touching the pdfbox logger config :sweat_smile: Is that something I would do in jar/tika-config.xml?

abrom commented 5 years ago

It's been a while since I've looked at Java.

The pdfbox library uses the Apache Commons Logging library so I think that'd be the place to start: https://commons.apache.org/proper/commons-logging/guide.html#Quick_Start

It appears to be more of a wrapper for other logging systems, and I have no idea which one would actually be in use. It seems to depend on what you have installed.