ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

DataShare will OCR even when you tell it not to #336

Closed DrWhax closed 4 years ago

DrWhax commented 4 years ago

In version 4.21.0 and some earlier versions, I tried to index a bunch of data and it was really slow. When I looked for text in images, I noticed Datashare was trying to OCR them, even though I had told it not to OCR when I started indexing.

I guess this is a bug?

pirhoo commented 4 years ago

Hi @DrWhax, you're right! It looks like we have a regression here @bamthomas.

DrWhax commented 4 years ago

Not sure how releases work exactly, but it seems it's still trying to OCR in Datashare 5.8.23, even though I explicitly disable it.

Soliine commented 4 years ago

Hi @DrWhax, we are going to look into this. Thanks! Soline and @bamthomas.

bamthomas commented 4 years ago

@DrWhax I just tried with 5.9.20 and it worked.

Note that the following log:

WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

is displayed in both cases, I think because Tika detects that the Tesseract library is installed.

Nevertheless, when indexing ~10 docs with:

datashare --ocr false -s SCAN,INDEX -m CLI

it takes 9s, whereas with OCR it takes 39s and I see logs like:

Mar 31, 2020 3:24:12 PM org.apache.pdfbox.jbig2.util.log.JDKLogger error
SEVERE: No global segment added so far. Use JBIG2ImageReader.setGlobals().

These are not shown without OCR.

Could you try with the latest version?

DrWhax commented 4 years ago

Hi! I think it's mostly resolved. I still see tesseract being busy, but I think that happens with eml files: it might still try to extract embedded images and do something with them?

Any tips on how I would be able to confirm this for y'all?

bamthomas commented 4 years ago

I've tried EML extraction with attachments (a PDF, and an image inside the PDF). When I disable OCR, Tika finds the embedded files but does not analyze the image inside the PDF (so it does not find the text). In that case, the eml file is parsed in a few tenths of a second.

When I enable OCR, it indexes the text of the image and the process takes significantly longer.

pirhoo commented 4 years ago

With Datashare 7.1.4: if I start indexing without OCR from the web interface, OCR is performed anyway. Datashare is started with --ocr=true (the default) which might be the cause of the confusion.

annelhote commented 4 years ago

True! On this line, even if the ocr option of optionsWrapper is set to false, the ocr option of properties will be set to true. @bamthomas or @mvanzalu, any (good) reason why?

mvanzalu commented 4 years ago

@annelhote the function createMerged on this line will only add the options of optionsWrapper which aren't defined in properties. So if the ocr option of properties is already set, it won't be modified.
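
To make that concrete, here is a rough sketch of a "fill in only the missing keys" merge on plain java.util.Properties. It is not the actual createMerged code: the class name, the method name and the "language" key are hypothetical, and I'm assuming both sides can be treated as simple key/value properties.

```java
import java.util.Properties;

// Hypothetical illustration only, not Datashare's real createMerged.
class MergeMissingOnly {
    // Returns a copy of `base`, plus the entries of `extra` whose keys
    // are NOT already defined in `base`. Existing keys are left alone.
    static Properties mergeMissingOnly(Properties base, Properties extra) {
        Properties merged = new Properties();
        merged.putAll(base);
        for (String key : extra.stringPropertyNames()) {
            merged.putIfAbsent(key, extra.getProperty(key));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Server started with --ocr=true (the default).
        Properties startup = new Properties();
        startup.setProperty("ocr", "true");

        // Request asking for no OCR; "language" is a made-up extra key.
        Properties request = new Properties();
        request.setProperty("ocr", "false");
        request.setProperty("language", "FRENCH");

        Properties merged = mergeMissingOnly(startup, request);
        System.out.println(merged.getProperty("ocr"));      // prints "true"
        System.out.println(merged.getProperty("language")); // prints "FRENCH"
    }
}
```

With this kind of merge the ocr=false coming from the request is silently dropped, which matches what @pirhoo saw from the web interface.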

annelhote commented 4 years ago

Is that what we want? The options coming from the front should override the default options, right?
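
Something like the sketch below is what I would expect. It is only an illustration on plain java.util.Properties, not a proposed patch: the class name, the method name and the "language" key are hypothetical.

```java
import java.util.Properties;

// Hypothetical illustration only, not the actual Datashare fix.
class OverrideMerge {
    // Returns a new Properties where every key of `overrides` wins over
    // the same key in `defaults`; keys only in `defaults` are kept.
    static Properties mergeWithOverride(Properties defaults, Properties overrides) {
        Properties merged = new Properties();
        merged.putAll(defaults);
        merged.putAll(overrides); // later putAll replaces existing keys
        return merged;
    }

    public static void main(String[] args) {
        // Defaults from startup: --ocr=true; "language" is a made-up key.
        Properties defaults = new Properties();
        defaults.setProperty("ocr", "true");
        defaults.setProperty("language", "ENGLISH");

        // Options sent by the front for this indexing task.
        Properties fromFront = new Properties();
        fromFront.setProperty("ocr", "false");

        Properties merged = mergeWithOverride(defaults, fromFront);
        System.out.println(merged.getProperty("ocr"));      // prints "false"
        System.out.println(merged.getProperty("language")); // prints "ENGLISH"
    }
}
```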

mvanzalu commented 4 years ago

I think so too, I'll take a look.

mvanzalu commented 4 years ago

Indeed, there was something wrong. Fix incoming.