Closed DrWhax closed 4 years ago
Hi @DrWhax, you're right! It looks like we have a regression here @bamthomas.
Not sure how releases work exactly, but it seems it's still trying to OCR in Datashare 5.8.23, even though I explicitly disable it.
Hi @DrWhax, we are going to have a look into this. Thanks! Soline and @bamthomas.
@DrWhax I just tried with 5.9.20 and it worked.
Note that the following log :
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
is displayed in both cases; I think this is because Tika detects that the Tesseract library is installed.
Nevertheless, when indexing ~10 docs:
datashare --ocr false -s SCAN,INDEX -m CLI
it takes 9s, whereas with OCR it takes 39s and I see logs like:
Mar 31, 2020 3:24:12 PM org.apache.pdfbox.jbig2.util.log.JDKLogger error
SEVERE: No global segment added so far. Use JBIG2ImageReader.setGlobals().
These are not shown without OCR.
Could you try with the latest version?
Hi! I think it's mostly resolved. I still see tesseract being busy, but I think that's with eml files: it might still try to extract embedded images and do something with them?
Any tips on how I would be able to confirm this for y'all?
I've tried EML extraction with attachments (a PDF, and an image inside the PDF). When I disable OCR, Tika finds the embedded files but does not analyze the image inside the PDF (so it does not find the text). In that case, the eml file is parsed in a few tenths of a second.
When I enable OCR, it indexes the text of the image and the process takes significantly longer.
With Datashare 7.1.4: if I start indexing without OCR from the web interface, OCR is performed anyway. Datashare is started with --ocr=true (the default), which might be the cause of the confusion.
True! On this line, even if the ocr option of optionsWrapper is set to false, the ocr option of properties will be set to true. @bamthomas or @mvanzalu any (good) reason why?
@annelhote the function createMerged on this line will only add the options of optionsWrapper that aren't defined in properties. So if the ocr option of properties is already set, it won't be modified.
Is that what we want? The options coming from the front end should override the default options, right?
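To make the difference concrete, here is a minimal sketch of the two merge behaviours being discussed. It uses plain Python dictionaries rather than Datashare's Java code, and the function and variable names (merge_keep_existing, merge_override, startup_properties, frontend_options) are hypothetical, chosen only for illustration. It assumes the behaviour described above: the existing merge only adds keys that are missing, so a value already present in properties wins.

```python
def merge_keep_existing(properties, options):
    """Add only the keys from options that properties does not define yet.

    This mirrors the behaviour described above: an existing key is never
    overwritten, so the startup value wins.
    """
    merged = dict(properties)
    for key, value in options.items():
        merged.setdefault(key, value)
    return merged


def merge_override(properties, options):
    """Let every key from options replace the value found in properties."""
    return {**properties, **options}


# Hypothetical values: Datashare started with the default --ocr=true,
# while the user disabled OCR for this indexing task in the web interface.
startup_properties = {"ocr": "true"}
frontend_options = {"ocr": "false"}

print(merge_keep_existing(startup_properties, frontend_options))  # {'ocr': 'true'}
print(merge_override(startup_properties, frontend_options))       # {'ocr': 'false'}
```

With the first merge the request from the front end is silently ignored, which matches the symptom reported with 7.1.4. The second merge is what the question above suggests we probably want.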
I think so too, I'll take a look
Indeed there was something wrong, fix incoming
In version 4.21.0 and some versions before, I tried to index a bunch of data, but it was really slow. When I tried to see text in images, I noticed it was trying to OCR images, even though I told it not to OCR when indexing.
I guess this is a bug?