Thank you for the bug report! Can you provide a sample page of your OCR? The way the markup is parsed has changed with the new version: we now use a proper XML parser instead of the previous state-machine approach, so it's likely that I missed something.
@jbaiter This is the content of the field in Solr after ingesting; is this what you need?
<?xml version="1.0" encoding="UTF-8"?>
<ocr>
<p xml:id="sequence_3" wh="2479 3509">
<b>
<l>
<w x=".119 .045 .07 .011">Rapporto</w>
<w x=".195 .045 .06 .01">Tecnico,</w>
<w x=".262 .048 .056 .006">numero</w>
<w x=".323 .045 .011 .011">3,</w>
<w x=".34 .045 .052 .011">Agosto</w>
<w x=".397 .044 .037 .009">2016</w>
</l>
<l>
<w x=".134 .142 .106 .02">FABB</w>
<w x=".255 .142 .182 .027">Repository</w>
<w x=".45 .142 .049 .021">dal</w>
<w x=".511 .145 .138 .024">progetto</w>
<w x=".663 .142 .028 .021">al</w>
<w x=".702 .142 .161 .027">prototipo.</w>
</l>
<l>
<w x=".124 .177 .111 .02">Nuove</w>
<w x=".246 .176 .099 .021">forme</w>
<w x=".358 .176 .03 .021">di</w>
<w x=".401 .176 .247 .025">conservazione,</w>
<w x=".662 .176 .213 .021">condivisione</w>
</l>
<l>
<w x=".226 .217 .017 .014">e</w>
<w x=".255 .21 .243 .021">valorizzazione</w>
<w x=".51 .21 .03 .021">di</w>
<w x=".553 .217 .091 .02">opere</w>
<w x=".657 .21 .117 .027">digitali</w>
</l>
<l>
<w x=".263 .323 .095 .012">Giancarlo</w>
<w x=".365 .323 .069 .014">Birello,</w>
<w x=".442 .324 .052 .011">Ivano</w>
<w x=".502 .323 .063 .014">Fucile,</w>
<w x=".575 .323 .058 .012">Valter</w>
<w x=".639 .323 .097 .011">Giovanetti</w>
</l>
<l>
<w x=".411 .349 .093 .01">Ircres-CNR</w>
<w x=".512 .349 .053 .013">Ufficio</w>
<w x=".571 .349 .02 .009">IT</w>
</l>
<l>
<w x=".418 .365 .046 .009">Strada</w>
<w x=".47 .365 .035 .009">delle</w>
<w x=".51 .365 .048 .011">Cacce,</w>
<w x=".565 .365 .017 .009">73</w>
</l>
<l>
<w x=".432 .38 .043 .009">10135</w>
<w x=".481 .38 .05 .009">Torino</w>
<w x=".537 .38 .033 .012">Italy</w>
</l>
<l>
<w x=".44 .426 .05 .011">Anna</w>
<w x=".497 .426 .062 .011">Perin*</w>
</l>
<l>
<w x=".409 .452 .093 .01">Ircres-CNR</w>
<w x=".508 .451 .083 .01">Biblioteca</w>
</l>
<l>
<w x=".42 .468 .026 .009">Via</w>
<w x=".451 .468 .033 .009">Real</w>
<w x=".49 .468 .067 .012">Collegio,</w>
<w x=".563 .468 .017 .009">30</w>
</l>
<l>
<w x=".4 .483 .044 .009">10024</w>
<w x=".449 .483 .08 .009">Moncalieri</w>
<w x=".535 .483 .024 .009">TO</w>
<w x=".564 .483 .033 .012">Italy</w>
</l>
<l>
<w x=".119 .563 .109 .01">ABSTRACT:</w>
<w x=".236 .563 .051 .01">FABB</w>
<w x=".293 .563 .056 .013">project</w>
<w x=".355 .563 .066 .012">(Famine</w>
<w x=".428 .563 .028 .01">and</w>
<w x=".462 .564 .046 .011">Feast,</w>
<w x=".515 .564 .044 .01">Fame</w>
<w x=".565 .567 .008 .007">e</w>
<w x=".58 .563 .107 .012">Abbondanza)</w>
<w x=".693 .563 .026 .01">has</w>
<w x=".726 .563 .038 .01">been</w>
<w x=".769 .563 .086 .01">committed</w>
<w x=".861 .563 .02 .013">by</w>
</l>
<l>
<w x=".119 .58 .093 .01">Fondazione</w>
<w x=".221 .58 .042 .01">CRT.</w>
<w x=".273 .58 .034 .01">This</w>
<w x=".317 .58 .072 .01">technical</w>
<w x=".397 .581 .048 .011">report</w>
<w x=".453 .58 .067 .013">analyzes</w>
<w x=".53 .58 .024 .01">the</w>
<w x=".563 .58 .074 .013">strategies</w>
<w x=".647 .58 .063 .013">adopted</w>
<w x=".718 .58 .029 .01">and</w>
<w x=".755 .58 .024 .01">the</w>
<w x=".787 .58 .04 .01">main</w>
<w x=".835 .583 .045 .01">open-</w>
</l>
<l>
<w x=".12 .599 .051 .007">source</w>
<w x=".182 .596 .068 .01">software</w>
<w x=".259 .596 .041 .01">used.</w>
<w x=".311 .596 .093 .01">Ircres-CNR</w>
<w x=".414 .596 .026 .01">has</w>
<w x=".451 .596 .073 .013">deployed</w>
<w x=".534 .596 .024 .01">the</w>
<w x=".569 .596 .068 .01">software</w>
<w x=".647 .596 .028 .01">and</w>
<w x=".686 .599 .048 .007">server</w>
<w x=".743 .596 .076 .013">platforms</w>
<w x=".83 .596 .017 .01">of</w>
<w x=".856 .596 .024 .01">the</w>
</l>
<l>
<w x=".119 .613 .086 .013">repository,</w>
<w x=".216 .613 .015 .01">in</w>
<w x=".242 .616 .008 .007">a</w>
<w x=".261 .613 .086 .01">virtualized</w>
<w x=".358 .613 .028 .01">and</w>
<w x=".396 .613 .08 .01">redundant</w>
<w x=".487 .613 .113 .012">infrastructure,</w>
<w x=".611 .613 .011 .01">it</w>
<w x=".633 .613 .031 .01">also</w>
<w x=".675 .613 .033 .01">take</w>
<w x=".718 .616 .033 .007">care</w>
<w x=".762 .613 .017 .01">of</w>
<w x=".789 .613 .024 .01">the</w>
<w x=".823 .613 .056 .013">design,</w>
</l>
<l>
<w x=".119 .629 .104 .013">development</w>
<w x=".23 .629 .028 .01">and</w>
<w x=".264 .63 .103 .011">management</w>
<w x=".372 .629 .017 .01">of</w>
<w x=".395 .629 .024 .01">the</w>
<w x=".425 .629 .032 .01">web</w>
<w x=".464 .629 .046 .013">portal</w>
<w x=".517 .629 .087 .012">(front-end)</w>
<w x=".611 .629 .023 .01">for</w>
<w x=".64 .629 .024 .01">the</w>
<w x=".67 .629 .102 .013">presentation,</w>
<w x=".779 .629 .067 .01">research</w>
<w x=".852 .629 .028 .01">and</w>
</l>
<l>
<w x=".119 .645 .084 .013">consulting</w>
<w x=".209 .645 .033 .01">data</w>
<w x=".247 .645 .017 .01">of</w>
<w x=".269 .645 .024 .01">the</w>
<w x=".299 .645 .085 .013">digitalized</w>
<w x=".389 .645 .042 .01">items</w>
<w x=".438 .645 .054 .013">(lyrics,</w>
<w x=".499 .645 .044 .013">lyrics</w>
<w x=".549 .647 .034 .01">text,</w>
<w x=".589 .645 .088 .012">interviews,</w>
<w x=".683 .645 .052 .012">books,</w>
<w x=".741 .645 .063 .013">poems).</w>
</l>
<l>
<w x=".12 .695 .04 .009">KEY</w>
<w x=".165 .694 .077 .01">WORDS:</w>
<w x=".25 .698 .102 .01">open-source,</w>
<w x=".358 .694 .078 .012">islandora,</w>
<w x=".442 .694 .086 .013">repository,</w>
<w x=".534 .694 .05 .013">digital</w>
<w x=".591 .694 .063 .012">archive,</w>
<w x=".66 .694 .061 .01">cultural</w>
<w x=".726 .694 .064 .013">heritage</w>
</l>
<l>
<w x=".119 .744 .033 .01">JEL</w>
<w x=".157 .744 .069 .01">CODES:</w>
<w x=".234 .744 .03 .01">Z11</w>
</l>
<l>
<w x=".119 .864 .202 .001">____________________</w>
</l>
<l>
<w x=".12 .886 .119 .013">*Corresponding</w>
<w x=".244 .887 .05 .009">author:</w>
<w x=".302 .887 .178 .012">anna.perin@ircres.cnr.it</w>
</l>
</b>
</p>
</ocr>
Thanks !!!
Thank you, that helps a lot :-) It's likely a bug in the way implicit whitespace is handled when dealing with MiniOCR, will provide a fix tomorrow!
@jbaiter Great! so quick, I'm available to check the fix in our production deployment! Thanks.
@jbaiter thanks so much. You are just awesome 🥇
So I just built a test case with the provided page, and for some reason I can't seem to reproduce the problem. For example, here's the snippet I get for the query "consulting data of the digitized items":
<lst>
<str name="text">repository, in a virtualized and redundant infrastructure, it also take care of the design, development and management of the web portal (front-end) for the presentation, research and <em>consulting data of the digitalized items</em> (lyrics, lyrics text, interviews, books, poems). KEY WORDS: open-source, islandora, repository, digital archive, cultural heritage JEL CODES: Z11</str>
<float name="score">1490.7888</float>
<arr name="pages">
<lst>
<str name="id">sequence_3</str>
<int name="width">2479</int>
<int name="height">3509</int>
</lst>
</arr>
<arr name="regions">
<lst>
<float name="ulx">0.119</float>
<float name="uly">0.613</float>
<float name="lrx">0.88</float>
<float name="lry">0.754</float>
<str name="text">repository, in a virtualized and redundant infrastructure, it also take care of the design, development and management of the web portal (front-end) for the presentation, research and <em>consulting data of the digitalized items</em> (lyrics, lyrics text, interviews, books, poems). KEY WORDS: open-source, islandora, repository, digital archive, cultural heritage JEL CODES: Z11</str>
<int name="pageIdx">0</int>
</lst>
</arr>
<arr name="highlights">
<arr>
<lst>
<int name="ulx">0</int>
<float name="uly">0.2269</float>
<float name="lrx">0.4099</float>
<float name="lry">0.3191</float>
<str name="text">consulting data of the digitalized items</str>
<int name="parentRegionIdx">0</int>
</lst>
</arr>
</arr>
</lst>
This tells me that the whitespace-handling from the OCR parser is correct for this file, since we find a match for the phrase.
Can you show how your index analysis pipeline is configured? I'm suspecting that this is probably related to how the tokenizer is configured.
I just noticed that the same schema works with 0.5.0, so this is really something that is in the plugin. Can you please provide a sample page for which you are certain that the problem is happening? E.g. a page where one of the terms/term sequences from your screenshot is occurring.
P.S.: If you're using MiniOCR to save on index space, you're leaving a few bytes on the table by not stripping the extraneous whitespace :-) The only whitespace that is needed is the one between the individual words, everything else is ignored anyway and takes up precious space. In your case, "minifying" the file would result in saving ~20% of the file size (7.1KiB vs 5.8KiB uncompressed). The practical impact is likely to be a lot smaller, though, since Lucene compresses segments with LZ4, but it's maybe something you might want to benchmark if space is of consideration for you.
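For illustration, a minimal Python sketch of such a minification step. It is an assumption-laden example, not part of the plugin: the function name `minify_miniocr` is hypothetical, the approach is regex-based (so it assumes no comments or CDATA sections in the markup), and it keeps a single space after every closing `</w>` so that words never fuse, even across line boundaries.

```python
import re


def minify_miniocr(markup: str) -> str:
    """Strip indentation/newlines between tags, keep one space after each word.

    Hypothetical helper for illustration only; not part of the plugin.
    """
    # Remove whitespace that sits purely between two tags (indentation, newlines).
    compact = re.sub(r">\s+<", "><", markup)
    # Put back exactly one space after every word so adjacent words stay separated.
    return compact.replace("</w>", "</w> ")


sample = """<ocr>
<p xml:id="sequence_3" wh="2479 3509">
<b>
<l>
<w x=".119 .045 .07 .011">Rapporto</w>
<w x=".195 .045 .06 .01">Tecnico,</w>
</l>
</b>
</p>
</ocr>"""

minified = minify_miniocr(sample)
print(minified)
print(len(sample), "->", len(minified), "characters")
```

The important property is that the space between the `<w>` elements survives; everything else between tags is dropped.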
@jbaiter thanks a lot, attached the page exactly as indexed in the screenshot above. Is this what you need? I will check the whole chain from page to miniOcr and report here, also @DiegoPino can add more details about this. Thanks !! pg_0003.pdf
@jbaiter When pdf searchable:
Thanks again
Sorry, I had a slight misunderstanding, I just noticed that the rapportotecnico comes from the MiniOCR you posted, sorry!
This is very odd behavior, the MiniOCR looks fine, and I don't have any problems with bad tokenization in the unit test.
Could you maybe post your index analysis chain after all? Maybe there's some weird interplay with the tokenizer you're using (the test on my end uses the StandardTokenizerFactory).
Do you mean this?:
<fieldType name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
<analyzer type="index">
<charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
And are you using
<tokenizer class="solr.StandardTokenizerFactory"/>
instead of
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
?
Yes, exactly, thank you! Any reason you're using the WhitespaceTokenizerFactory? This is usually only intended for highly structured content like keyword lists or similar things.
For natural language you'll probably want to use something else that is a bit smarter about things like punctuation.
@jbaiter Double thanks! I'll check it in the next few hours. I really don't remember why we are using WhitespaceTokenizerFactory; @DiegoPino could add more info about this. Anyway, we have to look more closely for the right tokenizer, e.g. I see I also have to remove punctuation.
If you use something like the StandardTokenizer, it will remove punctuation for you as part of the tokenization process :-)
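To make the difference concrete, here is a rough Python sketch that contrasts splitting on whitespace with a punctuation-aware split. Both functions are hypothetical stand-ins written for this example: the second one uses a simple `\w+` regex and is only an approximation, not Lucene's actual StandardTokenizer, which follows Unicode text segmentation rules.

```python
import re


def whitespace_tokens(text: str) -> list[str]:
    """Lowercase and split on whitespace only; punctuation stays attached."""
    return text.lower().split()


def punctuation_aware_tokens(text: str) -> list[str]:
    """Lowercase and keep only runs of word characters.

    Crude approximation for illustration; NOT Lucene's StandardTokenizer.
    """
    return re.findall(r"\w+", text.lower())


line = "Rapporto Tecnico, numero 3, Agosto 2016"
print(whitespace_tokens(line))
# ['rapporto', 'tecnico,', 'numero', '3,', 'agosto', '2016']
print(punctuation_aware_tokens(line))
# ['rapporto', 'tecnico', 'numero', '3', 'agosto', '2016']
```

With whitespace-only splitting, a query for "tecnico" would not match the indexed term "tecnico," unless a later filter removes the comma.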
@jbaiter I checked, but with StandardTokenizer I have the same issue (plugin 0.6.0). Could it depend on how the MiniOCR is formatted? Any idea what else to check? Thanks.
I just found the issue, I mainly tested the new parser with external OCR sources, but in your case you're loading the OCR from the index itself! Will investigate and get back to you as soon as I've found a fix :-)
Nope :-(
Sorry, I was on the wrong trail this morning, it does not have to do with the external/stored state after all :-/ Could you do me a favor and paste the exact string value that you get back when you retrieve the "numero 3" document from the index? I.e. the one you get from GET /solr/<collection>/select?q=id:<id>&fl=text_ocr_stored
@jbaiter I switched back to 0.5.0, does it matter?
I can extract both if needed
No, it shouldn't matter :-) Since you're storing the OCR in the index, the actual stored value is just whatever you posted to the collection when you indexed the document. The plugin version only plays a role afterwards, when the plugin indexes the OCR or highlights it. I want to make sure that the actual OCR that is stored in the index doesn't have any whitespace issues.
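A small Python sketch of how such a request URL could be built. The host, collection name, document id and the assumption that the unique key field is called `id` are all placeholders for this example; `stored_ocr_url` is a hypothetical helper, and only the standard `q`, `fl` and `wt` parameters of Solr's select handler are used.

```python
from urllib.parse import urlencode


def stored_ocr_url(base_url: str, collection: str, doc_id: str, field: str) -> str:
    """Build a select URL that returns only the stored OCR field of one document.

    Assumes the unique key field is named "id" (placeholder assumption).
    """
    params = urlencode({"q": f"id:{doc_id}", "fl": field, "wt": "json"})
    return f"{base_url}/solr/{collection}/select?{params}"


# All values below are placeholders.
print(stored_ocr_url("http://localhost:8983", "mycollection", "doc-1", "text_ocr_stored"))
```

The stored value returned by such a request is the raw string that was posted at index time, which is what needs to be inspected for missing whitespace.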
Here:
{
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"tcocr_highlightm_X3b_und_ocr_text":["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ocr><p xml:id=\"sequence_3\" wh=\"2479 3509\"><b><l><w x=\".119 .045 .07 .011\">Rapporto</w><w x=\".195 .045 .06 .01\">Tecnico,</w><w x=\".262 .048 .056 .006\">numero</w><w x=\".323 .045 .011 .011\">3,</w><w x=\".34 .045 .052 .011\">Agosto</w><w x=\".397 .044 .037 .009\">2016</w></l><l><w x=\".134 .142 .106 .02\">FABB</w><w x=\".255 .142 .182 .027\">Repository</w><w x=\".45 .142 .049 .021\">dal</w><w x=\".511 .145 .138 .024\">progetto</w><w x=\".663 .142 .028 .021\">al</w><w x=\".702 .142 .161 .027\">prototipo.</w></l><l><w x=\".124 .177 .111 .02\">Nuove</w><w x=\".246 .176 .099 .021\">forme</w><w x=\".358 .176 .03 .021\">di</w><w x=\".401 .176 .247 .025\">conservazione,</w><w x=\".662 .176 .213 .021\">condivisione</w></l><l><w x=\".226 .217 .017 .014\">e</w><w x=\".255 .21 .243 .021\">valorizzazione</w><w x=\".51 .21 .03 .021\">di</w><w x=\".553 .217 .091 .02\">opere</w><w x=\".657 .21 .117 .027\">digitali</w></l><l><w x=\".263 .323 .095 .012\">Giancarlo</w><w x=\".365 .323 .069 .014\">Birello,</w><w x=\".442 .324 .052 .011\">Ivano</w><w x=\".502 .323 .063 .014\">Fucile,</w><w x=\".575 .323 .058 .012\">Valter</w><w x=\".639 .323 .097 .011\">Giovanetti</w></l><l><w x=\".411 .349 .093 .01\">Ircres-CNR</w><w x=\".512 .349 .053 .013\">Ufficio</w><w x=\".571 .349 .02 .009\">IT</w></l><l><w x=\".418 .365 .046 .009\">Strada</w><w x=\".47 .365 .035 .009\">delle</w><w x=\".51 .365 .048 .011\">Cacce,</w><w x=\".565 .365 .017 .009\">73</w></l><l><w x=\".432 .38 .043 .009\">10135</w><w x=\".481 .38 .05 .009\">Torino</w><w x=\".537 .38 .033 .012\">Italy</w></l><l><w x=\".44 .426 .05 .011\">Anna</w><w x=\".497 .426 .062 .011\">Perin*</w></l><l><w x=\".409 .452 .093 .01\">Ircres-CNR</w><w x=\".508 .451 .083 .01\">Biblioteca</w></l><l><w x=\".42 .468 .026 .009\">Via</w><w x=\".451 .468 .033 .009\">Real</w><w x=\".49 .468 .067 .012\">Collegio,</w><w x=\".563 .468 .017 .009\">30</w></l><l><w x=\".4 
.483 .044 .009\">10024</w><w x=\".449 .483 .08 .009\">Moncalieri</w><w x=\".535 .483 .024 .009\">TO</w><w x=\".564 .483 .033 .012\">Italy</w></l><l><w x=\".119 .563 .109 .01\">ABSTRACT:</w><w x=\".236 .563 .051 .01\">FABB</w><w x=\".293 .563 .056 .013\">project</w><w x=\".355 .563 .066 .012\">(Famine</w><w x=\".428 .563 .028 .01\">and</w><w x=\".462 .564 .046 .011\">Feast,</w><w x=\".515 .564 .044 .01\">Fame</w><w x=\".565 .567 .008 .007\">e</w><w x=\".58 .563 .107 .012\">Abbondanza)</w><w x=\".693 .563 .026 .01\">has</w><w x=\".726 .563 .038 .01\">been</w><w x=\".769 .563 .086 .01\">committed</w><w x=\".861 .563 .02 .013\">by</w></l><l><w x=\".119 .58 .093 .01\">Fondazione</w><w x=\".221 .58 .042 .01\">CRT.</w><w x=\".273 .58 .034 .01\">This</w><w x=\".317 .58 .072 .01\">technical</w><w x=\".397 .581 .048 .011\">report</w><w x=\".453 .58 .067 .013\">analyzes</w><w x=\".53 .58 .024 .01\">the</w><w x=\".563 .58 .074 .013\">strategies</w><w x=\".647 .58 .063 .013\">adopted</w><w x=\".718 .58 .029 .01\">and</w><w x=\".755 .58 .024 .01\">the</w><w x=\".787 .58 .04 .01\">main</w><w x=\".835 .583 .045 .01\">open-</w></l><l><w x=\".12 .599 .051 .007\">source</w><w x=\".182 .596 .068 .01\">software</w><w x=\".259 .596 .041 .01\">used.</w><w x=\".311 .596 .093 .01\">Ircres-CNR</w><w x=\".414 .596 .026 .01\">has</w><w x=\".451 .596 .073 .013\">deployed</w><w x=\".534 .596 .024 .01\">the</w><w x=\".569 .596 .068 .01\">software</w><w x=\".647 .596 .028 .01\">and</w><w x=\".686 .599 .048 .007\">server</w><w x=\".743 .596 .076 .013\">platforms</w><w x=\".83 .596 .017 .01\">of</w><w x=\".856 .596 .024 .01\">the</w></l><l><w x=\".119 .613 .086 .013\">repository,</w><w x=\".216 .613 .015 .01\">in</w><w x=\".242 .616 .008 .007\">a</w><w x=\".261 .613 .086 .01\">virtualized</w><w x=\".358 .613 .028 .01\">and</w><w x=\".396 .613 .08 .01\">redundant</w><w x=\".487 .613 .113 .012\">infrastructure,</w><w x=\".611 .613 .011 .01\">it</w><w x=\".633 .613 .031 .01\">also</w><w x=\".675 .613 
.033 .01\">take</w><w x=\".718 .616 .033 .007\">care</w><w x=\".762 .613 .017 .01\">of</w><w x=\".789 .613 .024 .01\">the</w><w x=\".823 .613 .056 .013\">design,</w></l><l><w x=\".119 .629 .104 .013\">development</w><w x=\".23 .629 .028 .01\">and</w><w x=\".264 .63 .103 .011\">management</w><w x=\".372 .629 .017 .01\">of</w><w x=\".395 .629 .024 .01\">the</w><w x=\".425 .629 .032 .01\">web</w><w x=\".464 .629 .046 .013\">portal</w><w x=\".517 .629 .087 .012\">(front-end)</w><w x=\".611 .629 .023 .01\">for</w><w x=\".64 .629 .024 .01\">the</w><w x=\".67 .629 .102 .013\">presentation,</w><w x=\".779 .629 .067 .01\">research</w><w x=\".852 .629 .028 .01\">and</w></l><l><w x=\".119 .645 .084 .013\">consulting</w><w x=\".209 .645 .033 .01\">data</w><w x=\".247 .645 .017 .01\">of</w><w x=\".269 .645 .024 .01\">the</w><w x=\".299 .645 .085 .013\">digitalized</w><w x=\".389 .645 .042 .01\">items</w><w x=\".438 .645 .054 .013\">(lyrics,</w><w x=\".499 .645 .044 .013\">lyrics</w><w x=\".549 .647 .034 .01\">text,</w><w x=\".589 .645 .088 .012\">interviews,</w><w x=\".683 .645 .052 .012\">books,</w><w x=\".741 .645 .063 .013\">poems).</w></l><l><w x=\".12 .695 .04 .009\">KEY</w><w x=\".165 .694 .077 .01\">WORDS:</w><w x=\".25 .698 .102 .01\">open-source,</w><w x=\".358 .694 .078 .012\">islandora,</w><w x=\".442 .694 .086 .013\">repository,</w><w x=\".534 .694 .05 .013\">digital</w><w x=\".591 .694 .063 .012\">archive,</w><w x=\".66 .694 .061 .01\">cultural</w><w x=\".726 .694 .064 .013\">heritage</w></l><l><w x=\".119 .744 .033 .01\">JEL</w><w x=\".157 .744 .069 .01\">CODES:</w><w x=\".234 .744 .03 .01\">Z11</w></l><l><w x=\".119 .864 .202 .001\">____________________</w></l><l><w x=\".12 .886 .119 .013\">*Corresponding</w><w x=\".244 .887 .05 .009\">author:</w><w x=\".302 .887 .178 .012\">anna.perin@ircres.cnr.it</w></l></b></p></ocr>"]}]
}}
There you go, the OCR that you feed to the index does not have any whitespace between the words!
The plugin relies on the whitespace in the OCR when parsing it, i.e. <w ...>hello</w><w ...>world</w> will parse to helloworld. Make sure you don't throw away the whitespace between the ocrx_word spans that you get back from djvu2hocr.
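A quick pre-indexing sanity check along these lines can be sketched in Python. `fused_word_count` is a hypothetical helper, not something the plugin provides; it simply counts `<w>` elements that directly follow a closing `</w>` with nothing in between.

```python
import re

# A closing </w> immediately followed by the next opening <w ...> tag.
FUSED = re.compile(r"</w><w[\s>]")


def fused_word_count(markup: str) -> int:
    """Count adjacent <w> elements that have no whitespace between them."""
    return len(FUSED.findall(markup))


bad = '<l><w x="1">hello</w><w x="2">world</w></l>'
good = '<l><w x="1">hello</w> <w x="2">world</w></l>'

print(fused_word_count(bad))   # 1
print(fused_word_count(good))  # 0
```

Running a check like this on the generated MiniOCR before posting it to Solr would flag documents whose words are going to be indexed as one fused token.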
@jbaiter One last question (I hope): why does that happen with 0.6.0 and not with 0.5.0? Anyway, thanks a lot.
Good question! The 0.5.0 code wrapped Lucene's HTMLStripCharFilter. This filter outputs a lot of extra whitespace/newlines between node texts.
For example, this is what your whitespace-less document looked like after being run through the HTMLStripCharFilter:
The new parser only outputs whatever whitespace there is in the input document (and normalizes runs of consecutive spaces to a single space character to deal with indentation). If there is no whitespace in the input document, the parsed text will not have any whitespace either.
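The difference between the two behaviors can be imitated with a short Python sketch. Both functions are simplified emulations written for this example and are not the plugin's or Lucene's code: the first assumes the older filter effectively left whitespace where each tag used to be, the second drops tags and collapses runs of whitespace to a single space.

```python
import re

TAG = re.compile(r"<[^>]+>")


def extract_padding_tags(markup: str) -> str:
    """Emulates the older behavior: every tag leaves whitespace behind (assumption)."""
    return TAG.sub(" ", markup)


def extract_preserving_input(markup: str) -> str:
    """Emulates the newer behavior: drop tags, collapse whitespace runs to one space."""
    return re.sub(r"\s+", " ", TAG.sub("", markup))


no_space = '<l><w x="a">numero</w><w x="b">3,</w></l>'
with_space = '<l>\n  <w x="a">numero</w>\n  <w x="b">3,</w>\n</l>'

print(extract_padding_tags(no_space).split())        # ['numero', '3,']
print(extract_preserving_input(no_space).split())    # ['numero3,']
print(extract_preserving_input(with_space).split())  # ['numero', '3,']
```

Under these assumptions the same whitespace-less input yields two tokens with the tag-padding approach but a single fused token when only the input's own whitespace is kept.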
Thanks a lot for your time on this, have a nice evening!!! Keep in mind, there is a really good bottle of wine waiting for you here for when you come to Italy!!
@jbaiter I was trying to compile your plugin from main (resulting in a 0.6.0-SNAPSHOT) and installed it on Solr 8.8.1. I found that words are indexed without spaces between them. I switched back to 0.5.0 without changing anything and the indexing is correct again. Do you have any notes about this? Did I miss some new configuration parameters? Thanks for your fantastic plugin.