dbmdz / solr-ocrhighlighting

Highlighting various OCR formats directly in Solr
https://dbmdz.github.io/solr-ocrhighlighting
MIT License

No blank between words with 0.6.0 compiled from main #147

Closed giancarlobi closed 3 years ago

giancarlobi commented 3 years ago

@jbaiter I was trying to compile your plugin from main (resulting in a 0.6.0-SNAPSHOT) and installed it on Solr 8.8.1. I found that words are indexed without spaces between them, as in this screenshot (image). I switched back to 0.5.0 without changing anything and the indexing is correct again (image). Do you have any notes about this? Did I miss some new configuration parameters? Thanks for your fantastic plugin.

jbaiter commented 3 years ago

Thank you for the bug report! Can you provide a sample page of your OCR? The way the markup is parsed has changed with the new version: we now use a proper XML parser instead of the previous state machine approach, so it's likely that I missed something.

giancarlobi commented 3 years ago

@jbaiter This is the content of the field in Solr after ingestion. Is this what you need?

<?xml version="1.0" encoding="UTF-8"?>
<ocr>
   <p xml:id="sequence_3" wh="2479 3509">
      <b>
         <l>
            <w x=".119 .045 .07 .011">Rapporto</w>
            <w x=".195 .045 .06 .01">Tecnico,</w>
            <w x=".262 .048 .056 .006">numero</w>
            <w x=".323 .045 .011 .011">3,</w>
            <w x=".34 .045 .052 .011">Agosto</w>
            <w x=".397 .044 .037 .009">2016</w>
         </l>
         <l>
            <w x=".134 .142 .106 .02">FABB</w>
            <w x=".255 .142 .182 .027">Repository</w>
            <w x=".45 .142 .049 .021">dal</w>
            <w x=".511 .145 .138 .024">progetto</w>
            <w x=".663 .142 .028 .021">al</w>
            <w x=".702 .142 .161 .027">prototipo.</w>
         </l>
         <l>
            <w x=".124 .177 .111 .02">Nuove</w>
            <w x=".246 .176 .099 .021">forme</w>
            <w x=".358 .176 .03 .021">di</w>
            <w x=".401 .176 .247 .025">conservazione,</w>
            <w x=".662 .176 .213 .021">condivisione</w>
         </l>
         <l>
            <w x=".226 .217 .017 .014">e</w>
            <w x=".255 .21 .243 .021">valorizzazione</w>
            <w x=".51 .21 .03 .021">di</w>
            <w x=".553 .217 .091 .02">opere</w>
            <w x=".657 .21 .117 .027">digitali</w>
         </l>
         <l>
            <w x=".263 .323 .095 .012">Giancarlo</w>
            <w x=".365 .323 .069 .014">Birello,</w>
            <w x=".442 .324 .052 .011">Ivano</w>
            <w x=".502 .323 .063 .014">Fucile,</w>
            <w x=".575 .323 .058 .012">Valter</w>
            <w x=".639 .323 .097 .011">Giovanetti</w>
         </l>
         <l>
            <w x=".411 .349 .093 .01">Ircres-CNR</w>
            <w x=".512 .349 .053 .013">Ufficio</w>
            <w x=".571 .349 .02 .009">IT</w>
         </l>
         <l>
            <w x=".418 .365 .046 .009">Strada</w>
            <w x=".47 .365 .035 .009">delle</w>
            <w x=".51 .365 .048 .011">Cacce,</w>
            <w x=".565 .365 .017 .009">73</w>
         </l>
         <l>
            <w x=".432 .38 .043 .009">10135</w>
            <w x=".481 .38 .05 .009">Torino</w>
            <w x=".537 .38 .033 .012">Italy</w>
         </l>
         <l>
            <w x=".44 .426 .05 .011">Anna</w>
            <w x=".497 .426 .062 .011">Perin*</w>
         </l>
         <l>
            <w x=".409 .452 .093 .01">Ircres-CNR</w>
            <w x=".508 .451 .083 .01">Biblioteca</w>
         </l>
         <l>
            <w x=".42 .468 .026 .009">Via</w>
            <w x=".451 .468 .033 .009">Real</w>
            <w x=".49 .468 .067 .012">Collegio,</w>
            <w x=".563 .468 .017 .009">30</w>
         </l>
         <l>
            <w x=".4 .483 .044 .009">10024</w>
            <w x=".449 .483 .08 .009">Moncalieri</w>
            <w x=".535 .483 .024 .009">TO</w>
            <w x=".564 .483 .033 .012">Italy</w>
         </l>
         <l>
            <w x=".119 .563 .109 .01">ABSTRACT:</w>
            <w x=".236 .563 .051 .01">FABB</w>
            <w x=".293 .563 .056 .013">project</w>
            <w x=".355 .563 .066 .012">(Famine</w>
            <w x=".428 .563 .028 .01">and</w>
            <w x=".462 .564 .046 .011">Feast,</w>
            <w x=".515 .564 .044 .01">Fame</w>
            <w x=".565 .567 .008 .007">e</w>
            <w x=".58 .563 .107 .012">Abbondanza)</w>
            <w x=".693 .563 .026 .01">has</w>
            <w x=".726 .563 .038 .01">been</w>
            <w x=".769 .563 .086 .01">committed</w>
            <w x=".861 .563 .02 .013">by</w>
         </l>
         <l>
            <w x=".119 .58 .093 .01">Fondazione</w>
            <w x=".221 .58 .042 .01">CRT.</w>
            <w x=".273 .58 .034 .01">This</w>
            <w x=".317 .58 .072 .01">technical</w>
            <w x=".397 .581 .048 .011">report</w>
            <w x=".453 .58 .067 .013">analyzes</w>
            <w x=".53 .58 .024 .01">the</w>
            <w x=".563 .58 .074 .013">strategies</w>
            <w x=".647 .58 .063 .013">adopted</w>
            <w x=".718 .58 .029 .01">and</w>
            <w x=".755 .58 .024 .01">the</w>
            <w x=".787 .58 .04 .01">main</w>
            <w x=".835 .583 .045 .01">open-</w>
         </l>
         <l>
            <w x=".12 .599 .051 .007">source</w>
            <w x=".182 .596 .068 .01">software</w>
            <w x=".259 .596 .041 .01">used.</w>
            <w x=".311 .596 .093 .01">Ircres-CNR</w>
            <w x=".414 .596 .026 .01">has</w>
            <w x=".451 .596 .073 .013">deployed</w>
            <w x=".534 .596 .024 .01">the</w>
            <w x=".569 .596 .068 .01">software</w>
            <w x=".647 .596 .028 .01">and</w>
            <w x=".686 .599 .048 .007">server</w>
            <w x=".743 .596 .076 .013">platforms</w>
            <w x=".83 .596 .017 .01">of</w>
            <w x=".856 .596 .024 .01">the</w>
         </l>
         <l>
            <w x=".119 .613 .086 .013">repository,</w>
            <w x=".216 .613 .015 .01">in</w>
            <w x=".242 .616 .008 .007">a</w>
            <w x=".261 .613 .086 .01">virtualized</w>
            <w x=".358 .613 .028 .01">and</w>
            <w x=".396 .613 .08 .01">redundant</w>
            <w x=".487 .613 .113 .012">infrastructure,</w>
            <w x=".611 .613 .011 .01">it</w>
            <w x=".633 .613 .031 .01">also</w>
            <w x=".675 .613 .033 .01">take</w>
            <w x=".718 .616 .033 .007">care</w>
            <w x=".762 .613 .017 .01">of</w>
            <w x=".789 .613 .024 .01">the</w>
            <w x=".823 .613 .056 .013">design,</w>
         </l>
         <l>
            <w x=".119 .629 .104 .013">development</w>
            <w x=".23 .629 .028 .01">and</w>
            <w x=".264 .63 .103 .011">management</w>
            <w x=".372 .629 .017 .01">of</w>
            <w x=".395 .629 .024 .01">the</w>
            <w x=".425 .629 .032 .01">web</w>
            <w x=".464 .629 .046 .013">portal</w>
            <w x=".517 .629 .087 .012">(front-end)</w>
            <w x=".611 .629 .023 .01">for</w>
            <w x=".64 .629 .024 .01">the</w>
            <w x=".67 .629 .102 .013">presentation,</w>
            <w x=".779 .629 .067 .01">research</w>
            <w x=".852 .629 .028 .01">and</w>
         </l>
         <l>
            <w x=".119 .645 .084 .013">consulting</w>
            <w x=".209 .645 .033 .01">data</w>
            <w x=".247 .645 .017 .01">of</w>
            <w x=".269 .645 .024 .01">the</w>
            <w x=".299 .645 .085 .013">digitalized</w>
            <w x=".389 .645 .042 .01">items</w>
            <w x=".438 .645 .054 .013">(lyrics,</w>
            <w x=".499 .645 .044 .013">lyrics</w>
            <w x=".549 .647 .034 .01">text,</w>
            <w x=".589 .645 .088 .012">interviews,</w>
            <w x=".683 .645 .052 .012">books,</w>
            <w x=".741 .645 .063 .013">poems).</w>
         </l>
         <l>
            <w x=".12 .695 .04 .009">KEY</w>
            <w x=".165 .694 .077 .01">WORDS:</w>
            <w x=".25 .698 .102 .01">open-source,</w>
            <w x=".358 .694 .078 .012">islandora,</w>
            <w x=".442 .694 .086 .013">repository,</w>
            <w x=".534 .694 .05 .013">digital</w>
            <w x=".591 .694 .063 .012">archive,</w>
            <w x=".66 .694 .061 .01">cultural</w>
            <w x=".726 .694 .064 .013">heritage</w>
         </l>
         <l>
            <w x=".119 .744 .033 .01">JEL</w>
            <w x=".157 .744 .069 .01">CODES:</w>
            <w x=".234 .744 .03 .01">Z11</w>
         </l>
         <l>
            <w x=".119 .864 .202 .001">____________________</w>
         </l>
         <l>
            <w x=".12 .886 .119 .013">*Corresponding</w>
            <w x=".244 .887 .05 .009">author:</w>
            <w x=".302 .887 .178 .012">anna.perin@ircres.cnr.it</w>
         </l>
      </b>
   </p>
</ocr>

Thanks !!!

jbaiter commented 3 years ago

Thank you, that helps a lot :-) It's likely a bug in the way implicit whitespace is handled when dealing with MiniOCR, will provide a fix tomorrow!

giancarlobi commented 3 years ago

@jbaiter Great! so quick, I'm available to check the fix in our production deployment! Thanks.

DiegoPino commented 3 years ago

@jbaiter thanks so much. You are just awesome 🥇

jbaiter commented 3 years ago

So I just built a testcase with the provided page, and for some reason I can't seem to reproduce the problem. For example, here's the snippet I get for the query "consulting data of the digitized items":

<lst>
          <str name="text">repository, in a virtualized and redundant infrastructure, it also take care of the design, development and management of the web portal (front-end) for the presentation, research and &lt;em&gt;consulting data of the digitalized items&lt;/em&gt; (lyrics, lyrics text, interviews, books, poems). KEY WORDS: open-source, islandora, repository, digital archive, cultural heritage JEL CODES: Z11</str>
          <float name="score">1490.7888</float>
          <arr name="pages">
            <lst>
              <str name="id">sequence_3</str>
              <int name="width">2479</int>
              <int name="height">3509</int>
            </lst>
          </arr>
          <arr name="regions">
            <lst>
              <float name="ulx">0.119</float>
              <float name="uly">0.613</float>
              <float name="lrx">0.88</float>
              <float name="lry">0.754</float>
              <str name="text">repository, in a virtualized and redundant infrastructure, it also take care of the design, development and management of the web portal (front-end) for the presentation, research and &lt;em&gt;consulting data of the digitalized items&lt;/em&gt; (lyrics, lyrics text, interviews, books, poems). KEY WORDS: open-source, islandora, repository, digital archive, cultural heritage JEL CODES: Z11</str>
              <int name="pageIdx">0</int>
            </lst>
          </arr>
          <arr name="highlights">
            <arr>
              <lst>
                <int name="ulx">0</int>
                <float name="uly">0.2269</float>
                <float name="lrx">0.4099</float>
                <float name="lry">0.3191</float>
                <str name="text">consulting data of the digitalized items</str>
                <int name="parentRegionIdx">0</int>
              </lst>
            </arr>
          </arr>
        </lst>

This tells me that the whitespace-handling from the OCR parser is correct for this file, since we find a match for the phrase.

Can you show how your index analysis pipeline is configured? I'm suspecting that this is probably related to how the tokenizer is configured.

I just noticed that the same schema works with 0.5.0, so this is really something that is in the plugin. Can you please provide a sample page for which you are certain that the problem is happening? E.g. a page where one of the terms/term sequences from your screenshot is occurring.

P.S.: If you're using MiniOCR to save on index space, you're leaving a few bytes on the table by not stripping the extraneous whitespace :-) The only whitespace that is needed is the one between the individual words, everything else is ignored anyway and takes up precious space. In your case, "minifying" the file would result in saving ~20% of the file size (7.1KiB vs 5.8KiB uncompressed). The practical impact is likely to be a lot smaller, though, since Lucene compresses segments with LZ4, but it's maybe something you might want to benchmark if space is of consideration for you.
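The minification described in the P.S. could be sketched roughly as follows. This is only an illustration in Python, not part of the plugin: the function name `minify_miniocr` is made up, and it conservatively keeps one space after every closing `</w>` (including the last word of a line), since the thread does not say how line boundaries are treated.

```python
import re

# Whitespace that sits directly between two tags, e.g. the indentation
# between "</w>" and the next "<w ...>".
_INTER_TAG_WS = re.compile(r"(?<=>)\s+(?=<)")


def minify_miniocr(markup: str) -> str:
    """Drop indentation but keep one space after every closing </w> tag.

    Hypothetical helper for illustration only.
    """

    def keep_or_drop(match):
        # Keep a single space if the gap follows a word, drop it otherwise.
        follows_word = markup.endswith("</w>", 0, match.start())
        return " " if follows_word else ""

    return _INTER_TAG_WS.sub(keep_or_drop, markup).strip()


pretty = """<ocr>
  <l>
    <w x=".1 .1 .1 .1">Rapporto</w>
    <w x=".2 .1 .1 .1">Tecnico,</w>
  </l>
</ocr>"""
print(minify_miniocr(pretty))
```

The important property is that the gap between two words never shrinks to nothing; only indentation around structural tags is removed.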

giancarlobi commented 3 years ago

@jbaiter thanks a lot, attached the page exactly as indexed in the screenshot above. Is this what you need? I will check the whole chain from page to miniOcr and report here, also @DiegoPino can add more details about this. Thanks !! pg_0003.pdf

giancarlobi commented 3 years ago

@jbaiter When the PDF is searchable:

  1. we use djvu2hocr to convert single page to hocr
  2. as djvu2hocr output uses ocrx_line while tesseract uses ocr_line, we replace ocrx_line with ocr_line
  3. finally calling this function to convert to miniOCR: https://github.com/esmero/strawberry_runners/blob/9e6fc38e0f4c48e6a84bef5214b519f776b877b6/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php#L434

Thanks again

jbaiter commented 3 years ago

Sorry, I had a slight misunderstanding, I just noticed that the rapportotecnico comes from the MiniOCR you posted, sorry! This is very odd behavior, the MiniOCR looks fine, and I don't have any problems with bad tokenization in the unit test. Could you maybe post your index analysis chain after all? Maybe there's some weird interplay with the tokenizer you're using (the test on my end uses the StandardTokenizerFactory)

giancarlobi commented 3 years ago

Do you mean this?:

<fieldType name="text_ocr_stored" class="solr.TextField" storeOffsetsWithPositions="true" termVectors="true">
  <analyzer type="index">
    <charFilter class="de.digitalcollections.solrocr.lucene.filters.OcrCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_und.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

giancarlobi commented 3 years ago

And are you using <tokenizer class="solr.StandardTokenizerFactory"/> instead of <tokenizer class="solr.WhitespaceTokenizerFactory"/> ?

jbaiter commented 3 years ago

Yes, exactly, thank you! Any reason you're using the WhitespaceTokenizerFactory? This is usually only intended for highly structured content like keyword lists or similar things. For natural language you'll probably want to use something else that is a bit smarter about things like punctuation.

giancarlobi commented 3 years ago

@jbaiter Double thanks! I'll check it in the next few hours. I really don't remember why we are using WhitespaceTokenizerFactory, @DiegoPino could add more info about this. Anyway, we have to look more closely at which tokenizer is the right one, e.g. I see I also have to remove punctuation.

jbaiter commented 3 years ago

If you use something like the StandardTokenizer, it will remove punctuation for you as part of the tokenization process :-)
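To make the difference concrete, here is a toy comparison in Python. The two functions are crude stand-ins, not the Lucene implementations: `str.split` plays the role of a whitespace tokenizer and a `\w+` regex only approximates a tokenizer that drops punctuation (the real StandardTokenizer follows Unicode text segmentation rules and behaves differently on things like e-mail addresses).

```python
import re


def whitespace_tokens(text: str) -> list:
    """Split on whitespace only; punctuation stays attached to the words."""
    return text.split()


def word_tokens(text: str) -> list:
    """Rough approximation of a punctuation-dropping tokenizer."""
    return re.findall(r"\w+", text)


spaced = "Rapporto Tecnico, numero 3"
glued = "RapportoTecnico,numero3"

print(whitespace_tokens(spaced))  # punctuation kept on "Tecnico,"
print(word_tokens(spaced))        # punctuation removed
print(word_tokens(glued))         # words without a separator stay merged
```

Note that neither approach can split two words that were joined without any separator, so changing the tokenizer alone would not be expected to repair text that reaches it already run together.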

giancarlobi commented 3 years ago

@jbaiter I checked, but I have the same issue with StandardTokenizer too (plugin 0.6.0, see image). Could it depend on how the MiniOCR is formatted? Any idea what else to check? Thanks.

jbaiter commented 3 years ago

I just found the issue, I mainly tested the new parser with external OCR sources, but in your case you're loading the OCR from the index itself! Will investigate and get back to you as soon as I've found a fix :-)

Nope :-(

jbaiter commented 3 years ago

Sorry, I was on the wrong track this morning, it does not have to do with the external/stored state after all :-/ Could you do me a favor and paste the exact string value that you get back when you retrieve the "numero 3" document from the index? I.e. the one you get from GET /solr/<collection>/select?q=id:<id>&fl=text_ocr_stored
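For anyone following along, a request like this can be assembled with a small Python sketch. Everything here is an assumption for illustration: the helper name `build_select_url`, the base URL, the collection name, and the premise that the unique key field is called `id`. It only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode


def build_select_url(base_url: str, collection: str, doc_id: str, field: str) -> str:
    """Build a Solr select URL that returns one stored field of one document.

    Hypothetical helper; assumes the unique key field is named "id".
    """
    params = {
        "q": f'id:"{doc_id}"',  # quote the id so special characters are safe
        "fl": field,            # only return the field we are interested in
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


url = build_select_url("http://localhost:8983", "mycollection", "doc1", "text_ocr_stored")
print(url)
```

The stored value returned by such a query is exactly what was posted at index time, which is why it is useful for checking the whitespace in the OCR markup.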

giancarlobi commented 3 years ago

@jbaiter I switched back to 0.5.0, does it matter?

giancarlobi commented 3 years ago

I can extract both if needed

jbaiter commented 3 years ago

No, it shouldn't matter :-) Since you're storing the OCR in the index, the actual stored value is just whatever you posted to the collection when you indexed the document. The plugin version only plays a role afterwards, when the plugin indexes the OCR or highlights it. I want to make sure that the actual OCR that is stored in the index doesn't have any whitespace issues.

giancarlobi commented 3 years ago

Here:

{
  "response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
      {
        "tcocr_highlightm_X3b_und_ocr_text":["<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<ocr><p xml:id=\"sequence_3\" wh=\"2479 3509\"><b><l><w x=\".119 .045 .07 .011\">Rapporto</w><w x=\".195 .045 .06 .01\">Tecnico,</w><w x=\".262 .048 .056 .006\">numero</w><w x=\".323 .045 .011 .011\">3,</w><w x=\".34 .045 .052 .011\">Agosto</w><w x=\".397 .044 .037 .009\">2016</w></l><l><w x=\".134 .142 .106 .02\">FABB</w><w x=\".255 .142 .182 .027\">Repository</w><w x=\".45 .142 .049 .021\">dal</w><w x=\".511 .145 .138 .024\">progetto</w><w x=\".663 .142 .028 .021\">al</w><w x=\".702 .142 .161 .027\">prototipo.</w></l><l><w x=\".124 .177 .111 .02\">Nuove</w><w x=\".246 .176 .099 .021\">forme</w><w x=\".358 .176 .03 .021\">di</w><w x=\".401 .176 .247 .025\">conservazione,</w><w x=\".662 .176 .213 .021\">condivisione</w></l><l><w x=\".226 .217 .017 .014\">e</w><w x=\".255 .21 .243 .021\">valorizzazione</w><w x=\".51 .21 .03 .021\">di</w><w x=\".553 .217 .091 .02\">opere</w><w x=\".657 .21 .117 .027\">digitali</w></l><l><w x=\".263 .323 .095 .012\">Giancarlo</w><w x=\".365 .323 .069 .014\">Birello,</w><w x=\".442 .324 .052 .011\">Ivano</w><w x=\".502 .323 .063 .014\">Fucile,</w><w x=\".575 .323 .058 .012\">Valter</w><w x=\".639 .323 .097 .011\">Giovanetti</w></l><l><w x=\".411 .349 .093 .01\">Ircres-CNR</w><w x=\".512 .349 .053 .013\">Ufficio</w><w x=\".571 .349 .02 .009\">IT</w></l><l><w x=\".418 .365 .046 .009\">Strada</w><w x=\".47 .365 .035 .009\">delle</w><w x=\".51 .365 .048 .011\">Cacce,</w><w x=\".565 .365 .017 .009\">73</w></l><l><w x=\".432 .38 .043 .009\">10135</w><w x=\".481 .38 .05 .009\">Torino</w><w x=\".537 .38 .033 .012\">Italy</w></l><l><w x=\".44 .426 .05 .011\">Anna</w><w x=\".497 .426 .062 .011\">Perin*</w></l><l><w x=\".409 .452 .093 .01\">Ircres-CNR</w><w x=\".508 .451 .083 .01\">Biblioteca</w></l><l><w x=\".42 .468 .026 .009\">Via</w><w x=\".451 .468 .033 .009\">Real</w><w x=\".49 .468 .067 .012\">Collegio,</w><w x=\".563 .468 .017 .009\">30</w></l><l><w 
x=\".4 .483 .044 .009\">10024</w><w x=\".449 .483 .08 .009\">Moncalieri</w><w x=\".535 .483 .024 .009\">TO</w><w x=\".564 .483 .033 .012\">Italy</w></l><l><w x=\".119 .563 .109 .01\">ABSTRACT:</w><w x=\".236 .563 .051 .01\">FABB</w><w x=\".293 .563 .056 .013\">project</w><w x=\".355 .563 .066 .012\">(Famine</w><w x=\".428 .563 .028 .01\">and</w><w x=\".462 .564 .046 .011\">Feast,</w><w x=\".515 .564 .044 .01\">Fame</w><w x=\".565 .567 .008 .007\">e</w><w x=\".58 .563 .107 .012\">Abbondanza)</w><w x=\".693 .563 .026 .01\">has</w><w x=\".726 .563 .038 .01\">been</w><w x=\".769 .563 .086 .01\">committed</w><w x=\".861 .563 .02 .013\">by</w></l><l><w x=\".119 .58 .093 .01\">Fondazione</w><w x=\".221 .58 .042 .01\">CRT.</w><w x=\".273 .58 .034 .01\">This</w><w x=\".317 .58 .072 .01\">technical</w><w x=\".397 .581 .048 .011\">report</w><w x=\".453 .58 .067 .013\">analyzes</w><w x=\".53 .58 .024 .01\">the</w><w x=\".563 .58 .074 .013\">strategies</w><w x=\".647 .58 .063 .013\">adopted</w><w x=\".718 .58 .029 .01\">and</w><w x=\".755 .58 .024 .01\">the</w><w x=\".787 .58 .04 .01\">main</w><w x=\".835 .583 .045 .01\">open-</w></l><l><w x=\".12 .599 .051 .007\">source</w><w x=\".182 .596 .068 .01\">software</w><w x=\".259 .596 .041 .01\">used.</w><w x=\".311 .596 .093 .01\">Ircres-CNR</w><w x=\".414 .596 .026 .01\">has</w><w x=\".451 .596 .073 .013\">deployed</w><w x=\".534 .596 .024 .01\">the</w><w x=\".569 .596 .068 .01\">software</w><w x=\".647 .596 .028 .01\">and</w><w x=\".686 .599 .048 .007\">server</w><w x=\".743 .596 .076 .013\">platforms</w><w x=\".83 .596 .017 .01\">of</w><w x=\".856 .596 .024 .01\">the</w></l><l><w x=\".119 .613 .086 .013\">repository,</w><w x=\".216 .613 .015 .01\">in</w><w x=\".242 .616 .008 .007\">a</w><w x=\".261 .613 .086 .01\">virtualized</w><w x=\".358 .613 .028 .01\">and</w><w x=\".396 .613 .08 .01\">redundant</w><w x=\".487 .613 .113 .012\">infrastructure,</w><w x=\".611 .613 .011 .01\">it</w><w x=\".633 .613 .031 .01\">also</w><w 
x=\".675 .613 .033 .01\">take</w><w x=\".718 .616 .033 .007\">care</w><w x=\".762 .613 .017 .01\">of</w><w x=\".789 .613 .024 .01\">the</w><w x=\".823 .613 .056 .013\">design,</w></l><l><w x=\".119 .629 .104 .013\">development</w><w x=\".23 .629 .028 .01\">and</w><w x=\".264 .63 .103 .011\">management</w><w x=\".372 .629 .017 .01\">of</w><w x=\".395 .629 .024 .01\">the</w><w x=\".425 .629 .032 .01\">web</w><w x=\".464 .629 .046 .013\">portal</w><w x=\".517 .629 .087 .012\">(front-end)</w><w x=\".611 .629 .023 .01\">for</w><w x=\".64 .629 .024 .01\">the</w><w x=\".67 .629 .102 .013\">presentation,</w><w x=\".779 .629 .067 .01\">research</w><w x=\".852 .629 .028 .01\">and</w></l><l><w x=\".119 .645 .084 .013\">consulting</w><w x=\".209 .645 .033 .01\">data</w><w x=\".247 .645 .017 .01\">of</w><w x=\".269 .645 .024 .01\">the</w><w x=\".299 .645 .085 .013\">digitalized</w><w x=\".389 .645 .042 .01\">items</w><w x=\".438 .645 .054 .013\">(lyrics,</w><w x=\".499 .645 .044 .013\">lyrics</w><w x=\".549 .647 .034 .01\">text,</w><w x=\".589 .645 .088 .012\">interviews,</w><w x=\".683 .645 .052 .012\">books,</w><w x=\".741 .645 .063 .013\">poems).</w></l><l><w x=\".12 .695 .04 .009\">KEY</w><w x=\".165 .694 .077 .01\">WORDS:</w><w x=\".25 .698 .102 .01\">open-source,</w><w x=\".358 .694 .078 .012\">islandora,</w><w x=\".442 .694 .086 .013\">repository,</w><w x=\".534 .694 .05 .013\">digital</w><w x=\".591 .694 .063 .012\">archive,</w><w x=\".66 .694 .061 .01\">cultural</w><w x=\".726 .694 .064 .013\">heritage</w></l><l><w x=\".119 .744 .033 .01\">JEL</w><w x=\".157 .744 .069 .01\">CODES:</w><w x=\".234 .744 .03 .01\">Z11</w></l><l><w x=\".119 .864 .202 .001\">____________________</w></l><l><w x=\".12 .886 .119 .013\">*Corresponding</w><w x=\".244 .887 .05 .009\">author:</w><w x=\".302 .887 .178 .012\">anna.perin@ircres.cnr.it</w></l></b></p></ocr>"]}]
  }}

jbaiter commented 3 years ago

There you go, the OCR that you feed to the index does not have any whitespace between the words! The plugin relies on the whitespace in the OCR when parsing it, i.e. <w ...>hello</w><w>world</w> will parse to helloworld. Make sure you don't throw away the whitespace between the ocrx_word spans that you get back from djvu2hocr.
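A quick way to check a document for this before indexing could look like the Python sketch below. The helper names are invented for this example, and the repair step is only a stopgap that assumes a space after every `</w>` is acceptable; fixing the conversion step so the whitespace is never dropped is the cleaner solution.

```python
import re


def count_glued_words(markup: str) -> int:
    """Count places where a word element directly follows another one."""
    return len(re.findall(r"</w><w\b", markup))


def separate_words(markup: str) -> str:
    """Insert a space after every </w> that is directly followed by a tag."""
    return re.sub(r"</w>(?=<)", "</w> ", markup)


glued = '<l><w x="1 1 1 1">hello</w><w x="2 1 1 1">world</w></l>'
print(count_glued_words(glued))
print(separate_words(glued))
```

Running the check over the stored field value from the JSON response above would flag every word boundary in the document.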

giancarlobi commented 3 years ago

@jbaiter One last question (I hope): why does that happen with 0.6.0 and not with 0.5.0? Anyway, thanks a lot, really.

jbaiter commented 3 years ago

Good question! The 0.5.0 code wrapped Lucene's HTMLStripCharFilter. This filter outputs a lot of extra whitespace/newlines between node texts.

For example, this is what your whitespace-less document looked like after being run through the HTMLStripCharFilter:

```
Rapporto Tecnico, numero 3, Agosto 2016 FABB Repository dal progetto al prototipo. Nuove forme di conservazione, condivisione e valorizzazione di opere digitali Giancarlo Birello, Ivano Fucile, Valter Giovanetti Ircres-CNR Ufficio IT Strada delle Cacce, 73 10135 Torino Italy Anna Perin* Ircres-CNR Biblioteca Via Real Collegio, 30 10024 Moncalieri TO Italy ABSTRACT: FABB project (Famine and Feast, Fame e Abbondanza) has been committed by Fondazione CRT. This technical report analyzes the strategies adopted and the main open- source software used. Ircres-CNR has deployed the software and server platforms of the repository, in a virtualized and redundant infrastructure, it also take care of the design, development and management of the web portal (front-end) for the presentation, research and consulting data of the digitalized items (lyrics, lyrics text, interviews, books, poems). KEY WORDS: open-source, islandora, repository, digital archive, cultural heritage JEL CODES: Z11 ____________________ *Corresponding author: anna.perin@ircres.cnr.it
```

The new parser only outputs whatever whitespace there is in the input document (and normalizes runs of consecutive spaces to a single space character to deal with indentation). If there is no whitespace in the input document, the parsed text will not have any whitespace either.
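The difference between the two behaviors can be imitated in a few lines of Python. Both functions are rough stand-ins written for this explanation, not the plugin's or Lucene's actual code: the first simply turns every tag into a space, the second keeps only the whitespace present in the input and collapses runs of it.

```python
import re
import xml.etree.ElementTree as ET


def text_tags_to_spaces(markup: str) -> str:
    """Stand-in for the 0.5.0 behavior: every tag turns into whitespace."""
    return re.sub(r"<[^>]*>", " ", markup)


def text_from_input_only(markup: str) -> str:
    """Stand-in for the 0.6.0 behavior: keep only whitespace that is in the
    input, collapsing runs of whitespace into a single space."""
    text = "".join(ET.fromstring(markup).itertext())
    return re.sub(r"\s+", " ", text).strip()


glued = '<ocr><l><w x=".1 .1 .1 .1">Rapporto</w><w x=".2 .1 .1 .1">Tecnico,</w></l></ocr>'
spaced = glued.replace("</w><w", "</w> <w")

print(text_tags_to_spaces(glued).split())  # words come out separately
print(text_from_input_only(glued))         # words run together
print(text_from_input_only(spaced))        # words separated by one space
```

With the tag-stripping approach the missing whitespace is masked, which is why the same documents appeared to index correctly under 0.5.0.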

giancarlobi commented 3 years ago

Thanks a lot for your time on this, have a nice evening!!! And keep in mind, there is a really good bottle of wine waiting for you here when you come to Italy!!