LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

FoLiA alignments in OCR output #44

Closed proycon closed 4 years ago

proycon commented 6 years ago

This may be more of a Ticcltools or foliautils issue, but I'll post it here as it is the outcome of the pipeline. When running a document through OCR, we obtain very verbose untokenised FoLiA output as follows:

<p xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10">
 <t class="OCR">
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13">DISEASES</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_14">OF</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_15">AQUATIC</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_16">ORGANISMS</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_17">Dis.</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_18">aquat.</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_19">Org.</t-str>
 </t>
 <str annotator="folia-hocr" datetime="2018-11-19T20:47:13" xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13"><t class="OCR" offset="0">DISEASES</t>
   <alignment xlink:href="FH-OllevierGeets-001-000.tif" xlink:type="simple">
    <aref id="word_1_13" type="str"/>
 </alignment>
</str>

My question is about the alignments here. They refer to tif images and mention an ID. I realize you want to tie each word to its occurrence in the image. But I don't think the TIF file contains this information (being just a bitmap afaik). Shouldn't this link to the hOCR output instead? (or is ALTO XML still involved here and should it be that?). (@kosloot I'd suggest adding a format attribute on the alignment to make clear to what kind of file (mimetype) it links)
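To illustrate why the hOCR file is the more useful link target, here is a minimal Python sketch that resolves an `aref` id against hOCR markup. The hOCR fragment, its bbox values, and the helper name `resolve_aref` are assumptions made up for illustration; they are not taken from the actual pipeline output.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment (bbox and confidence values are invented).
HOCR = """<div class='ocr_page' id='page_1'>
  <span class='ocr_line' id='line_1_5'>
    <span class='ocrx_word' id='word_1_13' title='bbox 10 20 110 40; x_wconf 95'>DISEASES</span>
    <span class='ocrx_word' id='word_1_14' title='bbox 120 20 150 40; x_wconf 96'>OF</span>
  </span>
</div>"""


def resolve_aref(hocr_xml, aref_id):
    """Return (text, bbox) for the element whose id equals aref_id, or None."""
    root = ET.fromstring(hocr_xml)
    for elem in root.iter():
        if elem.get("id") != aref_id:
            continue
        bbox = None
        # The hOCR title attribute holds ';'-separated properties such as bbox.
        for field in elem.get("title", "").split(";"):
            parts = field.split()
            if parts and parts[0] == "bbox":
                bbox = tuple(int(p) for p in parts[1:5])
        return (elem.text or "").strip(), bbox
    return None


print(resolve_aref(HOCR, "word_1_13"))
```

A lookup like this only works against a file that actually carries element ids, which a bitmap such as a TIF does not.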

Moreover, is this intermediate output that the PICCL OCR pipeline should publish as output for the user? Because it currently doesn't. And linking to something you don't output seems fairly useless.

During our last meeting @kdepuydt lamented that the FoLiA XML output of TICCL was not very human-readable. She has a point, but it is also more or less inevitable if you want to include all this higher-order information. The question is whether everybody wants that. One option would be to make certain information optional in the output (such as the substrings and alignments). Still, I'd rather include too much information than too little.

kdepuydt commented 6 years ago

Nice word choice, "lamented". It is a serious issue. At CLIN 2018 you explained that FoLiA is a format for machines. Still, users need to be able to see the output and have an indication of its quality. I would think that all the information is kept in the output, but that there is a means to select the information you want to see in a viewer, e.g. view 1: text only; view 2: show the TICCL layer below each text line; view 3: show PoS and lemma for each word below the TICCL layer. Kind of similar to what is implemented in Nederlab.

proycon commented 6 years ago

Yes, I agree, viewers should ideally let users filter the information and present only what they ask for. That's what FLAT does too; there are still issues visualising TICCL output currently, but at least the link is now set up.

An additional plain-text output in PICCL sounds like a good idea and is simple to implement; let's see what @martinreynaert says.
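As a rough idea of what such a plain-text export could look like, here is a minimal Python sketch. It assumes a simplified, namespace-free fragment modelled on the example above (real FoLiA uses an XML namespace and `xml:id`), and `plain_text` is a hypothetical helper, not an existing PICCL or foliautils function.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the paragraph shown earlier (namespaces omitted).
FRAGMENT = """<p id="par_1_10">
  <t class="OCR">
    <t-str id="word_1_13">DISEASES</t-str>
    <t-str id="word_1_14">OF</t-str>
    <t-str id="word_1_15">AQUATIC</t-str>
    <t-str id="word_1_16">ORGANISMS</t-str>
  </t>
</p>"""


def plain_text(paragraph_xml, textclass="OCR"):
    """Return the whitespace-normalised text of the given text class."""
    root = ET.fromstring(paragraph_xml)
    for t in root.findall("t"):
        if t.get("class") == textclass:
            # Join all nested text, then collapse runs of whitespace.
            return " ".join("".join(t.itertext()).split())
    return ""


print(plain_text(FRAGMENT))
```

Selecting by text class means the same helper could, in principle, export a different layer by passing another class name, provided that layer is present in the document.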

kosloot commented 6 years ago

Yes, that's what you want: all details available, but 'filtered out' when not needed.

@kdepuydt Be glad that you don't see the HOCR files, because those are really something to lament. :) For instance, a SINGLE space somewhere in the file:

<span class='ocr_line' id='line_1_1' title="bbox 0 859 68 1017; baseline 0 -98"><span class='ocrx_word' id='word_1_1' title='bbox 0 859 68 1017; x_wconf 95' lang='deu-frak' dir='ltr'>   </span> 
  </span>
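Such whitespace-only words can be flagged before conversion. The following Python sketch assumes the hOCR fragment is well-formed XML, as the example above is; `empty_word_ids` is a hypothetical helper and not part of any existing tool.

```python
import xml.etree.ElementTree as ET

# The hOCR line from the example above, used verbatim as test input.
LINE = """<span class='ocr_line' id='line_1_1' title="bbox 0 859 68 1017; baseline 0 -98"><span class='ocrx_word' id='word_1_1' title='bbox 0 859 68 1017; x_wconf 95' lang='deu-frak' dir='ltr'>   </span>
  </span>"""


def empty_word_ids(hocr_xml):
    """Return the ids of ocrx_word spans that contain only whitespace."""
    root = ET.fromstring(hocr_xml)
    return [
        el.get("id")
        for el in root.iter("span")
        if el.get("class") == "ocrx_word"
        and not "".join(el.itertext()).strip()
    ]


print(empty_word_ids(LINE))
```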

Regarding the alignments: @proycon You are right, they should refer to the HOCR file, not the TIFF. I'll fix this and add a format attribute while I'm at it. Does HOCR have a special Mime type?

proycon commented 6 years ago

Does HOCR have a special Mime type?

As per RFC 3023, I guess we'd get application/hocr+xml or text/hocr+xml.

kosloot commented 6 years ago

I have now implemented the improved ''href'' and ''format'' attributes for both ''hocr'' and ''page''.

proycon commented 4 years ago

I'm not sure to what extent this issue is still open/relevant. I know there have been quite a few changes in the ticcltools output.

proycon commented 4 years ago

(expired)