Open techgique opened 2 years ago
Have read through more of the code to understand how OCR text is processed for word coordinates etc.
Relevant section of 0359.xml:
<TextLine ID="LINE1" STYLEREFS="TS16" HEIGHT="349" WIDTH="2285" HPOS="448" VPOS="1548">
<String ID="S1" CONTENT="{'"Coolidge" WC="0.455" CC="5 8 6 7 7 1 5 0 5 7 3" HEIGHT="349" WIDTH="1441" HPOS="448" VPOS="1548"/>
<SP ID="SP1" WIDTH="77" HPOS="1892" VPOS="1568"/>
<String ID="S2" CONTENT="Starts" WC="0.778" CC="4 0 0 0 3 5" HEIGHT="281" WIDTH="761" HPOS="1972" VPOS="1568"/>
</TextLine>
<TextLine ID="LINE2" STYLEREFS="TS16" HEIGHT="441" WIDTH="2681" HPOS="452" VPOS="1948">
<String ID="S3" CONTENT=":" WC="0.222" CC="7" HEIGHT="141" WIDTH="29" HPOS="452" VPOS="2196"/>
<SP ID="SP2" WIDTH="229" HPOS="484" VPOS="2108"/>
<String ID="S4" CONTENT=""letter" WC="0.794" CC="0 4 5 0 3 1 0" HEIGHT="329" WIDTH="1037" HPOS="716" VPOS="1948"/>
<SP ID="SP3" WIDTH="73" HPOS="1756" VPOS="2000"/>
<String ID="S5" CONTENT="Campaijrn" WC="0.568" CC="7 0 0 5 0 6 5 7 5" HEIGHT="389" WIDTH="1301" HPOS="1832" VPOS="2000"/>
</TextLine>
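For reference, a minimal sketch of how word text and coordinates can be read out of ALTO markup like the section above. This is not Open ONI's own code: `word_boxes` and `SAMPLE` are hypothetical names, and the sample writes the inner double quote as `&quot;` (an assumption about the raw file) so that the snippet parses.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of LINE1; the inner quote is escaped as &quot; (assumption).
SAMPLE = """<TextLine ID="LINE1" HEIGHT="349" WIDTH="2285" HPOS="448" VPOS="1548">
<String ID="S1" CONTENT="{'&quot;Coolidge" WC="0.455" HEIGHT="349" WIDTH="1441" HPOS="448" VPOS="1548"/>
<SP ID="SP1" WIDTH="77" HPOS="1892" VPOS="1568"/>
<String ID="S2" CONTENT="Starts" WC="0.778" HEIGHT="281" WIDTH="761" HPOS="1972" VPOS="1568"/>
</TextLine>"""


def word_boxes(xml_text):
    """Return (content, (hpos, vpos, width, height)) for every String element."""
    root = ET.fromstring(xml_text)
    boxes = []
    for el in root.iter():
        # Drop any "{namespace}" prefix so this works with or without the ALTO namespace.
        if el.tag.rsplit("}", 1)[-1] != "String":
            continue
        coords = tuple(
            float(el.get(key, "0")) for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
        )
        boxes.append((el.get("CONTENT", ""), coords))
    return boxes


if __name__ == "__main__":
    for content, coords in word_boxes(SAMPLE):
        print(repr(content), coords)
```

The SP elements are skipped because they only describe the gaps between words.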
Fixed by removing the special characters at the start of CONTENT="{'"Coolidge" in 0359.xml. Copied the unedited file as 0359.xml.orig so we can try restoring it once we upgrade to Open ONI 1.x. Will keep the issue open until we know whether the Solr library change handles this content.
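The manual fix could be scripted roughly as follows. This is a sketch under assumptions, not what was actually run: `strip_leading_junk` and `clean_alto_file` are hypothetical helpers, and the default set of characters to strip (`{`, `'`, `"`) is a guess based on the CONTENT value quoted above.

```python
import shutil
import xml.etree.ElementTree as ET


def strip_leading_junk(content, junk="{'\""):
    """Strip leading characters found in `junk`; keep the value if nothing would remain."""
    stripped = content.lstrip(junk)
    return stripped or content


def clean_alto_file(path, string_ids, junk="{'\""):
    """Back up `path` as `path + '.orig'`, then clean CONTENT on the given String IDs.

    Returns the number of String elements that were changed.
    Caveat: ElementTree rewrites namespace prefixes when saving, so on a
    namespaced ALTO file the output should be diffed against the backup.
    """
    shutil.copy2(path, path + ".orig")  # keep the unedited file for a later restore
    tree = ET.parse(path)
    changed = 0
    for el in tree.iter():
        if el.tag.rsplit("}", 1)[-1] != "String" or el.get("ID") not in string_ids:
            continue
        old = el.get("CONTENT", "")
        new = strip_leading_junk(old, junk)
        if new != old:
            el.set("CONTENT", new)
            changed += 1
    tree.write(path, encoding="utf-8", xml_declaration=True)
    return changed
```

For the page in question, the call would look like `clean_alto_file("0359.xml", {"S1"})`. Limiting the change to named String IDs leaves legitimate opening quotes elsewhere on the page (for example S4) untouched.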
nbu_indescribablebeast will not ingest and crashes with this error:
The batch is available on Chronicling America at https://chroniclingamerica.loc.gov/batches/nbu_indescribablebeast_ver01/ and the page causing the error is at https://chroniclingamerica.loc.gov/lccn/sn84024326/1924-10-20/ed-1/seq-2/
batch_nbu_indescribablebeast_ver01/data/sn84024326/00332899314/1924102001/0359.xml appears to be the file the bug is coming from, but I'm not yet certain how to bypass it. Still reviewing the related code and how it handles the text.