CDRH / open-oni_nebraska_theme

Nebraska child theme for open-oni github.com/open-oni

Fix indescribablebeast batch / processing #115

Open techgique opened 2 years ago

techgique commented 2 years ago

The nbu_indescribablebeast batch will not ingest; the load crashes with this error:

INFO:core.batch_loader:Assigned page sequence: 2
INFO:core.batch_loader:Saving page. issue date: 1924-10-20 00:00:00, page sequence: 2
ERROR:core.batch_loader:unable to load batch: EOL while scanning string literal (<string>, line 1)
ERROR:core.batch_loader:EOL while scanning string literal (<string>, line 1)
Traceback (most recent call last):
  File "/var/local/www/django/openoni/core/batch_loader.py", line 172, in load_batch
    issue = self._load_issue(mets_url)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 294, in _load_issue
    page = self._load_page(doc, page_div, issue)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 414, in _load_page
    self.process_ocr(page)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 448, in process_ocr
    self.solr.add(**page.solr_doc)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 684, in add
    return Solr.add_many(self, [fields], commit=_commit)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 325, in wrapper
    content = function(self, *args, **kw)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 512, in add_many
    self.__add(lst, doc)
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 598, in __add
    elem['value'] = escape(unicode(value))
  File "/var/local/www/django/openoni/ENV/lib/python2.7/site-packages/solr/core.py", line 1111, in __setitem__
    tmp = eval(value)
  File "<string>", line 1
    {'"Coolidge Starts
                     ^
SyntaxError: EOL while scanning string literal
WARNING:root:no OcrDump to delete for batch_nbu_indescribablebeast_ver01 (University of Nebraska-Lincoln Libraries, Lincoln, NE)
ERROR:core.management.commands.load_batch:unable to load batch: EOL while scanning string literal (<string>, line 1)
Traceback (most recent call last):
  File "/var/local/www/django/openoni/core/management/commands/load_batch.py", line 43, in handle
    batch = loader.load_batch(batch_path)
  File "/var/local/www/django/openoni/core/batch_loader.py", line 201, in load_batch
    raise BatchLoaderException(msg)
BatchLoaderException: unable to load batch: EOL while scanning string literal (<string>, line 1)
CommandError: Batch load failed. See logs/load_batch_#.log

The batch is available on Chronicling America at https://chroniclingamerica.loc.gov/batches/nbu_indescribablebeast_ver01/ and the page causing the error is at https://chroniclingamerica.loc.gov/lccn/sn84024326/1924-10-20/ed-1/seq-2/

batch_nbu_indescribablebeast_ver01/data/sn84024326/00332899314/1924102001/0359.xml appears to be the file the bug is coming from, but I'm not yet certain how to work around it. Still reviewing the related code and how it handles the text.

techgique commented 2 years ago

Have read through more of the code to understand how OCR text is processed (word coordinates, etc.).

Relevant section of 0359.xml:

<TextLine ID="LINE1" STYLEREFS="TS16" HEIGHT="349" WIDTH="2285" HPOS="448" VPOS="1548">
<String ID="S1" CONTENT="{&apos;&quot;Coolidge" WC="0.455" CC="5 8 6 7 7 1 5 0 5 7 3" HEIGHT="349" WIDTH="1441" HPOS="448" VPOS="1548"/>
<SP ID="SP1" WIDTH="77" HPOS="1892" VPOS="1568"/>
<String ID="S2" CONTENT="Starts" WC="0.778" CC="4 0 0 0 3 5" HEIGHT="281" WIDTH="761" HPOS="1972" VPOS="1568"/>
</TextLine>
<TextLine ID="LINE2" STYLEREFS="TS16" HEIGHT="441" WIDTH="2681" HPOS="452" VPOS="1948">
<String ID="S3" CONTENT=":" WC="0.222" CC="7" HEIGHT="141" WIDTH="29" HPOS="452" VPOS="2196"/>
<SP ID="SP2" WIDTH="229" HPOS="484" VPOS="2108"/>
<String ID="S4" CONTENT="&quot;letter" WC="0.794" CC="0 4 5 0 3 1 0" HEIGHT="329" WIDTH="1037" HPOS="716" VPOS="1948"/>
<SP ID="SP3" WIDTH="73" HPOS="1756" VPOS="2000"/>
<String ID="S5" CONTENT="Campaijrn" WC="0.568" CC="7 0 0 5 0 6 5 7 5" HEIGHT="389" WIDTH="1301" HPOS="1832" VPOS="2000"/>
</TextLine>
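
A minimal sketch of why this line trips the loader, written for Python 3 rather than the Python 2.7 stack in the traceback. It rebuilds the text of LINE1 from a trimmed copy of the excerpt above and passes it to eval(), which is the call the traceback shows at solr/core.py line 1111. The condition that makes the Solr library reach eval() is not visible in the traceback, so this only reproduces the eval() step itself; line_text and eval_fails are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# LINE1 from 0359.xml, trimmed to the attributes that matter here.
OCR_LINE = """<TextLine ID="LINE1">
<String ID="S1" CONTENT="{&apos;&quot;Coolidge"/>
<SP ID="SP1"/>
<String ID="S2" CONTENT="Starts"/>
</TextLine>"""


def line_text(xml_snippet):
    """Join the CONTENT of every String element in one TextLine."""
    line = ET.fromstring(xml_snippet)
    return " ".join(s.get("CONTENT") for s in line.iter("String"))


def eval_fails(value):
    """Return True if eval() rejects the value with a SyntaxError."""
    try:
        eval(value)
    except SyntaxError:
        return True
    return False


text = line_text(OCR_LINE)
print(repr(text))        # the unescaped text begins with {'" as in the traceback
print(eval_fails(text))  # the opening quote is never closed
```

The XML parser unescapes &apos; and &quot;, so the string handed to eval() begins with an opening brace and an unterminated quote, matching the fragment printed under File "<string>", line 1 in the log.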

techgique commented 2 years ago

Fixed by removing the special characters ({&apos;&quot;) at the start of the CONTENT value in CONTENT="{&apos;&quot;Coolidge" in 0359.xml. Saved a copy of the unedited file as 0359.xml.orig so we can try restoring it once we upgrade to Open ONI 1.x. Will keep the issue open until we know whether the Solr library change handles this content.
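
If other pages turn out to have the same problem, the manual edit could be scripted. The following is only a sketch under assumptions: it treats a CONTENT value starting with "{" as the trigger, which this issue has not confirmed, and suspicious_strings, clean_content, and the TextBlock wrapper in SAMPLE are hypothetical, not part of Open ONI. It is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Two String elements from the excerpt above, wrapped in a hypothetical
# TextBlock so the sample has a single root element.
SAMPLE = """<TextBlock>
<TextLine ID="LINE1">
<String ID="S1" CONTENT="{&apos;&quot;Coolidge"/>
<String ID="S2" CONTENT="Starts"/>
</TextLine>
<TextLine ID="LINE2">
<String ID="S4" CONTENT="&quot;letter"/>
</TextLine>
</TextBlock>"""


def suspicious_strings(ocr_xml):
    """Yield (ID, CONTENT) for String elements whose CONTENT starts with '{'."""
    root = ET.fromstring(ocr_xml)
    for el in root.iter():
        # Real OCR files may be namespaced; compare the local tag name only.
        if el.tag.rsplit("}", 1)[-1] == "String":
            content = el.get("CONTENT", "")
            if content.startswith("{"):
                yield el.get("ID"), content


def clean_content(content):
    """Strip leading braces and quote characters, as in the manual fix."""
    return content.lstrip("{'\"")


for string_id, content in suspicious_strings(SAMPLE):
    print(string_id, repr(content), "->", repr(clean_content(content)))
```

Only S1 is reported here; S4 begins with a quote rather than a brace, so it is left alone. Keeping a .orig copy before rewriting any file, as was done for 0359.xml, would still apply.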