esmero / strawberry_runners

A post processing Drupal 8/9 module for Strawberryfield dispatched events
GNU Lesser General Public License v3.0

PDFALTO non fatal errors breaking OCR #74

Closed DiegoPino closed 3 weeks ago

DiegoPino commented 1 year ago

What?

This is a multi-part issue. We found a PDF that, when processed through PDFALTO, did generate correct OCR but also threw thousands of PDF standard syntax errors. Because the output of PDFALTO goes directly to the console (terminal), the errors ended up mixed in with the resulting XML, which then could not be processed. But here is where the larger issue happened: when Hydroponics was set to 0 (meaning run until finished), the failure triggered endless re-enqueuing (I'm pretty sure I coded a maximum of 3 retries), and the queue got stuck for days trying over and over.
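To make the expected retry behavior concrete, here is a minimal sketch of a "run until finished" loop that still caps retries per item. It is written in Python for illustration only and is not the actual Hydroponics code. `drain`, `MAX_ATTEMPTS`, and the `(item, attempts)` tuple layout are hypothetical names chosen for this example.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical cap, matching the "3x max retries" intent


def drain(queue, process, max_attempts=MAX_ATTEMPTS):
    """Run until the queue is empty, trying each item at most max_attempts times.

    queue holds (item, attempts_so_far) tuples. Items that keep failing are
    returned in a list instead of being re-enqueued forever.
    """
    failed = []
    while queue:
        item, attempts = queue.popleft()
        try:
            process(item)
        except Exception:
            attempts += 1
            if attempts < max_attempts:
                # Still under the cap: put it back with the updated counter.
                queue.append((item, attempts))
            else:
                # Cap reached: give up on this item so the loop can finish.
                failed.append(item)
    return failed
```

The key point is that the attempt counter travels with the item. If the counter is lost or reset on re-enqueue, a "run until finished" mode has no way to terminate on an item that always fails.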


For reference, running the command manually threw syntax errors of this type (the PDF is not compliant with the PDF standard):

Syntax Error (1675049): Incorrect number of arguments in 'sc' command
Syntax Error (1675178): Incorrect number of arguments in 'sc' command
<?xml version="1.0" encoding="UTF-8"?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xsi:schemaLocation="http://www.loc.gov/standards/alto/v3/alto.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Description><MeasurementUnit>pixel</MeasurementUnit>
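One way to keep diagnostics like these out of the XML is to capture the two output streams separately. The sketch below is in Python for illustration and is not the module's code. It assumes the tool writes its syntax errors to stderr and the XML to stdout, which has not been verified here for PDFALTO. `run_and_capture`, `parse_alto`, and `FAKE_TOOL` are hypothetical names, and `FAKE_TOOL` is a stand-in child process, not the real command.

```python
import subprocess
import sys
import xml.etree.ElementTree as ET


def run_and_capture(cmd):
    """Run cmd and return (stdout_text, warnings) with the streams kept apart."""
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    # Non-fatal diagnostics are collected for logging, not mixed into the XML.
    warnings = [line for line in proc.stderr.splitlines() if line.strip()]
    return proc.stdout, warnings


def parse_alto(xml_text):
    """Parse the captured XML; raises ET.ParseError if it is malformed."""
    return ET.fromstring(xml_text)


# Stand-in for the real tool (assumption): prints one warning to stderr
# and a tiny XML document to stdout.
FAKE_TOOL = [
    sys.executable,
    "-c",
    "import sys; "
    "sys.stderr.write('Syntax Error (1): bad sc command\\n'); "
    "sys.stdout.write('<alto><Description/></alto>')",
]

if __name__ == "__main__":
    xml_text, warnings = run_and_capture(FAKE_TOOL)
    root = parse_alto(xml_text)
    print(root.tag, len(warnings))
```

With the streams separated, thousands of non-fatal warnings can be logged or counted while the XML is still parsed on its own.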
DiegoPino commented 1 year ago

@giancarlobi you might be interested in this one! @alliomeria shared that the offending PDF was built with Bluebeam, a program used to transform architectural drawings into PDFs, so the PDF might not be standard-compliant.

DiegoPino commented 3 weeks ago

Having a config solves the problem here, so closing as solved.