DiegoPino commented 1 year ago

What?

When we extract the pure OCR text from the HOCR/PDFALTO output, we keep the HTML entity encoding. This hurts Views display because, internally, Twig cannot decode the entities and will double encode them.
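To make the double encoding concrete, here is a minimal sketch in Python (not the module's PHP). The sample string and the `autoescape` helper are assumptions for illustration only; `autoescape` is a stand-in for what a template engine's autoescaping does, not Twig itself.

```python
import html

# Hypothetical OCR text as it might look when HTML entities are kept.
ocr_text_encoded = "Smith &amp; Sons &quot;Fine Goods&quot;"


def autoescape(value: str) -> str:
    """Stand-in for template autoescaping (an assumption, not Twig's code)."""
    return html.escape(value, quote=True)


# Escaping already-encoded text encodes the ampersands a second time,
# so a browser would show the literal "&amp;" instead of "&".
double_encoded = autoescape(ocr_text_encoded)

# Decoding once before the text reaches the template avoids the problem.
decoded_first = autoescape(html.unescape(ocr_text_encoded))

print(double_encoded)
print(decoded_first)
```

In PHP the decoding step would presumably be done with `html_entity_decode()`, but where exactly to call it in the post processor is still the open question in this issue.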

I think (just a theory) this can be fixed here: https://github.com/esmero/strawberry_runners/blob/9d3bf9ed2040856c1ec5dc9cb19a8a0d568481a5/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php#L355-L356

Basically, we don't want this:

The question, if we fix this, is how we remediate existing OCRs. One way would be to detect, on reindex, whether the already cached Plain Text has HTML entities, then decode them and "update" the cache, somewhere here:

https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661 but it could also be a hook_update()?
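The detect-and-decode idea could be sketched roughly like this, again in Python rather than the datasource's PHP. The function name `remediate_cached_text` and the regular expression are hypothetical; nothing here reflects the actual cache API in strawberryfield.

```python
import html
import re
from typing import Tuple

# Matches named (&amp;), decimal (&#38;) and hex (&#x26;) character references.
ENTITY_PATTERN = re.compile(
    r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);"
)


def remediate_cached_text(cached_text: str) -> Tuple[str, bool]:
    """Return (text, changed); decode only when entities are detected."""
    if not ENTITY_PATTERN.search(cached_text):
        # Nothing that looks like an entity, so leave the cache untouched.
        return cached_text, False
    decoded = html.unescape(cached_text)
    # Report a change only if decoding actually altered the text.
    return decoded, decoded != cached_text


text, changed = remediate_cached_text("Smith &amp; Sons")
print(text, changed)
```

One caveat with detecting by pattern: a page whose OCR legitimately contains the literal characters `&amp;` would also be decoded, so a one-time update that records which caches were already processed might be the safer variant.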

@aksm what do you think? @alliomeria what do you think? @karomabiles what do you think?

aksm commented 1 year ago

@DiegoPino I think I need more context/clarification to understand the issue. Can we discuss on the next team call?

DiegoPino commented 1 year ago

Please ingest an ADO with a PDF and look at the OCR directly in Solr and in a View to see what I am describing here.

Thanks!

esmero / strawberry_runners

Pure Text extraction from HOCR is HTML entity encoded #81
