Open DiegoPino opened 1 year ago
@DiegoPino I think I need more context/clarification to understand the issue. Can we discuss on the next team call?
Please ingest an ADO with a PDF, then look at the OCR directly in Solr and in a View to see what I am describing here.
Thanks!
What?
When we produce the pure OCR text from the HOCR/PDFALTO extraction, we keep the HTML entity encoding. This hurts Views display because, internally, Twig cannot decode the entities and will double-encode them.
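To make the double encoding concrete, here is a minimal, language-neutral sketch in Python (the actual code path is PHP and Twig). `render_autoescaped` is a hypothetical stand-in for an auto-escaping template layer, and the sample string is invented for illustration; neither comes from the codebase.

```python
import html


def render_autoescaped(value: str) -> str:
    """Hypothetical stand-in for a template layer that escapes on output."""
    return html.escape(value, quote=False)


# Assumed sample: plain OCR text that still carries an HTML entity.
ocr_text_with_entities = "Barnes &amp; Noble"

# Escaping text that is already entity-encoded escapes the ampersand again,
# so a browser would display the literal string "Barnes &amp; Noble".
double_encoded = render_autoescaped(ocr_text_with_entities)

# Decoding once at extraction time leaves a single layer of escaping,
# so a browser would display "Barnes & Noble".
decoded_first = render_autoescaped(html.unescape(ocr_text_with_entities))

print(double_encoded)  # Barnes &amp;amp; Noble
print(decoded_first)   # Barnes &amp; Noble
```

In PHP the decoding step would presumably be `html_entity_decode()`; where exactly it belongs in the post processor is the open question of this issue.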
I think (just a theory) this can be fixed here: https://github.com/esmero/strawberry_runners/blob/9d3bf9ed2040856c1ec5dc9cb19a8a0d568481a5/src/Plugin/StrawberryRunnersPostProcessor/OcrPostProcessor.php#L355-L356
Basically, we don't want this:
The question, if we fix this, is how we remediate existing OCRs. One way would be, on reindex, to detect whether the already cached plain text has HTML entities, decode them, and "update" the cache, somewhere here:
https://github.com/esmero/strawberryfield/blob/ce448a0ebe16650df19708459a4600d2c4d2c9e1/src/Plugin/search_api/datasource/StrawberryfieldFlavorDatasource.php#L661 but it could also be a hook_update()?
@aksm what do you think? @alliomeria what do you think? @karomabiles what do you think?