caseyg opened this issue 4 years ago
@caseyg Still need more detail on this. Please add what you know to the issue description, then re-assign to me. Thanks.
@elazar updated the issue description. I'm probably going to be offline for the rest of today, but I think the best next step is to try out a different module before we reinvent the wheel. I will leave this unassigned for now; feel free to poke around if you're interested! But I can also mess with the modules at some point to get more info, and let you know if I hit a sticking point.
There are a variety of modules which do something very similar: extract text from PDF documents ingested by Omeka.
A: https://github.com/omeka-s-modules/ExtractText
B: https://github.com/Daniel-KM/Omeka-S-module-PdfText
C: https://github.com/symac/Omeka-S-module-ExtractOcr
They all appear to use the same poppler-utils tool (`pdftotext`), but with some differences in approach.
We are currently using module A: ExtractText. Its approach is to create an `extracttext` metadata field and drop the entire PDF content in there as one gigantic blob of text.

For future search and design reasons, it would be better to have the extracted text marked up with page breaks. A future search interface could then highlight the page numbers on which string X appears, and an anchor element could be used to deep-link to the text of a specific page.
For these reasons, among others, it would be preferable to have extracted text stored as XML or HTML with tags indicating page breaks.
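To make the idea concrete, here is a minimal sketch of how page-level markup could be produced. It assumes input in the form that `pdftotext` emits, where pages are separated by form-feed (`\f`) characters; the `pdf-page` class and `page-N` anchor ids are hypothetical names for illustration, not anything a module currently produces.

```python
# Sketch: turn pdftotext-style output into HTML with per-page anchors.
# pdftotext (poppler-utils) separates pages with form feed (\f)
# characters, so splitting on \f recovers page boundaries.
from html import escape

def pages_to_html(text: str) -> str:
    """Wrap each extracted page in a <div> with a deep-linkable anchor id."""
    pages = text.split("\f")
    chunks = []
    for n, page in enumerate(pages, start=1):
        # escape() keeps any <, >, & in the extracted text from being
        # misread as markup; "page-N" ids allow #page-N deep links.
        chunks.append(
            f'<div class="pdf-page" id="page-{n}">{escape(page.strip())}</div>'
        )
    return "\n".join(chunks)

# Example with two fake "pages" of extracted text:
sample = "First page text\fSecond page text"
print(pages_to_html(sample))
```

A search interface could then report matches per `page-N` anchor rather than against one undifferentiated blob.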
(I believe this is the approach taken by module C: ExtractOcr, which "extracts OCR text in XML from PDF files," but I haven't played around with it yet.)
This thread from the Omeka Forum also indicates that this is the "blessed" option for text-type content, since markup is not allowed in Omeka S metadata fields.
Recommended next step: find out whether module C (ExtractOcr) works well for our purposes, and whether it can reprocess already-uploaded media.