LanguageMachines / PICCL

A set of workflows for corpus building through OCR, post-correction and normalisation

Extracting embedded text from PDFs #6

Closed martinreynaert closed 6 years ago

martinreynaert commented 7 years ago

One form of input that PICCL-NF cannot yet handle is embedded text in PDFs.

Ideally the user should not have to specify at all whether the PDF(s) contain embedded text or not. The philosophy behind PICCL (at least towards philosopher-users) is that the system has the intelligence to decide which modules to use given a particular type of input.

There are, however, quite a number of possible scenarios for a PDF with embedded text, e.g.:

a/ The embedded text may be of very poor OCR quality, so that re-OCRing is desirable. The embedded text might then still serve as extra input to TICCL to help in the post-correction.

b/ The embedded text may actually be the original born-digital text and thus be of higher quality than what will likely be obtained by re-OCRing. However, extracting embedded text from PDFs has pitfalls: more often than not all ligatures are lost (which might perhaps be solved by OCRing anyway and then running TICCL over the extracted embedded text again, with the OCRed text as additional background input).

c/ d/ ?? There are more possible scenarios, I assume.

proycon commented 7 years ago

Yes. I agree that it should be detected automatically in the pipeline. That raises the following subissues to solve before we can implement this:

1) How to detect whether a PDF has embedded text or not?
1a) How to determine whether embedded text is OCRed text? How to determine its quality? And how to make use of this in TICCL, as you suggest?
2) What tool to use for PDF text extraction? (pdf2text or other)

martinreynaert commented 7 years ago

I think that for now we should just use pdf2text.

You might make it so that by default it is run. If there is a sensible amount of textual output: assume that there was embedded text. If not, proceed with OCR only.
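A minimal sketch of that default-run heuristic, in Python. It assumes the Poppler `pdftotext` command is on the PATH; the function names, the "letters and digits only" count, and the threshold of 100 characters per page are illustrative assumptions, not anything PICCL defines.

```python
import subprocess


def extract_embedded_text(pdf_path: str) -> str:
    """Run pdftotext on a PDF and return whatever text it finds (may be empty)."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def has_sensible_text(text: str, pages: int = 1, min_chars_per_page: int = 100) -> bool:
    """Guess whether the extracted text is substantial enough to be embedded text.

    The threshold is a hypothetical default and would need tuning on real data.
    """
    # Count only letters and digits, so whitespace and the form feeds that
    # separate (possibly empty) pages do not inflate the total.
    visible = sum(1 for ch in text if ch.isalnum())
    return visible >= min_chars_per_page * max(pages, 1)


def choose_route(text: str, pages: int = 1) -> str:
    """Return "extract" when there is enough embedded text, otherwise "ocr"."""
    return "extract" if has_sensible_text(text, pages) else "ocr"
```

A caller would do something like `choose_route(extract_embedded_text("scan.pdf"))`. Note that this only answers whether text is present; it says nothing about whether that text is poor OCR or born-digital.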

Question 1a/ is harder. I have ideas about this, but they are still underdeveloped and might well not work on smaller amounts of text.

martinreynaert commented 7 years ago

Come to think of it: I should check whether 'pdfinfo' with the -meta parameter has anything to say about there being embedded text or not.

proycon commented 6 years ago

@martinreynaert Do you have a good input example text for this issue? Again preferably a small and open one.

proycon commented 6 years ago

This is now implemented in the TICCL pipeline itself: choosing --inputtype pdf by definition corresponds to text extraction from PDF. Conversely, for the OCR pipeline, PDF by definition means image extraction.

Conversion is done using pdftotext.
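For illustration only, a batch conversion with pdftotext could look roughly like the Python sketch below. This is not the pipeline's actual code: the helper names and directory layout are hypothetical, and whether PICCL passes `-enc` or `-layout` is an assumption (both are standard Poppler `pdftotext` options).

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path, layout=False):
    """Build the argument list for one pdftotext call."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout tries to preserve the physical page layout (columns, tables).
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def convert_directory(input_dir, output_dir):
    """Convert every *.pdf in input_dir to a .txt file in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        target = out / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext fails on a file.
        subprocess.run(pdftotext_command(pdf, target), check=True)
        written.append(target)
    return written
```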

Automatically determining whether to OCR or extract text has not been implemented yet; that work continues in issue #11.