Open ghost opened 5 years ago
We can extract it from PDFs (with this https://github.com/modesty/pdf2json/) and from plain text files (obviously). I don't know if there are any other common file formats where you can do that.
Hmm... Maybe, later, we should rename the "ocr" parts to just "extractedText" and write a function for each file-format we can extract text from.
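A rough sketch of what "one function per file format" could look like in Node.js. The names `registerExtractor` and `extractText` are made up for illustration and are not existing project code. Only the plain-text case is filled in. A PDF extractor (e.g. built on pdf2json) would be registered the same way.

```javascript
const fs = require("fs");
const path = require("path");

// Hypothetical registry: file extension -> function returning the extracted text.
const extractors = new Map();

function registerExtractor(extension, extractFn) {
  extractors.set(extension.toLowerCase(), extractFn);
}

// Plain text needs no parsing: the file content is the extracted text.
registerExtractor(".txt", (filePath) => fs.readFileSync(filePath, "utf8"));

// Returns the extracted text, or null when no extractor is registered
// for the file's extension.
function extractText(filePath) {
  const extractFn = extractors.get(path.extname(filePath).toLowerCase());
  return extractFn ? extractFn(filePath) : null;
}
```

With this layout, supporting a new format would only mean adding another `registerExtractor` call, and the result could be stored under a generic "extractedText" field no matter which extractor produced it.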
txt/rtf/doc/docx/excel/odt and so on are easy to parse.
pdf: there are 2 methods needed here, because some items in the PDF can be embedded as graphics while others are just text passages.
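One way the two PDF methods could be combined, as a simplified sketch: take the embedded text of a page when there is any, and fall back to OCR otherwise. The page shape (`embeddedText`, `image`) and the `ocrFn` parameter are assumptions for illustration only. A real PDF can mix text and graphics on the same page, which this per-page fallback does not handle.

```javascript
// Hypothetical helper: `pages` is an array of { embeddedText, image } objects,
// and `ocrFn` is whatever (async) OCR function the project ends up using.
async function extractPdfText(pages, ocrFn) {
  const parts = [];
  for (const page of pages) {
    const embedded = (page.embeddedText || "").trim();
    // Prefer embedded text; only run OCR when the page has none.
    parts.push(embedded !== "" ? embedded : await ocrFn(page.image));
  }
  return parts.join("\n");
}
```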
In addition to the OCR-feature ( #14 ), we could parse embedded text-passages from pdf-files (or even more filetypes?).