Feature: Parse text from PDF-files

ch-a-os / DocSort

Digitize and access everything, everywhere.

GNU General Public License v3.0


Feature: Parse text from PDF-files #15

Open ghost opened 5 years ago

ghost commented 5 years ago

In addition to the OCR feature (#14), we could parse embedded text passages from PDF files (or perhaps from other file types as well?).

Mondei1 commented 5 years ago

We can extract it from PDFs (with https://github.com/modesty/pdf2json/) and from plain text files (obviously). I don't know if there are any other common file formats where you can do that.

ghost commented 5 years ago

Hmm... Maybe later we should rename the "ocr" parts to just "extractedText" and write a function for each file format we can extract text from.

txt/rtf/doc/docx/excel/odt and so on are easy to parse
pdf: two methods are needed here, because some items in a PDF can be embedded as graphics while others are just text passages
all image formats are ready to be parsed by OCR
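
The "one function per file format" idea could look roughly like the sketch below. It is only an illustration in TypeScript: every name in it (`TextExtractor`, `Backends`, `makeRegistry`, `extractText`, ...) is made up and not taken from DocSort, and the PDF parser and OCR engine are passed in as plain functions so that pdf2json or any other library could sit behind them. For PDFs it simplifies the two methods to a per-page choice: use the embedded text if the page has any, otherwise fall back to OCR on the rendered page.

```typescript
// All names below are hypothetical; nothing here is taken from the DocSort code base.

/** An extractor turns raw file bytes into plain text. */
type TextExtractor = (data: Uint8Array) => Promise<string>;

/** One PDF page: its embedded text (may be empty) plus a way to render it for OCR. */
interface PdfPage {
  embeddedText: string;
  renderImage: () => Promise<Uint8Array>;
}

/** Backends are injected, so any PDF parser or OCR engine can be plugged in. */
interface Backends {
  parsePdf: (data: Uint8Array) => Promise<PdfPage[]>;
  ocr: (image: Uint8Array) => Promise<string>;
}

/** PDF needs two methods: embedded text where present, OCR where the page is only graphics. */
function makePdfExtractor(b: Backends): TextExtractor {
  return async (data) => {
    const pages = await b.parsePdf(data);
    const texts: string[] = [];
    for (const page of pages) {
      const embedded = page.embeddedText.trim();
      if (embedded !== "") {
        texts.push(embedded); // method 1: text passages stored in the PDF
      } else {
        const image = await page.renderImage();
        texts.push((await b.ocr(image)).trim()); // method 2: OCR fallback
      }
    }
    return texts.join("\n");
  };
}

/** Maps a lower-case file extension to the extractor responsible for it. */
function makeRegistry(b: Backends): Map<string, TextExtractor> {
  const plainText: TextExtractor = async (data) => new TextDecoder("utf-8").decode(data);
  const image: TextExtractor = (data) => b.ocr(data);
  return new Map<string, TextExtractor>([
    ["txt", plainText],
    ["pdf", makePdfExtractor(b)],
    ["png", image],
    ["jpg", image],
  ]);
}

/** Picks the extractor by file extension; the result would be stored as "extractedText". */
async function extractText(
  fileName: string,
  data: Uint8Array,
  registry: Map<string, TextExtractor>,
): Promise<string> {
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  const extractor = registry.get(ext);
  if (!extractor) {
    throw new Error(`No text extractor registered for ".${ext}"`);
  }
  return extractor(data);
}
```

Supporting another format (rtf, docx, odt, ...) would then only mean adding one more entry to the map. A real PDF page can mix text and graphics, so a fuller version would probably run both methods on such pages and merge the results instead of choosing one.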