ch-a-os / DocSort

Digitize and access everything, everywhere.
GNU General Public License v3.0

Feature: Parse text from PDF-files #15

Open ghost opened 5 years ago

ghost commented 5 years ago

In addition to the OCR feature (#14), we could parse embedded text passages from PDF files (or maybe from other file types as well?).

Mondei1 commented 5 years ago

We can extract it from PDFs (with https://github.com/modesty/pdf2json/) and from plain text files (obviously). I don't know if there are any other common file formats where you can do that.

ghost commented 5 years ago

Hmm... Maybe later we should rename the "ocr" parts to just "extractedText" and write a function for each file format we can extract text from.
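
The "one function per file format" idea could be sketched roughly like this. It is only an illustration, not DocSort's actual code: the names `TextExtractor`, `registerExtractor` and `extractText` are made up here, and the extractors are synchronous to keep the example short. A real PDF extractor (for example one built on pdf2json) would probably need to be asynchronous, and is deliberately left out.

```typescript
// Hypothetical sketch: one text extractor per file format, looked up by extension.
// None of these names exist in DocSort; they are assumptions for illustration.
type TextExtractor = (data: Uint8Array) => string;

const extractors = new Map<string, TextExtractor>();

// Register an extractor for a file extension (stored lowercase, without the dot).
function registerExtractor(extension: string, extractor: TextExtractor): void {
  extractors.set(extension.toLowerCase(), extractor);
}

// Plain text files: just decode the bytes as UTF-8.
registerExtractor("txt", (data) => new TextDecoder("utf-8").decode(data));

// Returns the extracted text, or null if no extractor is registered
// for the file's extension (such files could then fall back to OCR).
function extractText(fileName: string, data: Uint8Array): string | null {
  const dot = fileName.lastIndexOf(".");
  const extension = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  const extractor = extractors.get(extension);
  return extractor ? extractor(data) : null;
}
```

With this shape, supporting a new format only means calling `registerExtractor` once, and the result can be stored in a single "extractedText" field no matter whether it came from OCR or from embedded text.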