alephdata / aleph

Search and browse documents and data; find the people and companies you look for.
http://docs.aleph.occrp.org
MIT License

Option to disable OCR while uploading a document to Aleph #2122

Open sunu opened 2 years ago

sunu commented 2 years ago

Sometimes a PDF document we're uploading already has a layer of OCRed text. Currently, Aleph OCRs the document again, so the extracted text ends up containing duplicates.

Ideally, we should provide a way to tell Aleph not to OCR a document when uploading it through alephclient or the UI.
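To make the request concrete, here is a minimal sketch of the decision such an option could drive. It is not Aleph's or alephclient's actual API: the function names, the `skip_ocr` flag, and the thresholds are all hypothetical, and the per-page text is assumed to have been extracted from the PDF's existing text layer by some other step.

```python
from typing import Sequence


def has_text_layer(
    page_texts: Sequence[str],
    min_chars_per_page: int = 20,   # hypothetical threshold
    min_page_ratio: float = 0.8,    # hypothetical threshold
) -> bool:
    """Return True if enough pages already carry extractable text."""
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= min_page_ratio


def should_run_ocr(page_texts: Sequence[str], skip_ocr: bool = False) -> bool:
    """Decide whether OCR should run for an uploaded document.

    `skip_ocr` stands in for the requested upload option (hypothetical name).
    """
    # An explicit opt-out from the uploader always wins.
    if skip_ocr:
        return False
    # Otherwise only OCR documents that lack a usable text layer.
    return not has_text_layer(page_texts)
```

With this shape, an explicit flag covers the case described above, while the text-layer check is one possible fallback for uploads where nobody set the flag.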

sunu commented 2 years ago

Here's a document with OCRed text to test this behaviour: ocr.pdf