OCR POC with PDFTron - Githubissues

bcgov / foi-flow

Freedom of Information modernization

Apache License 2.0


OCR POC with PDFTron #4238

Open m-prodan opened 1 year ago

m-prodan commented 1 year ago

Title of ticket:

Description

This is a key dependency for #4120. We know PDFTron uses a third-party OCR library; can we do the same and make it work to OCR image files?

Dependencies

Are there any dependencies?

DOD

[ ] Analyze Build vs. Buy: PDFTron's IRIS OCR vs. a custom OCR library of our own. Create a set of samples (handwritten, scanned printed, etc.) from the CFD team
[ ] See what the options are for exporting OCR'd records so that they are searchable by interest holders outside the system
[ ] Is PDFTron's Export Text exposed in the API so that we can use our custom OCR? If not, buy is the only option
[ ] Research and Choose Library
[ ] Create POC for OCR integration with PDFTron

lmullane commented 1 year ago

Check to see if your own OCR library can highlight search terms in the redaction app.

Try scanned PDFs and handwritten notes.
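
One way the search-term check could be approached, as a rough sketch only: Tesseract can emit TSV output with a bounding box per recognized word, and those boxes could in principle be handed to a viewer as highlight rectangles. The snippet below assumes the standard Tesseract TSV column layout; `find_term_boxes` and `SAMPLE_TSV` are hypothetical names made up for illustration, and how the boxes would be passed to the redaction app is not covered here.

```python
import csv
import io


def find_term_boxes(tsv_text, term):
    """Return (left, top, width, height) for each OCR'd word matching `term`.

    Assumes `tsv_text` is Tesseract TSV output with a header row that
    includes the columns left, top, width, height and text.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    wanted = term.lower()
    boxes = []
    for row in reader:
        # Non-word rows (pages, blocks, lines) carry an empty text field.
        word = (row.get("text") or "").strip().strip(".,;:!?").lower()
        if word and word == wanted:
            boxes.append(
                (int(row["left"]), int(row["top"]),
                 int(row["width"]), int(row["height"]))
            )
    return boxes


# Hypothetical sample data in the TSV layout, not real OCR output.
SAMPLE_TSV = "\n".join("\t".join(r) for r in [
    ["level", "page_num", "block_num", "par_num", "line_num", "word_num",
     "left", "top", "width", "height", "conf", "text"],
    ["4", "1", "1", "1", "1", "0", "10", "20", "200", "18", "-1", ""],
    ["5", "1", "1", "1", "1", "1", "10", "20", "80", "18", "96", "Invoice"],
    ["5", "1", "1", "1", "1", "2", "100", "20", "60", "18", "91", "total:"],
])

print(find_term_boxes(SAMPLE_TSV, "invoice"))
```

This only matches single words case-insensitively; multi-word phrases and handwritten samples would need separate evaluation.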

nkan-aot commented 1 year ago

OCR POC Analysis.xlsx

nkan-aot commented 1 year ago

I have uploaded my analysis notes above. Moving this task to Review; we will wait for the PDFTron demo before making a final decision.

abin-aot commented 1 year ago

Need to analyze PDFTron's backend OCR library and other Tesseract front-end options. We need to re-estimate when we start this task; moving back to Product Backlog. cc: @nkan-aot, @m-prodan

abin-aot commented 11 months ago

New thought, as discussed: (1) use the scanner hardware with any available software to scan directly to PDF (a must!), and (2) if possible, convert those PDFs to searchable PDFs so that we can use them in our DocReviewer app. We need to try this out at the FISGARD office on a machine that has access to the Scanning Team's scanners, ideally their exact machine, so that the options are tested against their system resources. In other words, my IDIR needs access to those scanners and workstations. cc: @lmullane, @m-prodan

nkan-aot commented 10 months ago

Here is additional analysis after meeting with PDFTron:

OCR comparison.xlsx

The Python POCs have been uploaded to the MS Teams dev channel folder.

The tesseract.py POC has been pushed to branch dev-NK-4238.
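
For readers without access to the branch, here is a minimal sketch of what a Tesseract-based image-to-searchable-PDF step can look like. It is not the contents of tesseract.py; it just assumes the `tesseract` command-line tool is installed and uses its `pdf` output config, which writes a PDF with a text layer. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Build the CLI call: tesseract <image> <outputbase> -l <lang> pdf.

    The trailing "pdf" config makes Tesseract write <outputbase>.pdf.
    """
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf"]


def ocr_to_searchable_pdf(image_path, output_base, lang="eng"):
    """Run Tesseract on one image and return the path of the PDF it writes."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(
        build_tesseract_command(image_path, output_base, lang), check=True
    )
    return Path(str(output_base) + ".pdf")


# Example with hypothetical file names; only the command is built here.
print(build_tesseract_command("scan.png", "out/scan"))
```

Tesseract takes image files as input, so scanned PDFs would have to be rasterized to images first; that step is left out of this sketch.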