Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Any thoughts on OCR for older papers? (image-only) #12

Open sgbaird opened 1 year ago

sgbaird commented 1 year ago

EDIT: a related OCR/NLP avenue

whitead commented 1 year ago

Go for it - https://unstructured-io.github.io/unstructured/bricks.html#partition-pdf

usuyama commented 1 year ago

@sgbaird did you already try the one from unstructured.io?

I think the OCR Cognitive Service API is also quite strong: https://learn.microsoft.com/en-us/azure/applied-ai-services/form-recognizer/concept-read?view=form-recog-3.0.0

sgbaird commented 1 year ago

@usuyama not yet. Thanks for the suggestion! @ramseyissa and @hasan-sayeed are taking point on the project - taking https://mpds.io data and training a model to learn to extract that data from the full texts.

thiswillbeyourgithub commented 1 year ago

Not directly related to this discussion but I recently stumbled upon docTR, which might interest everyone here regarding OCR.

ghost commented 1 year ago

We could also go for the AWS Textract detect text API.

thiswillbeyourgithub commented 1 year ago

Btw, I have had success using Tesseract (with the right parameters) and then sending the text to ChatGPT for cleanup. It cost very little and was great at correcting pretty much all spelling mistakes and even improving the formatting (fixing indentation, etc.).

On the other hand, docTR proved quite disappointing to me: it's probably great for everything that is NOT a screenshot (handwriting, pictures taken at an angle, etc.).
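
For anyone who wants to try the Tesseract-then-LLM cleanup workflow described above, here is a rough sketch of how it could be wired together. It is not the exact setup used in this thread: the `--psm 6` setting, the prompt wording, and the helper names (`tesseract_ocr`, `build_cleanup_prompt`, `ocr_then_cleanup`) are all assumptions for illustration. The LLM call is passed in as a plain function so any chat API can be plugged in.

```python
from typing import Callable

# Assumed prompt wording; tune it for your documents.
CLEANUP_PROMPT = (
    "Fix OCR spelling mistakes and broken formatting in the text below. "
    "Do not add or remove content.\n\n"
)


def tesseract_ocr(image_path: str, config: str = "--psm 6") -> str:
    """OCR one page image with Tesseract.

    Needs pytesseract, Pillow, and the tesseract binary installed.
    "--psm 6" (treat the page as one uniform block of text) is an
    assumed starting point, not the parameters used in this thread.
    """
    import pytesseract  # imported lazily so the rest works without it
    from PIL import Image

    return pytesseract.image_to_string(
        Image.open(image_path), lang="eng", config=config
    )


def build_cleanup_prompt(raw_text: str) -> str:
    """Prepend the cleanup instruction to the raw OCR output."""
    return CLEANUP_PROMPT + raw_text.strip()


def ocr_then_cleanup(
    image_path: str,
    ocr: Callable[[str], str],
    llm: Callable[[str], str],
) -> str:
    """Run OCR on a page image, then ask an LLM to clean up the text.

    `ocr` maps an image path to raw text (e.g. tesseract_ocr).
    `llm` maps a prompt to a completion (hypothetical wrapper around
    whichever chat API you use).
    """
    raw = ocr(image_path)
    if not raw.strip():
        return ""  # blank page: skip the LLM call entirely
    return llm(build_cleanup_prompt(raw))
```

Usage would look like `ocr_then_cleanup("page_001.png", tesseract_ocr, my_llm)`, where `my_llm` is your own function that sends the prompt to a chat model and returns its reply. Keeping the OCR and LLM steps as swappable callables also makes it easy to compare Tesseract against the other OCR options mentioned in this thread.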

ghost commented 1 year ago

I use AWS Textract on a day-to-day basis. It works pretty well on handwritten data.