Open jonchang opened 5 days ago
@jonchang Thank you for picking this up!
@jonchang @schreiaj can you all add what changes you would want to make in order to address this bug outside of investigation?
I don't think it's possible to say what the fix is without figuring out where the problem is coming from first.
Critical: we need to understand it. My proposed acceptance criteria for this ticket would be:
I'd also like to timebox this investigation if possible. @jonchang, do you think you can get the above things done by Dev Sync on Friday?
Attempt to OCR the following image:
The confidence scores for the bad results for this image are low (and we should decide later whether it makes sense to even return an OCR result if confidence is particularly bad). I suspected that the extra-wide "segment" from the gif in #316 was giving the model extra room to hallucinate. To test this I wrote a script that OCRs the "uncropped" segment, and compares it against a better "medium" crop and an ideal "close" crop.
We can see below that the widest "uncropped" version hallucinates the most, while the "medium" crop has higher confidence and has a better result, and the close crop has the best (and correct) result.
| Label | Text | Confidence (%) |
| --- | --- | --- |
| wide | MILKE, JOHORA, JOHORA, JOHORA, JOHORA, JOHORANANANAN | 34.49 |
| medium | MIKE : | 78.07 |
| close | MIKE | 95.59 |
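The open question of whether to return an OCR result at all when confidence is very low could be prototyped with a simple gate. The following is only a minimal sketch: `OcrResult`, `gate`, and the `MIN_CONFIDENCE` cutoff of 50 are hypothetical names and values chosen for illustration, not part of the ReportVision code, and the data is just the three rows from the table above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff; the thread has not settled on a value.
MIN_CONFIDENCE = 50.0


@dataclass
class OcrResult:
    label: str
    text: str
    confidence: float  # percent, 0-100


def gate(result: OcrResult, min_confidence: float = MIN_CONFIDENCE) -> Optional[str]:
    """Return the text only when confidence clears the cutoff, else None."""
    return result.text if result.confidence >= min_confidence else None


# The three crops reported in the table above.
results = [
    OcrResult("wide", "MILKE, JOHORA, JOHORA, JOHORA, JOHORA, JOHORANANANAN", 34.49),
    OcrResult("medium", "MIKE :", 78.07),
    OcrResult("close", "MIKE", 95.59),
]

# Map each crop label to its accepted text, or None if it was rejected.
accepted = {r.label: gate(r) for r in results}
```

With this cutoff the "wide" result would be dropped while "medium" and "close" pass. Note that the "medium" text still carries a stray colon, so a gate like this would complement tighter cropping, not replace it.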
In https://github.com/CDCgov/ReportVision/pull/248 I implemented an algorithm to break up large blocks of text into individual lines. This could be adapted to "autocrop" blank space around text, or break up longer lines of text into separate words or phrases to prevent the OCR from hallucinating additional text.
The acceptance criteria in https://github.com/CDCgov/ReportVision/issues/207 said that this algorithm should not be run for single lines of text. In light of these results, these criteria should be loosened, and I suggest that the subdivision algorithm in #248 be reused and tuned to crop blank space around text more aggressively and to send smaller lines of text to the OCR model.
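To make the "autocrop" and "split into words" ideas concrete, here is a rough sketch of one way they could work. It is not the algorithm from #248 (which is not reproduced in this thread). It assumes a grayscale image represented as a list of rows of 0-255 integers, and the names `ink_columns`, `autocrop`, and `split_on_gaps`, along with the `threshold`, `pad`, and `min_gap` defaults, are all hypothetical.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale rows: 0 = black ink, 255 = white paper


def ink_columns(img: Image, threshold: int = 128) -> List[bool]:
    """For each column, report whether it contains at least one dark pixel."""
    width = len(img[0]) if img else 0
    return [any(row[x] < threshold for row in img) for x in range(width)]


def autocrop(img: Image, threshold: int = 128, pad: int = 1) -> Image:
    """Trim blank margins, keeping up to `pad` pixels of border around the ink."""
    rows = [y for y, row in enumerate(img) if any(p < threshold for p in row)]
    cols = [x for x, has_ink in enumerate(ink_columns(img, threshold)) if has_ink]
    if not rows or not cols:
        return []  # nothing but blank space
    top = max(rows[0] - pad, 0)
    bottom = min(rows[-1] + pad, len(img) - 1)
    left = max(cols[0] - pad, 0)
    right = min(cols[-1] + pad, len(img[0]) - 1)
    return [row[left:right + 1] for row in img[top:bottom + 1]]


def split_on_gaps(img: Image, min_gap: int = 3, threshold: int = 128) -> List[Tuple[int, int]]:
    """Return (start, end) column spans of ink, end exclusive.

    A new span starts whenever at least `min_gap` blank columns separate
    two inked columns.
    """
    spans: List[Tuple[int, int]] = []
    start = None
    last_ink = None
    for x, has_ink in enumerate(ink_columns(img, threshold)):
        if not has_ink:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            spans.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        spans.append((start, last_ink + 1))
    return spans
```

Each span from `split_on_gaps` could then be cropped and sent to the OCR model on its own, so the model never sees a wide blank region. The right `min_gap` would need tuning against real segments, since too small a value would split individual letters apart.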
Describe the bug
Microsoft OCR models are now giving very bad results (hallucinating) as seen in this gif from https://github.com/CDCgov/ReportVision/pull/316:
Impact
We cannot rely on OCR pipelines
To Reproduce
Steps to reproduce the behavior:
MILKE, JOHORA, JOHORA, JOHORA, JOHORA, JOHORANANANAN
(confidence: 34.49%)
Expected behavior
Better results from the OCR
Additional context
Individual (not aggregate) confidence scores are extremely low when I dig into this code.
Resolution of the input does not seem to matter: I created a "hires" version and the output becomes
MIKE INVOICE.COME FOR DETAILS. FREE SANDWICH!
Tesseract has absolutely no problem with it.