CDCgov / ReportVision

Intelligent Data Workflow Automation
Apache License 2.0

Microsoft OCR model is now giving extremely bad results #320

Open jonchang opened 5 days ago

jonchang commented 5 days ago

Describe the bug

Microsoft OCR models are now giving very bad results (hallucinating) as seen in this gif from https://github.com/CDCgov/ReportVision/pull/316:

[GIF: Kapture 2024-10-11 at 13 48 24]

Impact

We cannot rely on the OCR pipeline to return accurate text.

To Reproduce

Steps to reproduce the behavior:

  1. Attempt to OCR this image using our pipeline: lab_template_crop
  2. Observe poor results: MILKE, JOHORA, JOHORA, JOHORA, JOHORA, JOHORANANANAN (confidence: 34.49%)

Expected behavior

Accurate, non-hallucinated text from the OCR model.


bora-skylight commented 5 days ago

@jonchang Thank you for picking this up!

@jonchang @schreiaj can you add what changes you would want to make to address this bug, beyond the investigation itself?

jonchang commented 5 days ago

I don't think it's possible to say what the fix is without figuring out where the problem is coming from first.

schreiaj commented 5 days ago

> @jonchang Thank you for picking this up!
>
> @jonchang @schreiaj can you add what changes you would want to make to address this bug, beyond the investigation itself?

Critical - we need to understand it. My proposed acceptance criteria for this ticket would be:

I'd also like to time-box this investigation if possible. @jonchang, do you think you can get the above by Dev Sync on Friday?

jonchang commented 3 days ago

Attempt to OCR the following image:

[image: lab_template_crop]

The confidence scores for the bad results on this image are low (we should decide later whether it even makes sense to return an OCR result when confidence is this poor). I suspected that the extra-wide "segment" from the gif in #316 was giving the model extra room to hallucinate. To test this, I wrote a script that OCRs the "uncropped" segment and compares it against a better "medium" crop and an ideal "close" crop.

```python
import cv2 as cv

from ocr.services.image_ocr import ImageOCR

# Load the problematic segment and build three crops of decreasing width.
segment = cv.imread("lab_template_crop.png")
_, w, _ = segment.shape
segments = dict(
    wide=segment,                    # original, uncropped segment
    medium=segment[:, : w // 2, :],  # left half
    close=segment[:, : w // 10, :],  # left tenth, tight around the text
)

ocr = ImageOCR()
values = ocr.image_to_text(segments=segments)

print("{:<8} {:<60} {:<10}".format("Label", "Text", "Confidence"))
for label, (text, confidence) in values.items():
    print("{:<8} {:<60} {:<10.2f}".format(label, text, confidence))
```

We can see below that the widest "uncropped" version hallucinates the most, the "medium" crop produces a better result with higher confidence, and the "close" crop gives the best (and correct) result.

```
Label    Text                                                         Confidence
wide     MILKE, JOHORA, JOHORA, JOHORA, JOHORA, JOHORANANANAN         34.49
medium   MIKE :                                                       78.07
close    MIKE                                                         95.59
```
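Separately from any cropping fix, here is a minimal sketch of the confidence gate mentioned above, assuming `image_to_text` keeps returning `(text, confidence)` pairs keyed by label as in the script; the 50% floor is an illustrative guess, not a tuned value:

```python
# Illustrative cutoff only; the real floor would need tuning against data.
CONFIDENCE_FLOOR = 50.0

def drop_low_confidence(values: dict, floor: float = CONFIDENCE_FLOOR) -> dict:
    """Filter out OCR results whose confidence (a percentage) is below `floor`.

    `values` maps a segment label to a (text, confidence) pair, the shape
    ImageOCR.image_to_text returns in the script above.
    """
    return {
        label: (text, confidence)
        for label, (text, confidence) in values.items()
        if confidence >= floor
    }
```

With the results above, a 50% floor would suppress the hallucinated "wide" result while keeping "medium" and "close".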

In https://github.com/CDCgov/ReportVision/pull/248 I implemented an algorithm to break up large blocks of text into individual lines. This could be adapted to "autocrop" blank space around text, or break up longer lines of text into separate words or phrases to prevent the OCR from hallucinating additional text.
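For illustration, a minimal autocrop sketch with OpenCV; the Otsu threshold and the `margin` padding are my assumptions here, not details taken from the #248 algorithm:

```python
import cv2 as cv
import numpy as np

def autocrop(segment: np.ndarray, margin: int = 10) -> np.ndarray:
    """Crop blank space around dark text on a light background.

    The Otsu threshold and the `margin` padding are illustrative choices.
    """
    gray = cv.cvtColor(segment, cv.COLOR_BGR2GRAY)
    # Invert so text pixels are nonzero, then let Otsu pick the threshold.
    _, binary = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV + cv.THRESH_OTSU)
    coords = cv.findNonZero(binary)
    if coords is None:
        return segment  # no ink at all; leave the segment untouched
    x, y, w, h = cv.boundingRect(coords)
    img_h, img_w = gray.shape
    # Pad the text bounding box by `margin` pixels, clamped to image bounds.
    x0, y0 = max(x - margin, 0), max(y - margin, 0)
    x1, y1 = min(x + w + margin, img_w), min(y + h + margin, img_h)
    return segment[y0:y1, x0:x1]
```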

The acceptance criteria in https://github.com/CDCgov/ReportVision/issues/207 said that this algorithm should not be run on single lines of text. In light of these results, that criterion should be loosened: I suggest reusing the subdivision algorithm from #248 and tuning it to crop blank space around text more aggressively and to send smaller chunks of text to the OCR model, as in the sketch below.
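As a possible starting point for that tuning, a hypothetical word-level splitter that cuts a single text line at long runs of blank pixel columns; `min_gap` is an assumed parameter that would need tuning against real templates:

```python
import cv2 as cv
import numpy as np

def split_line_into_words(segment: np.ndarray, min_gap: int = 15) -> list[np.ndarray]:
    """Split one line of text into word-sized sub-segments.

    Runs of at least `min_gap` blank pixel columns are treated as word
    boundaries; `min_gap` is an illustrative guess.
    """
    gray = cv.cvtColor(segment, cv.COLOR_BGR2GRAY)
    _, binary = cv.threshold(gray, 0, 255, cv.THRESH_BINARY_INV + cv.THRESH_OTSU)
    ink_per_column = binary.sum(axis=0)  # total ink in each pixel column

    words, start, gap = [], None, 0
    for x, ink in enumerate(ink_per_column):
        if ink > 0:
            if start is None:
                start = x  # first inked column of a new word
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The gap is wide enough: cut just before it began.
                words.append(segment[:, start : x - gap + 1, :])
                start, gap = None, 0
    if start is not None:
        words.append(segment[:, start:, :])  # trailing word
    return words
```

Each returned sub-segment could then be fed to the OCR model individually, which the results above suggest should raise confidence and reduce hallucination.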