Issue parsing inverted (white on black) text

nhoffman commented 2 months ago

Hi there - I am looking into parsing laboratory test results (unfortunately, results are often received as PDFs), and performance seems to be great except in one very specific context: a report that I'm looking at contains a critical element with white text on a black background. In this case the text is either not detected or read incorrectly. I'm a bit limited in what I can share, so this is lacking context, but here is an example of a failure to detect text:

Incorrect results:

Any suggestions on settings or pre-processing strategies that might help?

Thanks a lot!

VikParuchuri commented 1 month ago

This is a really interesting edge case. I think the challenge is the "mostly regular text with some inverted". Some ideas:

Finetune the text detection model with negative examples
Flood fill (I think it's called flood fill) from a corner with black, which will leave only the inverted text (the number 36) white, then invert the colors and run OCR. Separately, OCR the normal page. Merge the two results by blanking out any regions of the normal page where the inverted page has text.
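
A rough sketch of the flood-fill idea above, in case it helps. It covers only the image pre-processing, not the OCR calls. The function names (`flood_fill_black`, `isolate_inverted_text`, `blank_regions`), the threshold of 128, and the `(x0, y0, x1, y1)` box format are assumptions for illustration and are not part of surya. One caveat: enclosed white holes in ordinary letters (the inside of an "o", for example) are not connected to the corner either, so they survive the fill and would need to be filtered out, for instance by size.

```python
from collections import deque

import numpy as np


def flood_fill_black(binary: np.ndarray, seed=(0, 0)) -> np.ndarray:
    """Paint the white region 4-connected to `seed` black and return a copy."""
    out = binary.copy()
    height, width = out.shape
    if out[seed] != 255:
        return out  # seed is not on white background, nothing to fill
    queue = deque([seed])
    out[seed] = 0
    while queue:
        row, col = queue.popleft()
        for r, c in ((row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)):
            if 0 <= r < height and 0 <= c < width and out[r, c] == 255:
                out[r, c] = 0
                queue.append((r, c))
    return out


def isolate_inverted_text(gray: np.ndarray, threshold: int = 128) -> np.ndarray:
    """Return a page where only the white-on-black text remains, as black on white."""
    # Binarize so the fill only has to distinguish pure white from pure black.
    binary = np.where(gray >= threshold, 255, 0).astype(np.uint8)
    filled = flood_fill_black(binary)  # background turns black, enclosed white stays
    return 255 - filled                # invert: enclosed white text becomes black


def blank_regions(gray: np.ndarray, boxes) -> np.ndarray:
    """Paint each (x0, y0, x1, y1) box white on a copy of the normal page."""
    out = gray.copy()
    for x0, y0, x1, y1 in boxes:
        out[y0:y1, x0:x1] = 255
    return out


# Toy page: white background, some normal dark text, and a black box
# that contains two white "text" pixels.
page = np.full((7, 9), 255, dtype=np.uint8)
page[1, 1:3] = 0      # normal dark text
page[3:6, 4:8] = 0    # black box
page[4, 5:7] = 255    # white text inside the box

inverted_only = isolate_inverted_text(page)   # run OCR on this for the inverted text
normal_only = blank_regions(page, [(4, 3, 8, 6)])  # run OCR on this for everything else
```

In practice the boxes passed to `blank_regions` would come from whatever text regions the OCR pass on `inverted_only` reports, so that the black box is not read a second time on the normal page.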

nhoffman commented 1 month ago

Thanks a lot for the suggestions - I'd love to give the fine-tuning approach a shot, but I'm not sure where to start. I know it's a big topic, but can you suggest a) a general resource describing how I would go about fine-tuning the text detection model (e.g., an overview of the process, how many examples you think might be sufficient, and whether I should provide examples cropped to the white-on-black text or examples in context); b) in the context of this project, where the model is specified (I assume it downloads a model from Hugging Face, but I can't seem to find where this configuration is located), and how I would update the configuration to refer to the fine-tuned model. I'd certainly be happy to document the process for anyone else with a need for something similar.

Thanks a lot for any help!

VikParuchuri / surya

Issue parsing inverted (white on black) text #112