VikParuchuri / surya

OCR, layout analysis, reading order, line detection in 90+ languages
https://www.datalab.to
GNU General Public License v3.0
9.31k stars, 591 forks

Issue parsing inverted (white on black) text #112

Open nhoffman opened 2 months ago

nhoffman commented 2 months ago

Hi there - I am looking into parsing laboratory test results (unfortunately, results are often received as PDFs), and performance seems to be great except in one very specific context: a report that I'm looking at contains a critical element with white text on a black background. In this case the text is either not detected or read incorrectly. I'm a bit limited in what I can share, so this is lacking context, but here is an example of failure to detect text:

image

Incorrect results:

image image

Any suggestions on settings or pre-processing strategies that might help?

Thanks a lot!

VikParuchuri commented 1 month ago

This is a really interesting edge case. I think the challenge is the "mostly regular text with some inverted". Some ideas:

nhoffman commented 1 month ago

Thanks a lot for the suggestions - I'd love to give the fine-tuning approach a shot, but I'm not sure where to start. I know it's a big topic, but can you suggest a) a general resource describing how I would go about fine-tuning the text detection model (e.g., an overview of the process, how many examples you think might be sufficient, and whether I should provide examples cropped to the white-on-black text or examples in context); and b) in the context of this project, where the model is specified (I assume it downloads a model from Hugging Face, but I can't seem to find where this configuration is located), and how I would update the configuration to refer to the fine-tuned model? I'd certainly be happy to document the process for anyone else with a need for something similar.

Thanks a lot for any help!
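
On the pre-processing question raised at the top of this issue, here is a minimal sketch of one possible approach: invert only the regions of the page that are mostly dark, so that white-on-black text becomes black-on-white before OCR. This is an illustration, not something proposed in the thread and not part of surya's API. The function name `invert_dark_tiles`, the tile size, and the darkness threshold are all assumptions, and the image is represented as a plain list of grayscale rows to keep the example self-contained (in practice it would likely come from an image library).

```python
from statistics import mean


def invert_dark_tiles(gray, tile=32, dark_threshold=100):
    """Return a copy of `gray` in which mostly-dark tiles are inverted.

    gray: list of rows, each a list of ints in 0..255 (0 = black).
    tile: side length of the square tiles, in pixels (assumed value).
    dark_threshold: a tile whose mean intensity is below this value is
        treated as "white text on black" and inverted (assumed value).
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [row[:] for row in gray]  # copy so the input is not modified

    for top in range(0, height, tile):
        for left in range(0, width, tile):
            rows = range(top, min(top + tile, height))
            cols = range(left, min(left + tile, width))
            tile_mean = mean(gray[r][c] for r in rows for c in cols)
            if tile_mean < dark_threshold:
                # Mostly dark tile: flip it to dark text on a light background.
                for r in rows:
                    for c in cols:
                        out[r][c] = 255 - gray[r][c]
    return out


# Tiny 4x4 "page": the top-left 2x2 tile is a black box with one white pixel.
page = [
    [0, 0, 255, 255],
    [0, 255, 255, 0],
    [255, 255, 255, 255],
    [255, 255, 255, 255],
]
cleaned = invert_dark_tiles(page, tile=2)
```

Tiles that straddle the edge of a black box may be classified either way, so the tile size and threshold would need tuning on real reports. The result would then be passed to OCR in place of the original page image.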