ecdye / macSubtitleOCR

Convert bitmap subtitles into SubRip format using the macOS OCR engine
MIT License

Comparing OCR performance #29

Open timj opened 3 hours ago

timj commented 3 hours ago

Using current main and the subtitle track from the fuzz.tgz that I had uploaded previously, I decided to try the four different configurations to see how OCR compared.

For reference, the actual text for these two images is:

Erm, well, we're having a sale on X-Men comics

and I'm drawing Wolverine slashing prices with his adamantine claws.

(which I got by copying and pasting from Preview.app, and which is perfect). The ffmpeg images still have much smaller boundaries than the internal ones.

Internal decoder

10
00:00:44,439 --> 00:00:46,571
Erm,well, we'relhaving
fasaleioniX.Menicomics)

11
00:00:47,719 --> 00:00:50,705
and I'mdrawing/Wolverineslashing
prices-with hisiadamantine claws.

Internal + invert

10
00:00:44,439 --> 00:00:46,571
Erm,,well, we'relhaving
(ajsaleioniX.Menicomicsi

11
00:00:47,719 --> 00:00:50,705
and| im drawing/Wolverine.slashing
prices with his adamantine claws.,

ffmpeg decoder

10
00:00:44,439 --> 00:00:47,670
Erm, well, we're having
a sale oni X.Men comics

11
00:00:47,719 --> 00:00:52,235
and l'm drawing Wolverine slashing
prices with his:adamantine claws.

ffmpeg + invert

10
00:00:44,439 --> 00:00:47,670
Erm, well, we're having
a sale on X-Men comics

11
00:00:47,719 --> 00:00:52,235
andll'im drawing Wolverine slashing
prices with his.adamantine claws.

Summary

For subtitle 10, ffmpeg + invert is perfect but internal has serious issues. For subtitle 11, ffmpeg gets the first line right, and internal + invert is closest for the second line (with an extra comma at the end).
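
For anyone following along, I assume the "+ invert" runs just flip the bitmap's colors before OCR (white-on-dark text becomes dark-on-light); conceptually something like this Core Image sketch, though I haven't checked the actual implementation:

```swift
import CoreImage

// Hypothetical sketch of an invert preprocessing step using the built-in
// CIColorInvert filter; not necessarily how macSubtitleOCR does it.
func inverted(_ image: CIImage) -> CIImage {
    let filter = CIFilter(name: "CIColorInvert")!
    filter.setValue(image, forKey: kCIInputImageKey)
    return filter.outputImage ?? image
}
```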

timj commented 3 hours ago

Ah, and I just spotted that ffmpeg didn't get the first line right because it came up with l'm instead of I'm (which is something you'd think should be impossible if it's doing language analysis).

ecdye commented 6 minutes ago

But IIRC visually the images all look basically the same.

As for the language correction, I've been doing a little research, and it appears it doesn't necessarily use anything super fancy to correct wrong spellings of words. It just tries to recognize letters in logical groups (like words) instead of individually (which is what happens when you turn language correction off) and uses that as a contextual clue for the model. So basically, if I understand it correctly, it isn't being very smart because it hasn't been fully trained to be.
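
To make that concrete, the knob in question is usesLanguageCorrection on VNRecognizeTextRequest. A minimal sketch (not our exact code) of running recognition with it toggled:

```swift
import Vision

// Minimal sketch: run the Vision text recognizer over a CGImage with
// language correction switched on or off. With it off, Vision scores
// glyphs more or less individually; with it on, whole-word context is
// weighed in, which is evidently still not enough to rule out "l'm".
func recognizeText(in image: CGImage, correct: Bool) throws -> [String] {
    let request = VNRecognizeTextRequest()
    request.recognitionLevel = .accurate
    request.usesLanguageCorrection = correct
    request.recognitionLanguages = ["en-US"]

    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])

    // Take the top candidate for each recognized line.
    return request.results?.compactMap { $0.topCandidates(1).first?.string } ?? []
}
```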

I suspect that Apple uses some other API to apply an extra layer of spellchecking in its own programs like Preview. I still want to do more experimentation to see if things like adding a background or scaling help the OCR.
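
Something like this Core Image sketch is what I have in mind for those experiments; the 2x factor and the white background are just guesses to start with:

```swift
import CoreImage

// Hypothetical preprocessing: upscale the subtitle bitmap and flatten it
// onto an opaque background before handing it to Vision.
func preprocess(_ image: CIImage, scale: CGFloat = 2.0) -> CIImage {
    // Upscale with a plain affine transform.
    let scaled = image.transformed(by: CGAffineTransform(scaleX: scale, y: scale))

    // Composite over solid white so transparent regions don't end up as
    // undefined pixels in the rasterized image.
    let background = CIImage(color: .white).cropped(to: scaled.extent)
    return scaled.composited(over: background)
}
```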

I would also love to figure out a way to programmatically add extra spacing between close words in the image, but I think that might be a little beyond my ability unless I can invent an algorithm to do it; I'm not holding my breath.
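
If I ever do attempt it, my first naive idea would be to ask Vision for text rectangles and redraw them with padding in between, roughly like the sketch below; I have no idea whether the detected boxes are word-granular enough for this to actually work:

```swift
import Vision

// Naive word-respacing idea: detect text rectangles, then redraw each
// detected region into a wider canvas with a fixed gap between regions.
// VNDetectTextRectanglesRequest may report line/region boxes rather than
// clean word boxes, so the granularity here is an open question.
func respaced(_ image: CGImage, gap: Int = 12) throws -> CGImage? {
    let request = VNDetectTextRectanglesRequest()
    try VNImageRequestHandler(cgImage: image, options: [:]).perform([request])
    guard let boxes = request.results, !boxes.isEmpty else { return image }

    // Convert Vision's normalized boxes to pixel rects, sorted left to right.
    let rects = boxes.map {
        VNImageRectForNormalizedRect($0.boundingBox, image.width, image.height)
    }.sorted { $0.minX < $1.minX }

    let totalWidth = Int(rects.reduce(0) { $0 + $1.width }) + gap * (rects.count + 1)
    guard let ctx = CGContext(data: nil, width: totalWidth, height: image.height,
                              bitsPerComponent: 8, bytesPerRow: 0,
                              space: CGColorSpaceCreateDeviceRGB(),
                              bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return nil }

    var x = CGFloat(gap)
    for rect in rects {
        // CGImage.cropping(to:) uses a top-left origin while Vision uses a
        // bottom-left one, so flip the y coordinate before cropping.
        let crop = CGRect(x: rect.minX, y: CGFloat(image.height) - rect.maxY,
                          width: rect.width, height: rect.height)
        if let slice = image.cropping(to: crop) {
            ctx.draw(slice, in: CGRect(x: x, y: rect.minY,
                                       width: rect.width, height: rect.height))
        }
        x += rect.width + CGFloat(gap)
    }
    return ctx.makeImage()
}
```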