timj opened 3 hours ago
Ah, and I just spotted that ffmpeg didn't get the first line right: it came up with l'm instead of I'm (which is something you'd think would be impossible if it's doing language analysis).
But IIRC visually the images all look basically the same.
As for the language correction, I've been doing a little research, and it appears it doesn't use anything especially fancy to correct misspelled words. It just tries to recognize letters in logical groups (like words) instead of individually (which is what happens when you turn language correction off) and uses that as a contextual clue for the model. So, if I understand it correctly, it isn't being very smart because it hasn't been trained to be.
I suspect that Apple uses some sort of API to add an extra layer of spellchecking in its own programs like Preview. I still want to do more experimentation and see whether things like adding a background or scaling help the OCR.
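For the background/scaling experiments, here's a rough sketch of the kind of preprocessing I have in mind, in plain Python with the bitmap as a list of rows (the function name and defaults are made up; real code would work on the decoded subtitle bitmaps, probably via numpy/PIL):

```python
def pad_and_scale(img, scale=2, border=16, bg=0):
    """Add a solid background border, then do a nearest-neighbour
    integer upscale. `img` is a list of rows of pixel values; `bg`
    is the background value. Two cheap tweaks that sometimes help OCR."""
    w = len(img[0])
    blank = [bg] * (w + 2 * border)
    padded = (
        [list(blank) for _ in range(border)]
        + [[bg] * border + list(row) + [bg] * border for row in img]
        + [list(blank) for _ in range(border)]
    )
    # Repeat every pixel `scale` times horizontally and every row
    # `scale` times vertically.
    return [
        [px for px in row for _ in range(scale)]
        for row in padded
        for _ in range(scale)
    ]
```

No idea yet whether the border or the upscale matters more, which is part of what I want to test.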
I'd also love to figure out a way to programmatically add extra spacing between close words in the image, but I think that might be a little beyond my ability unless I can come up with an algorithm to do it, and I'm not holding my breath.
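One possible (untested) approach: treat any run of blank columns as an inter-word gap and pad it with extra background columns. A rough sketch in plain Python, with made-up names and thresholds:

```python
def widen_word_gaps(img, min_gap=2, extra=8, bg=0):
    """Insert `extra` background columns into every horizontal gap of
    at least `min_gap` blank columns. `img` is a list of rows of pixel
    values; `bg` is the background value. (A blank left margin gets
    padded too, which should be harmless for OCR.)"""
    width = len(img[0])
    # A column is blank if every pixel in it is background.
    blank = [all(row[x] == bg for row in img) for x in range(width)]
    keep = []  # source column indices; None means "inserted blank column"
    run = 0
    for x in range(width):
        if blank[x]:
            run += 1
        else:
            if run >= min_gap:
                keep.extend([None] * extra)  # widen the gap before this word
            run = 0
        keep.append(x)
    return [[bg if c is None else row[c] for c in keep] for row in img]
```

The `min_gap` threshold is supposed to keep a single blank column inside a letter from being treated as a word gap; whether 2 columns is the right cutoff would depend on the actual subtitle bitmaps.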
Using current `main` and the subtitle track from the fuzz.tgz that I uploaded previously, I tried the four different configurations to see how the OCR compared. For reference, the actual text for these two images is:
(which I got by copying and pasting from Preview.app, and which is perfect). The ffmpeg images do still have much smaller boundaries than the internal decoder's.
Internal Decoder
Internal + Invert
ffmpeg decoder
ffmpeg + invert
Summary
For subtitle 10, ffmpeg+invert is perfect but the internal decoder has serious issues. For subtitle 11, ffmpeg gets the first line right, and internal+invert is closest for the second line (with an extra comma at the end).