Would you consider implementing Tappen's subextractor code?

Betsy25 commented 10 years ago

Comparing subextractor's "Compare via OCR" with Subtitle Edit's, the former is very smartly programmed. Once the OCR is done, it lets you select the correct word wherever l (el) and I (uppercase i) were confused, and after a few uses it builds up fantastic results, ending in 100% correct OCR.

Subtitle Edit's method leaves a lot to be desired in that regard.

Would it be possible to ask Tappen (on the Doom9 subtitle forum) whether you could borrow and implement his code in Subtitle Edit, if you should ever consider that, please?

Source code for subextractor: http://subextractor.codeplex.com/

TIA!

vutienphat commented 10 years ago

I've used both programs. SE OCR works well for Chinese, while subextractor has some problems.

niksedk commented 10 years ago

Yeah, it should be possible to add subextractor - especially now that SE also uses .NET Framework 4. I have many subtitles that work very badly with subextractor - subextractor excels where letters are always drawn exactly the same, pixel for pixel.

OCR via "Image compare" is just not enough! I've experimented with line segments and random pixels too, and that works okay, but still not well when the pixels vary a lot. Neural networks would be interesting to test. If you edit "Settings.xml" and set "ShowBetaStuff" to "True", a few extra OCR methods will become available... The current OCR via "image compare" in SE does not have the best letter splitter, which especially shows when working with italics, and the file format (xml / images) is very inefficient. In the "New image compare" (beta) this has been addressed - but I still feel it should be better... if you're an OCR expert or neural net expert, or just have some good ideas, please let me know :)
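
To make the trade-off above concrete, here is a minimal Python sketch of glyph matching by image compare. It is not the code of subextractor or Subtitle Edit; the names `pixel_difference`, `match_glyph`, `KNOWN` and the `max_diff_ratio` parameter are all made up for illustration, and the tiny 3x3 bitmaps are toy data. With `max_diff_ratio=0.0` a letter only matches when it is drawn exactly the same, pixel for pixel; a small tolerance accepts slightly different renderings, at the cost of possible false matches.

```python
from typing import Dict, Optional, Tuple

# A glyph is a tuple of equal-length rows; '#' is ink, '.' is background.
Glyph = Tuple[str, ...]

# Hypothetical database of already-identified letters (toy 3x3 bitmaps).
KNOWN: Dict[str, Glyph] = {
    "l": (".#.", ".#.", ".#."),
    "I": ("###", ".#.", "###"),
}


def pixel_difference(a: Glyph, b: Glyph) -> Optional[int]:
    """Count differing pixels, or return None if the sizes do not match."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return None
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def match_glyph(glyph: Glyph, known: Dict[str, Glyph],
                max_diff_ratio: float = 0.0) -> Optional[str]:
    """Return the closest known character within the allowed difference.

    max_diff_ratio=0.0 means an exact pixel-for-pixel match is required.
    """
    total_pixels = sum(len(row) for row in glyph)
    best_char: Optional[str] = None
    best_diff: Optional[int] = None
    for char, reference in known.items():
        diff = pixel_difference(glyph, reference)
        if diff is None or diff > max_diff_ratio * total_pixels:
            continue  # wrong size, or too many differing pixels
        if best_diff is None or diff < best_diff:
            best_char, best_diff = char, diff
    return best_char


# An "l" with one stray pixel in the bottom-left corner.
noisy_l: Glyph = (".#.", ".#.", "##.")

exact_result = match_glyph(noisy_l, KNOWN)                       # exact compare
tolerant_result = match_glyph(noisy_l, KNOWN, max_diff_ratio=0.15)
```

In this toy case the exact compare finds nothing for the noisy letter, while the tolerant compare still picks "l". Real subtitle bitmaps also need a letter splitter before this step, which the sketch leaves out entirely.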
