OpenPecha / Toolkit

🛠 Tools to create, edit and export texts and annotations
https://toolkit.openpecha.org
Apache License 2.0

fix(ocr): checking the abnormal post correction feature added #264

Closed kaldan007 closed 6 months ago

kaldan007 commented 6 months ago

We have noticed that our post-correction of character order is a bit too strict in some cases, which results in very unexpected output. This PR adds a function that checks for abnormal post-correction, plus a flag on the OCR formatter object that enables the check. If the flag is true and the function finds an abnormality in the post-correction, the formatter uses the original line and character order given by the Google OCR output; otherwise it uses the post-corrected order. The Google Vision formatter's flag is false by default, so the post-corrected order is used by default. The function checking the postcorrection is here.
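To make the control flow described above concrete, here is a minimal sketch. It is not the code from this PR: `Char`, `is_abnormal_postcorrection`, `choose_line`, the `check_postcorrection` parameter and the `max_moved_ratio` threshold are all hypothetical names, and the abnormality heuristic (characters lost or added, or too many positions changed) is an assumption chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Char:
    """Hypothetical stand-in for one OCR symbol: its text and left x-coordinate."""
    text: str
    x: int


def is_abnormal_postcorrection(original: List[Char], corrected: List[Char],
                               max_moved_ratio: float = 0.3) -> bool:
    """Illustrative heuristic only, not the PR's actual check.

    Flags the corrected order as abnormal when characters were lost or
    added, or when more than max_moved_ratio of the positions changed.
    """
    # Reordering must never change which characters are present.
    if sorted(c.text for c in original) != sorted(c.text for c in corrected):
        return True
    if not original:
        return False
    # Count positions that now hold a different symbol object.
    moved = sum(1 for o, c in zip(original, corrected) if o is not c)
    return moved / len(original) > max_moved_ratio


def choose_line(original: List[Char], corrected: List[Char],
                check_postcorrection: bool = False) -> List[Char]:
    """Return the corrected line unless checking is enabled and it looks abnormal."""
    if check_postcorrection and is_abnormal_postcorrection(original, corrected):
        return original
    return corrected
```

With the flag left at its default of false, `choose_line` always returns the post-corrected order, which matches the default behaviour described in this comment.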

eroux commented 6 months ago

Detecting skewed lines the right way is more effort than I can put in right now. Let's just merge that and import from GB. Ideally, in the future we should implement a proper line detection algorithm.

ngawangtrinley commented 6 months ago

The main issue is curved and wobbly lines. They are unfortunately very common in woodblock-printed material, either because the page moves to one side under the roller (https://www.youtube.com/watch?v=vow3YY9FnxY) or because it was scanned on a page-feed scanner without extra support for very long pages (something longer than the white support here: https://m.media-amazon.com/images/I/81KGnw1cd7L._AC_SX466_.jpg). I couldn't find a tool/script that does this out of the box, so I think we will need to train a specialized CV model just for this. Once the curve/wobble is straightened, splitting lines is easy.
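As a rough illustration of why line splitting is easy once the page is straight, a horizontal projection profile is usually enough: count the ink pixels in each row and cut wherever a row is empty. The sketch below assumes a binarized page given as a list of rows of 0/1 values; `split_lines` and `min_ink` are hypothetical names, not part of the toolkit.

```python
from typing import List, Tuple


def split_lines(image: List[List[int]], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each run of rows containing ink.

    `image` is assumed to be binarized: 1 for ink, 0 for background.
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # a text line ends at the first blank row
            start = None
    if start is not None:
        bands.append((start, len(image)))  # the last line runs to the bottom edge
    return bands
```

On a curved or wobbly page this approach tends to break down, because neighbouring text lines overlap in the row profile and no blank row separates them, which is why the straightening step has to come first.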