Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

Can I train other languages using this model? #10

Closed ghost closed 6 years ago

ghost commented 6 years ago

First of all, thanks for sharing this amazing library.

I want to train a model for Korean.

Unlike English, Korean uses around 10,000 different characters.

I've been training with this model, but it doesn't seem to work well.

#00065275: loss=7.06757734 ler=1.00000000 dt=0.05988510s
 PRED: '‪껜‬'
 TRUE: '‪쑤‬'
#00065276: loss=7.07436581 ler=1.00000000 dt=0.06000578s
 PRED: '‪솽‬'
 TRUE: '‪쯔‬'
#00065277: loss=7.07185513 ler=1.00000000 dt=0.06007761s
 PRED: '‪혀‬'
 TRUE: '‪쌈‬'
#00065278: loss=7.07688745 ler=1.00000000 dt=0.06012164s
 PRED: '‪힐‬'
 TRUE: '‪엊‬'
#00065279: loss=7.08528412 ler=1.00000000 dt=0.06010656s
 PRED: '‪뺙‬'
 TRUE: '‪뱐‬'
#00065280: loss=7.10926293 ler=1.00000000 dt=0.06014606s
 PRED: '‪띔‬'
 TRUE: '‪졀‬'
#00065281: loss=7.11099953 ler=1.00000000 dt=0.06006387s
 PRED: '‪솰‬'
 TRUE: '‪팰‬'

Should I change the model's hyperparameters or its structure?

Any advice?

ChWick commented 6 years ago

In general, Calamari is not designed to predict single characters; it is designed to predict a complete sequence of characters (a sentence or text line) as a whole. To predict a single character, a simple classification network might be better suited (see e.g. the MNIST examples).
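To illustrate the distinction being drawn here, a per-glyph classifier maps one fixed-size character image directly to one class out of the alphabet, with no sequence modeling at all. The sketch below is not Calamari code; the 28x28 input size, the 1000-class reduced charset, and the single-layer softmax are purely illustrative assumptions (an untrained forward pass, just to show the shape of the problem).

```python
import numpy as np

rng = np.random.default_rng(0)

num_classes = 1000           # a reduced Korean charset, as suggested below
img = rng.random((28, 28))   # one glyph image, MNIST-sized for illustration

# Single-layer softmax classifier: flatten -> linear -> softmax.
# A real model would use learned (trained) weights instead of random ones.
W = rng.normal(scale=0.01, size=(28 * 28, num_classes))
b = np.zeros(num_classes)

logits = img.reshape(-1) @ W + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted_class = int(np.argmax(probs))  # index into the character alphabet
```

Calamari, by contrast, feeds a whole line image through a CNN/LSTM network and decodes a variable-length character sequence with CTC, which is why single-character training data plays against its design.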

If you still want to use Calamari: there are many parameters that could possibly affect the accuracy. You could try increasing the --batch_size, e.g. to 128. Another network structure could also be useful (--network). You can also try limiting the alphabet to test whether Calamari is able to learn a smaller charset (e.g. 1000 characters, just for testing).
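Putting the suggested flags together, a training invocation might look like the sketch below. The --batch_size and --network flags are the ones named above; the data path and the exact network string are illustrative assumptions (check `calamari-train --help` for the options your installed version accepts).

```shell
# Hypothetical paths; batch size and network structure per the advice above.
calamari-train \
    --files train_data/*.png \
    --batch_size 128 \
    --network cnn=40:3x3,pool=2x2,lstm=200,dropout=0.5
```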

ghost commented 6 years ago

@ChWick Thank you for your advice. I changed my data to sequences of characters.

ghost commented 6 years ago

@ChWick I trained using data provided by tesseract (https://github.com/tesseract-ocr/langdata/blob/master/kor/kor.training_text)

Training works quite well.

#00070662: loss=1.50009847 ler=0.01098767 dt=0.05774912s
 PRED: '‪10 연락 미용 톈진 강릉 끙 홍콩 월간 라 큰술 란 잇는 의회 쪄‬'
 TRUE: '‪10 연락 미용 톈진 강릉 끙 홍콩 월간 라 큰술 란 잇는 의회 쪄‬'
#00070663: loss=1.54514930 ler=0.01124408 dt=0.05764973s
 PRED: '‪넷째 발표 되며 ( 바향 모퉁이 세괌 16 뒤에 등 자료실 알뜰 늠름한‬'
 TRUE: '‪넷째 발표 되며 ( 방향 모퉁이 세괌 16 뒤에 등 자료실 알뜰 늠름한‬'
#00070664: loss=1.40779295 ler=0.01045460 dt=0.05747745s
 PRED: '‪카를로스 신지식 과 보다는 곳 수 바깥 역할 벼룩 질문 . 꿰어 중‬'
 TRUE: '‪카를로스 신지식 과 보다는 곳 수 바깥 역할 벼룩 질문 . 꿰어 중‬'
#00070665: loss=1.44664021 ler=0.01071776 dt=0.05732183s
 PRED: '‪쟌느 분 코뮌 디앤샵 건의 반침 19 헌법 법령 프톨레마이오스 > 골‬'
 TRUE: '‪쟌느 분 코뮌 디앤샵 건의 방침 19 헌법 법령 프톨레마이오스 > 골‬'
#00070666: loss=1.44412356 ler=0.01071776 dt=0.05723174s
 PRED: '‪17 숙박 조각 다룬다 커스텀 최저가 것이 사건 맥 답하기 뻘 탭‬'
 TRUE: '‪17 숙박 조각 다룬다 커스텀 최저가 것이 사건 맥 답하기 뻘 탭‬'

My sample prediction (a sentence not in my training dataset) looks good:

[image: input text line]

TRUE: 원대복귀 조치에 따라 둘은 육군으로 돌아가게 됐다.
PRED: 원대복귀 조치에 따라 둘은 육군으로 돌아가게 됐다.

Thanks again :+1:

P.S.: Your README.md says that modules to segment pages into lines will be available soon.

For now you recommend using the OCRopy scripts, but they're not that good.

When will this module be available?

ChWick commented 6 years ago

@a41888936 I'm very glad you got this working! Unfortunately, the line segmentation part of our complete OCR workflow also relies on the OCRopy scripts, so this module won't help you either.
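For reference, the OCRopy-based line segmentation being discussed is typically a two-step pipeline: binarize the page images, then split each binarized page into line images. The commands below are the standard OCRopy scripts; the paths are illustrative assumptions.

```shell
# Binarize and normalize the raw page scans (hypothetical paths).
ocropus-nlbin pages/*.png -o book

# Segment each binarized page into individual line images,
# which can then be fed to calamari-predict.
ocropus-gpageseg 'book/????.bin.png'
```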