ghost closed this issue 6 years ago
In general, Calamari is not designed to predict single characters; instead, it is designed to predict a complete sequence of characters (a sentence) as a whole. To predict a single character, a simple classification network might be better suited (see e.g. the MNIST examples).
If you still want to use Calamari: there are many parameters that could possibly affect the accuracy. You could try increasing the --batch_size, e.g. to 128. Moreover, another network structure could be useful (--network). You can also try limiting the alphabet to test whether Calamari is able to learn a smaller charset (e.g. 1000 characters, just for testing).
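The alphabet-limiting idea can be tried outside of Calamari by filtering the training lines down to the N most frequent characters before training. This is a minimal sketch; `limit_charset` is a hypothetical helper, not part of the Calamari API:

```python
from collections import Counter

def limit_charset(lines, max_chars=1000):
    """Keep only lines made up entirely of the max_chars most frequent
    characters, yielding a smaller-alphabet training subset."""
    counts = Counter(ch for line in lines for ch in line)
    keep = {ch for ch, _ in counts.most_common(max_chars)}
    return [line for line in lines if set(line) <= keep]

# Tiny illustration: keep only lines covered by the 3 most common characters.
lines = ["가나다", "가나", "라마바사"]
print(limit_charset(lines, max_chars=3))
```

Training on the filtered subset (e.g. the 1000 most frequent Hangul syllables) quickly shows whether the network can learn a smaller charset before committing to the full alphabet.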
@ChWick Thank you for your advice. I changed my data to sequences of characters.
@ChWick I trained using data provided by Tesseract (https://github.com/tesseract-ocr/langdata/blob/master/kor/kor.training_text)
Training works quite well.
#00070662: loss=1.50009847 ler=0.01098767 dt=0.05774912s
PRED: '10 연락 미용 톈진 강릉 끙 홍콩 월간 라 큰술 란 잇는 의회 쪄'
TRUE: '10 연락 미용 톈진 강릉 끙 홍콩 월간 라 큰술 란 잇는 의회 쪄'
#00070663: loss=1.54514930 ler=0.01124408 dt=0.05764973s
PRED: '넷째 발표 되며 ( 바향 모퉁이 세괌 16 뒤에 등 자료실 알뜰 늠름한'
TRUE: '넷째 발표 되며 ( 방향 모퉁이 세괌 16 뒤에 등 자료실 알뜰 늠름한'
#00070664: loss=1.40779295 ler=0.01045460 dt=0.05747745s
PRED: '카를로스 신지식 과 보다는 곳 수 바깥 역할 벼룩 질문 . 꿰어 중'
TRUE: '카를로스 신지식 과 보다는 곳 수 바깥 역할 벼룩 질문 . 꿰어 중'
#00070665: loss=1.44664021 ler=0.01071776 dt=0.05732183s
PRED: '쟌느 분 코뮌 디앤샵 건의 반침 19 헌법 법령 프톨레마이오스 > 골'
TRUE: '쟌느 분 코뮌 디앤샵 건의 방침 19 헌법 법령 프톨레마이오스 > 골'
#00070666: loss=1.44412356 ler=0.01071776 dt=0.05723174s
PRED: '17 숙박 조각 다룬다 커스텀 최저가 것이 사건 맥 답하기 뻘 탭'
TRUE: '17 숙박 조각 다룬다 커스텀 최저가 것이 사건 맥 답하기 뻘 탭'
My sample prediction (a sentence not in my training dataset) also seems good:
TRUE: 원대복귀 조치에 따라 둘은 육군으로 돌아가게 됐다.
PRED: 원대복귀 조치에 따라 둘은 육군으로 돌아가게 됐다.
Thanks again :+1:
P.S.) Your README.md says that modules to segment pages into lines will be available soon.
You recommend using the OCRopy scripts, but they are not that good.
When can I check out this module?
@a41888936 I'm very glad you got this working! Unfortunately, the line-segmentation part of our complete OCR workflow also relies on the OCRopy scripts, so this module won't help you either.
First of all, thanks for sharing this amazing library.
I want to train a model for Korean.
Unlike English, Korean uses around 10,000 different characters.
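For reference, the figure comes from the Unicode "Hangul Syllables" block (U+AC00..U+D7A3), which holds every precomposed modern Korean syllable; the count can be verified directly:

```python
# Every modern Korean syllable is a combination of
# 19 leading consonants * 21 vowels * 28 trailing consonants,
# precomposed in the Unicode block U+AC00..U+D7A3.
n_syllables = 0xD7A3 - 0xAC00 + 1
assert n_syllables == 19 * 21 * 28
print(n_syllables)  # 11172
```

So the full Korean charset is closer to 11,172 syllables, far larger than the Latin alphabet Calamari's defaults are tuned for.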
I've been training using this model, but it doesn't seem to work well.
Should I change the model hyperparameters or the model structure?
Any advice?