TeluguOCR / banti_telugu_ocr

End to end OCR system for Telugu. Based on Convolutional Neural Networks.
Apache License 2.0

How to use your serial project to realize an OCR for Chinese #14

Open wanghaisheng opened 8 years ago

wanghaisheng commented 8 years ago

I have a lot of XPS/PDF files which can be converted to JPEG files.
  1. Do I need to generate millions of Chinese character images, as your "datagen_initio" does?
  2. What about fonts and encoding for Chinese characters ("Mallicodes")?
  3. Do I need to prepare box files generated by antanci_segmenter / OCR Segmenter?

ChillarAnand commented 8 years ago

@rakeshvar It would be great if you could list the steps that need to be followed to extend banti to other languages.

rakeshvar commented 8 years ago

@wanghaisheng You can probably find many implementations of Chinese OCR elsewhere on the web. It is a problem that has received much more attention than OCR for Indian languages. But if you want to follow along the same lines, here is a brief outline.

  1. Generate a lot of images to train a CNN with, and then train the CNN.
  2. Redesign the segmentation part (page.py) to better suit Chinese (you should be able to find Chinese text segmenters online too).
  3. Specify an n-gram dictionary of counts (build123grams.py).
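
As a rough illustration of step 3, here is a minimal sketch of counting 1-, 2- and 3-grams from a text corpus. It does not reproduce build123grams.py: the function name `count_ngrams`, the character-level units, and the tiny sample corpus are all assumptions made for the example.

```python
from collections import Counter


def count_ngrams(lines, max_n=3):
    """Count character n-grams of length 1..max_n over an iterable of text lines.

    Hypothetical helper for illustration only; it is not part of banti.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for line in lines:
        text = line.strip()
        for n in range(1, max_n + 1):
            # Slide a window of width n across the line.
            for i in range(len(text) - n + 1):
                counts[n][text[i:i + n]] += 1
    return counts


# Tiny made-up corpus; a real one would be a large body of Chinese text.
corpus = ["中文识别", "中文文本"]
counts = count_ngrams(corpus)
```

Here `counts[2]` maps each character pair to the number of times it occurs. A language model built on such counts can then help choose between visually similar candidates proposed by the CNN.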

rakeshvar commented 8 years ago

@ChillarAnand I am not sure how well suited the banti framework is for extension. It can be extended, no doubt. I am thinking of the chamanti framework, which is much easier to extend. You might be interested in working on that. I can post guidelines for it.

What do you think is the best way to make this collaborative with a minimal amount of work from my side? (I really cannot spend much time on these things.) A github.io page? A Google group? Ideally there would be a post, with scope for discussions and questions. Please do suggest. Thanks.

ChillarAnand commented 8 years ago

@rakeshvar Should we use the GitHub issue tracker itself for discussion?

wanghaisheng commented 8 years ago

A blog post would be best.