breezedeus / CnOCR

CnOCR: Awesome Chinese/English OCR Python toolkit based on PyTorch. It comes with 20+ well-trained models for different application scenarios and can be used directly after installation. [A Chinese/English OCR Python package based on PyTorch/MXNet.]
https://www.breezedeus.com/article/cnocr
Apache License 2.0
3.24k stars 504 forks

the format of data_root, train_file and test_file #2

Closed ShaneYS closed 5 years ago

ShaneYS commented 5 years ago

Thanks for your great work. I want to train the network on my own data, but I don't know the format of the training set. Could you tell me the format of the image names and of the train txt file? Also, is there any requirement on image size? Thanks.

breezedeus commented 5 years ago

Hi. You can find the data format description here: https://github.com/diaomin/crnn-mxnet-chinese-text-recognition. It looks like this:

Data Preparation

  1. Download the Synthetic Chinese Dataset (contributed by https://github.com/senlinuc/caffe_ocr, with many thanks).

    A glance at the dataset:

    • almost 3.6 million synthetic Chinese text images.
    • 5,990 different categories in total.
    • each image contains 10 characters.
  2. Create train.txt and test.txt in the following format:
           image_name1 label1_1 label1_2 label1_3...
           image_name2 label2_1 label2_2 label2_3...
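
To make the line format above concrete, here is a minimal sketch of writing and reading one such line. The helper names `format_line` and `parse_line` and the sample file name are made up for illustration and are not part of cnocr. The labels are treated as opaque whitespace-separated tokens, since the format shown above does not say more about them.

```python
from typing import List, Tuple


def format_line(image_name: str, labels: List[str]) -> str:
    """Build one line: the image name followed by its space-separated labels."""
    return " ".join([image_name, *labels])


def parse_line(line: str) -> Tuple[str, List[str]]:
    """Split one line back into (image_name, labels)."""
    parts = line.split()
    if len(parts) < 2:
        raise ValueError(f"expected an image name and at least one label: {line!r}")
    return parts[0], parts[1:]


# Hypothetical sample values, only to show the round trip.
line = format_line("img_0001.jpg", ["12", "7", "305"])
name, labels = parse_line(line)
print(line)
print(name, labels)
```

Note that this sketch assumes image names contain no spaces, because the first whitespace-separated token is taken as the name.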

breezedeus commented 5 years ago

Examples:

image

breezedeus commented 5 years ago

The image size should be 280 (width) × 32 (height) pixels. An example: https://github.com/breezedeus/cnocr/blob/master/examples/20457890_2399557098.jpg
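
If your own images are not already 280 × 32, one possible way to get there is to scale each image to a height of 32 while keeping its aspect ratio, then pad the remaining width on the right. This is only a sketch of the arithmetic. The scale-then-pad strategy and the function names are assumptions for illustration, not something the thread or cnocr prescribes, and the actual pixel resizing would still be done with an image library of your choice.

```python
from typing import Tuple

TARGET_WIDTH = 280   # width stated in the comment above
TARGET_HEIGHT = 32   # height stated in the comment above


def is_expected_size(width: int, height: int) -> bool:
    """Check whether an image already has the expected 280 x 32 size."""
    return (width, height) == (TARGET_WIDTH, TARGET_HEIGHT)


def plan_resize(width: int, height: int) -> Tuple[int, int, int]:
    """Return (new_width, new_height, right_padding) for a scale-then-pad plan.

    The image is scaled so its height becomes 32. If the scaled width would
    exceed 280 it is capped at 280 (the image gets squeezed horizontally),
    otherwise the remaining columns are reported as padding.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = TARGET_HEIGHT / height
    new_width = min(TARGET_WIDTH, max(1, round(width * scale)))
    return new_width, TARGET_HEIGHT, TARGET_WIDTH - new_width


print(is_expected_size(280, 32))
print(plan_resize(100, 32))
```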