clovaai / deep-text-recognition-benchmark

Text recognition (optical character recognition) with deep learning methods, ICCV 2019
Apache License 2.0

How to generate own dataset for transfer learning #166

Open vaibhavjussspacetech opened 4 years ago

vaibhavjussspacetech commented 4 years ago

Hi, I am new to the field of text recognition. I went through the "When you need to train on your own dataset or Non-Latin language datasets." post, which says that new data is generated by calling the "create_lmdb_dataset.py" file with two inputs: the path to the data and the path to the ground truth. But I can't understand what this data folder contains. Is it a set of natural images containing text, or is it simply an "empty" folder? And on what basis can I generate the ground truth?

I need to create a database containing digits along with characters, rendered in fonts such as "Arial", "MICR", etc.

I have the .ttf files for all of these fonts, and I would like to use them to generate a dataset that will then be used for transfer learning.

Please guide me. Thanks in advance.

ku21fan commented 4 years ago

Hello,

Sorry for the late reply.

I just updated our README and added the structure of the data folder, as below.

data
├── gt.txt
└── test
    ├── word_1.png
    ├── word_2.png
    ├── word_3.png
    └── ...
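
For illustration, here is a minimal Python sketch of writing `gt.txt` for a folder laid out like this. It assumes each line pairs an image path (relative to the data folder) with its label, separated by a tab, which is the format as I understand the README to describe it. The `write_gt_file` helper and the example labels are hypothetical and not part of the repository.

```python
from pathlib import Path


def write_gt_file(data_dir, labels):
    """Write data_dir/gt.txt with one 'relative/path<TAB>label' line per image.

    `labels` maps an image path relative to data_dir (for example
    'test/word_1.png') to the text shown in that image.
    Assumption: tab-separated lines, one image per line.
    """
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    gt_path = data_dir / "gt.txt"
    with open(gt_path, "w", encoding="utf-8") as f:
        for rel_path, label in labels.items():
            f.write(f"{rel_path}\t{label}\n")
    return gt_path
```

Calling `write_gt_file("data", {"test/word_1.png": "12345"})` would produce a `data/gt.txt` whose single line is the image path, a tab, and the label.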

You can generate the datasets with your fonts by using text generation engines such as MJSynth and SynthText.

Hope it helps.

Best.