clovaai / deep-text-recognition-benchmark

Text recognition (optical character recognition) with deep learning methods, ICCV 2019
Apache License 2.0

Where to put MJ-ST datasets #304

Open vladimirKa002 opened 2 years ago

vladimirKa002 commented 2 years ago

Hello. I have downloaded the MJSynth and SynthText datasets. I also have my own training dataset of ~300 images. To train the model, the article says to run this script:

CUDA_VISIBLE_DEVICES=0 python3 train.py \
--train_data my_dataset/training --valid_data my_dataset/validation \
--select_data MJ-ST --batch_ratio 0.5-0.5 \
--Transformation TPS --FeatureExtraction ResNet --SequenceModeling BiLSTM --Prediction Attn

But I am having trouble organizing the directories. Could someone please explain where I should place the MJ and ST datasets and my own lmdb dataset?

Bonnie-gift commented 2 years ago

Hi, have you solved this problem?

ketangangal commented 2 years ago

Facing the same issue.

dotrungkien3210 commented 2 years ago

These two datasets have different formats and different image naming, so their annotations are handled differently, which makes them very hard to combine. I would also like to see how you set up the folders and files.

devarshi16 commented 1 month ago

Sorry I am late to the party but I think I have figured this out. The instructions are given in the repo README.md but were a little confusing. Hopefully this helps others.

For the sake of completeness I'll start from converting our data to the lmdb dataset format, which is necessary.

Ensure that the data structure is like so,

LMDB_DATA
   |-------- training
   |-------- validation
MY_DATASETS
   |--------my_data1.txt
   |--------my_data2.txt
   |--------my_data1_val.txt
   |--------my_data2_val.txt
   |--------my_data1
   |            |--------img1.jpg
   |            |--------img2.jpg
   |            |--------img3.jpg
   |                : 
   |                :
   |--------my_data2
   |            |--------img1.jpg
   |            |--------img2.jpg
   |            |--------img3.jpg
   |                : 
   |                :
   |      :          
   |      : 
create_lmdb_dataset.py
train.py
 : 
 : 
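Before running the conversion, it can save a failed run to verify the layout is in place. A quick sanity-check sketch (the dataset names are just the examples used in this thread):

```python
# Sketch: report which of the files/folders that create_lmdb_dataset.py
# will need are missing, following the example tree above.
from pathlib import Path

def missing_paths(root=".", datasets=("my_data1", "my_data2")):
    expected = []
    for name in datasets:
        expected += [
            Path(root, "MY_DATASETS", f"{name}.txt"),      # training gt file
            Path(root, "MY_DATASETS", f"{name}_val.txt"),  # validation gt file
            Path(root, "MY_DATASETS", name),               # image folder
        ]
    return [str(p) for p in expected if not p.exists()]

# An empty list means everything needed for the commands below is present.
print(missing_paths())
```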

Example my_data1.txt content (create_lmdb_dataset.py expects each line as {imagepath}\t{label}, i.e. the image path and its text label separated by a tab):

my_data1/img1.jpg	label1
my_data1/img2.jpg	label2
: 
: 
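If your annotations live elsewhere, a short script can emit the gt file in that tab-separated format. A minimal sketch (the labels mapping here is a made-up placeholder, substitute your real annotations):

```python
# Sketch: write a gt file in the {imagepath}\t{label} format that
# create_lmdb_dataset.py parses. The labels dict is hypothetical.
from pathlib import Path

def write_gt(dataset_name, labels, out_txt):
    # labels: {"img1.jpg": "hello", ...} -- your own annotations
    with open(out_txt, "w", encoding="utf-8") as f:
        for img, label in sorted(labels.items()):
            f.write(f"{dataset_name}/{img}\t{label}\n")

Path("MY_DATASETS").mkdir(exist_ok=True)
write_gt("my_data1", {"img1.jpg": "hello", "img2.jpg": "world"},
         "MY_DATASETS/my_data1.txt")
```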

For creating the lmdb dataset you need to run the following for each dataset,

$ python3 create_lmdb_dataset.py --inputPath MY_DATASETS/ --gtFile MY_DATASETS/my_data1.txt --outputPath LMDB_DATA/training/my_data1
$ python3 create_lmdb_dataset.py --inputPath MY_DATASETS/ --gtFile MY_DATASETS/my_data2.txt --outputPath LMDB_DATA/training/my_data2

Make sure you create the validation sets as well; for this you'll need separate txt files:

$ python3 create_lmdb_dataset.py --inputPath MY_DATASETS/ --gtFile MY_DATASETS/my_data1_val.txt --outputPath LMDB_DATA/validation/my_data1
$ python3 create_lmdb_dataset.py --inputPath MY_DATASETS/ --gtFile MY_DATASETS/my_data2_val.txt --outputPath LMDB_DATA/validation/my_data2
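With many datasets, the four invocations above get repetitive; they can be generated in a loop. A small print-only sketch (dataset names are the examples from this thread):

```python
# Sketch: build the create_lmdb_dataset.py command line for each dataset
# and split, mirroring the invocations shown above (dry run, print only).
def lmdb_commands(datasets, input_path="MY_DATASETS/", out_root="LMDB_DATA"):
    cmds = []
    for name in datasets:
        for split, suffix in (("training", ""), ("validation", "_val")):
            cmds.append(
                "python3 create_lmdb_dataset.py"
                f" --inputPath {input_path}"
                f" --gtFile {input_path}{name}{suffix}.txt"
                f" --outputPath {out_root}/{split}/{name}"
            )
    return cmds

for cmd in lmdb_commands(["my_data1", "my_data2"]):
    print(cmd)
```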

Now you can start the training,

$ CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 python3 train.py \
--workers 0 \
--exp_name retrain1 \
--batch_size 32 \
--train_data LMDB_DATA \
--valid_data LMDB_DATA/validation \
--select_data my_data1-my_data2-my_data3 \
--batch_ratio 0.25-0.25-0.5 \
--Transformation TPS \
--FeatureExtraction ResNet \
--SequenceModeling BiLSTM \
--Prediction Attn
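For reference, --batch_ratio splits --batch_size across the selected datasets. Roughly like this (a sketch loosely modeled on the repo's Batch_Balanced_Dataset, not the exact code):

```python
# Per-dataset batch sizes for e.g. --batch_size 32 --batch_ratio 0.25-0.25-0.5;
# each dataset gets at least 1 sample per batch.
def split_batch(batch_size, ratios):
    return [max(round(batch_size * r), 1) for r in ratios]

print(split_batch(32, [0.25, 0.25, 0.5]))  # [8, 8, 16]
```

So with three datasets and ratios 0.25-0.25-0.5, each training batch draws 8, 8, and 16 samples from the respective lmdb datasets.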

Notice that for --train_data I didn't explicitly specify LMDB_DATA/training, only LMDB_DATA, whereas for --valid_data I supplied the validation subdirectory explicitly. This discrepancy could easily be addressed in the code base for uniform, intuitive usage. Another issue I observed: sometimes training fails during data loading, not because of the training command or the data supplied, but because of some unknown issue in how the code samples by --batch_ratio. Simply changing up the batch ratios makes training start as intended. I did not have time to make a fix in the code base.

I hope this helps.