clovaai / deep-text-recognition-benchmark

Text recognition (optical character recognition) with deep learning methods, ICCV 2019
Apache License 2.0
3.71k stars 1.09k forks

ValueError: num_samples should be a positive integer value, but got num_samples=0 #290

Open youngsirsk opened 3 years ago

youngsirsk commented 3 years ago

When I create an LMDB dataset from custom data (e.g. images generated with trdg that contain multiple words each), I follow the guide to create the dataset and then train with the command below:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 train.py \
--exp_name CRNN_CTC_demo \
--select_data / \
--batch_ratio 1 \
--train_data lmdb_dataset/train/ --valid_data lmdb_dataset/val/ \
--Transformation TPS --FeatureExtraction VGG --SequenceModeling BiLSTM --Prediction CTC

This is what is printed on the screen:

------ Use multi-GPU setting ------
if you stuck too long time with multi-GPU setting, try to set --workers 0
Filtering the images containing characters which are not in opt.character
Filtering the images whose label is longer than opt.batch_max_length

dataset_root: lmdb_dataset/train/
opt.select_data: ['/']
opt.batch_ratio: ['1']

dataset_root: lmdb_dataset/train/ dataset: /
sub-directory: /. num samples: 0
num total samples of /: 0 x 1.0 (total_data_usage_ratio) = 0
num samples of / per batch: 768 x 1.0 (batch_ratio) = 768
Traceback (most recent call last):
  File "train.py", line 317, in <module>
    train(opt)
  File "train.py", line 31, in train
    train_dataset = Batch_Balanced_Dataset(opt)
  File "/home/yxy/deep-text-recognition-benchmark/dataset.py", line 67, in __init__
    collate_fn=_AlignCollate, pin_memory=True)
  File "/home/yxy/anaconda3/envs/crnn_ctc/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 262, in __init__
    sampler = RandomSampler(dataset, generator=generator)  # type: ignore
  File "/home/yxy/anaconda3/envs/crnn_ctc/lib/python3.6/site-packages/torch/utils/data/sampler.py", line 104, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0

The log shows an error while loading the data: the training set ends up with 0 samples, and I have no idea why or how that happens. When I train on images that each contain a single word, it works well.

Works: a single word, for example "code".
Bug: multiple words, for example "I love coding".

By the way, I found that labels with special characters between the words cause the same error (e.g. a_b_c_d_e).

I have no idea how to fix the problem. Can anyone help me? Thanks a lot!

kimlia545 commented 3 years ago

You need to check this part.

186

train.py line 25
print('Filtering the images containing characters which are not in opt.character')
print('Filtering the images whose label is longer than opt.batch_max_length')

dataset.py line 168

By default, images containing characters which are not in opt.character are filtered.

line 190

We only train and evaluate on alphanumerics (or pre-defined character set in train.py)
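To see why this filter leaves 0 samples for the multi-word data, here is a minimal sketch of a filter of this kind. It is not the repository's code. The names `is_filtered` and `missing_characters` are hypothetical helpers, and the defaults (lowercase alphanumerics as the character set, a maximum label length of 25, labels lowercased unless case-sensitive) are assumptions that should be checked against your own `train.py` and `dataset.py`.

```python
import re

# Assumption: default character set is lowercase alphanumerics only.
DEFAULT_CHARACTER = "0123456789abcdefghijklmnopqrstuvwxyz"


def is_filtered(label, character=DEFAULT_CHARACTER, batch_max_length=25, sensitive=False):
    """Return True if a filter of this kind would drop the label (hypothetical helper)."""
    if len(label) > batch_max_length:
        return True
    text = label if sensitive else label.lower()
    # Match any character that is NOT in the allowed set.
    # re.escape is used here so symbols in the set are treated literally.
    out_of_char = "[^" + re.escape(character) + "]"
    return re.search(out_of_char, text) is not None


def missing_characters(labels, character=DEFAULT_CHARACTER, sensitive=False):
    """Collect every character used in the labels that the character set lacks."""
    allowed = set(character)
    missing = set()
    for label in labels:
        text = label if sensitive else label.lower()
        missing.update(ch for ch in text if ch not in allowed)
    return sorted(missing)


labels = ["code", "I love coding", "a_b_c_d_e"]
for label in labels:
    print(repr(label), "filtered" if is_filtered(label) else "kept")
print("missing:", missing_characters(labels))
```

Under these assumptions "code" is kept, while "I love coding" and "a_b_c_d_e" are dropped because the space and the underscore are not in the character set. If every label in the dataset contains such a character, nothing survives the filter and the sampler is built on an empty dataset. The likely fix is to add the missing characters (here the space and `_`) to the character set used for training, for example through the `--character` option if your version of `train.py` exposes it, and to use the same set at evaluation time.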