clovaai / deep-text-recognition-benchmark

Text recognition (optical character recognition) with deep learning methods, ICCV 2019
Apache License 2.0
3.69k stars 1.08k forks

Retrain EasyOCR for non-Latin language #356

Open ftmasadi opened 1 year ago

ftmasadi commented 1 year ago

Hello, I encountered the error below when I tried to retrain the network on my own dataset, which is in Farsi (Persian). I used the trainer files for retraining. What is the reason for this error? When I follow the same steps with a Latin dataset, training runs without problems. Do I need to change some values for Persian or other non-Latin languages, and could that be the cause of this error? Thanks for your help.

ftmasadi commented 1 year ago

Duplicate of #356

Filtering the images containing characters which are not in opt.character
Filtering the images whose label is longer than opt.batch_max_length

dataset_root: all_data
opt.select_data: ['en_train_filtered']
opt.batch_ratio: ['1']

dataset_root: all_data    dataset: en_train_filtered
all_data/en_train_filtered
sub-directory: /en_train_filtered    num samples: 0
num total samples of en_train_filtered: 0 x 1.0 (total_data_usage_ratio) = 0
num samples of en_train_filtered per batch: 32 x 1.0 (batch_ratio) = 32

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_9016\1418931287.py in <cell line: 2>()
      1 opt = get_config("config_files/en_filtered_config.yaml")
----> 2 train(opt, amp=False)

~\Desktop\shotor-dataset\Shotor_Images\EasyOCR-master\trainer\train.py in train(opt, show_number, amp)
     38     opt.select_data = opt.select_data.split('-')
     39     opt.batch_ratio = opt.batch_ratio.split('-')
---> 40     train_dataset = Batch_Balanced_Dataset(opt)
     41
     42     log = open(f'./saved_models/{opt.experiment_name}/log_dataset.txt', 'a', encoding="utf8")

~\Desktop\shotor-dataset\Shotor_Images\EasyOCR-master\trainer\dataset.py in __init__(self, opt)
     75             Total_batch_size += _batch_size
     76
---> 77             _data_loader = torch.utils.data.DataLoader(
     78                 _dataset, batch_size=_batch_size,
     79                 shuffle=True,

~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\dataloader.py in __init__(self, dataset, batch_size, shuffle, sampler, batch_sampler, num_workers, collate_fn, pin_memory, drop_last, timeout, worker_init_fn, multiprocessing_context, generator, prefetch_factor, persistent_workers, pin_memory_device)
    351             else:  # map-style
    352                 if shuffle:
--> 353                     sampler = RandomSampler(dataset, generator=generator)  # type: ignore[arg-type]
    354                 else:
    355                     sampler = SequentialSampler(dataset)  # type: ignore[arg-type]

~\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\data\sampler.py in __init__(self, data_source, replacement, num_samples, generator)
    105
    106         if not isinstance(self.num_samples, int) or self.num_samples <= 0:
--> 107             raise ValueError("num_samples should be a positive integer "
    108                              "value, but got num_samples={}".format(self.num_samples))
    109

ValueError: num_samples should be a positive integer value, but got num_samples=0
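
The log reports `num samples: 0` right after the two filtering messages, so it looks like every label was dropped before the `DataLoader` was built. One possible cause, not confirmed here, is that the character set in the config contains only Latin characters, so every Farsi label is rejected. Below is a minimal sketch for checking this outside the trainer. The function name `count_surviving_labels`, the sample labels, and the sample character set are hypothetical and only illustrate the two filters named in the log; the trainer's actual filtering code may differ in detail.

```python
def count_surviving_labels(labels, character, batch_max_length):
    """Apply the two filters named in the log to a list of label strings.

    Hypothetical helper: it mimics "characters not in opt.character" and
    "label longer than opt.batch_max_length", not the trainer's exact code.
    """
    allowed = set(character)
    kept = too_long = bad_chars = 0
    unknown = set()  # characters seen in labels but missing from `character`
    for label in labels:
        if len(label) > batch_max_length:
            too_long += 1
            continue
        missing = set(label) - allowed
        if missing:
            bad_chars += 1
            unknown |= missing
            continue
        kept += 1
    return {"kept": kept, "too_long": too_long,
            "bad_chars": bad_chars, "unknown": unknown}


# Assumed example values: a Latin-only character set and mixed labels.
latin_only = "0123456789abcdefghijklmnopqrstuvwxyz"
labels = ["سلام", "hello", "کتاب"]

before = count_surviving_labels(labels, latin_only, batch_max_length=34)
print(before["kept"], before["bad_chars"])

# Extending the character set with the missing characters keeps every label.
extended = latin_only + "".join(sorted(before["unknown"]))
after = count_surviving_labels(labels, extended, batch_max_length=34)
print(after["kept"])
```

If a check like this shows that no labels survive with the current settings, the character set (and possibly the maximum label length) in the YAML config would be the first thing to adjust for a Farsi dataset.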