A BIG problem with --opt.character and --data_filtering_off

clovaai / deep-text-recognition-benchmark

Text recognition (optical character recognition) with deep learning methods, ICCV 2019

Apache License 2.0

A BIG problem with --opt.character and --data_filtering_off #186

Open 2113vm opened 4 years ago

2113vm commented 4 years ago

I have trained a model on my custom dataset, which contains about 88k word images and labels. At some point I noticed that the model was training on only 40k images. The problem was that my alphabet contains special symbols, e.g. ][?!*^. As I found out later, part of the data was skipped during loading. The reason is how the filtering controlled by --data_filtering_off works: it uses re.search with the pattern f'[^{opt.character}]'. When the alphabet contains symbols that are special in a regular expression, your data can be skipped. You also can't add '\' before the special symbols, because then num_class becomes larger than it should be.
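To make the failure mode concrete, here is a minimal sketch of what happens when an alphabet with regex metacharacters is pasted into a character class unescaped. The alphabet, the labels and the helper `is_filtered` are hypothetical. Using `re.escape` only when building the pattern is one possible workaround under these assumptions, not the repository's own fix.

```python
import re

# Hypothetical alphabet containing regex metacharacters (not the real one from my dataset).
character = "abc][?!*^"

# Pattern built as described above: the alphabet is inserted unescaped.
# The first "]" closes the class early, so this becomes "[^abc]" followed by "[?!*^]".
naive_pattern = f"[^{character}]"

# Assumed workaround: escape the alphabet only while building the pattern,
# so the character string itself (and therefore the class count) is untouched.
escaped_pattern = f"[^{re.escape(character)}]"


def is_filtered(pattern: str, label: str) -> bool:
    """Return True if a label would be skipped by the out-of-alphabet check."""
    return re.search(pattern, label) is not None


for label in ["]!", "xyz", "ab?"]:
    print(repr(label),
          "naive:", is_filtered(naive_pattern, label),
          "escaped:", is_filtered(escaped_pattern, label))
```

With the naive pattern, the label "]!" is skipped although both symbols are in the alphabet, while "xyz" passes although none of its symbols are. With the escaped pattern the check behaves as intended for these labels.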

wolfryu commented 4 years ago

In my case, training without --data_filtering_off the model shows 60% ACC; training with --data_filtering_off, it shows 30% ACC ...

2113vm commented 4 years ago

I did improve my accuracy, but I did it differently. I added '\' before every special symbol in opt.character and did not use the --data_filtering_off flag. I can't guarantee that this is the correct way, because I could have made a mistake with a special symbol, or there could have been incorrect behavior with num_class. But I want to note that before fixing the bug, my model failed to predict part of the alphabet correctly, and the predictions were bad even for the rest of the alphabet. Accuracy was ~79%. After fixing the bug, accuracy was ~82%, and the predictions were far better. Maybe in your case the model has lower accuracy but makes more correct predictions, because it knows more symbols than your previous model.
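A small sketch of why putting '\' into opt.character itself can disturb num_class. It assumes the class count is derived from the individual characters of the alphabet string. The alphabets below are hypothetical.

```python
# Hypothetical alphabet with two regex-special symbols.
character = "ab]^"

# The same alphabet with a backslash written before each special symbol.
escaped_in_alphabet = "ab\\]\\^"

# Assumption: every character in the string is treated as one symbol of the alphabet.
plain_symbols = list(character)
escaped_symbols = list(escaped_in_alphabet)

print(len(plain_symbols))         # symbols in the original alphabet
print(len(escaped_symbols))       # the backslashes are counted as symbols too
print(len(set(escaped_symbols)))  # distinct symbols, with '\' as an extra one
```

Under that assumption the backslash becomes an extra (and repeated) entry in the alphabet, so escaping inside the alphabet string is not equivalent to escaping only the regular expression.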

freedom9393 commented 4 years ago

> I have trained a model on my custom dataset, which contains about 88k word images and labels. At some point I noticed that the model was training on only 40k images. The problem was that my alphabet contains special symbols, e.g. ][?!*^. As I found out later, part of the data was skipped during loading. The reason is how the filtering controlled by --data_filtering_off works: it uses re.search with the pattern f'[^{opt.character}]'. When the alphabet contains symbols that are special in a regular expression, your data can be skipped. You also can't add '\' before the special symbols, because then num_class becomes larger than it should be.

The authors mention that --data_filtering_off is for alphanumeric characters: check this link. That's why your training skipped special characters.

freedom9393 commented 4 years ago

> In my case, training without --data_filtering_off the model shows 60% ACC; training with --data_filtering_off, it shows 30% ACC ...

It's because --data_filtering_off filters alphanumeric characters and ignores special characters.