Please refer to generate_abcnet_json.py:
recs[ix] = len(cV2)
You can post what you have changed; otherwise I cannot tell what you missed.
@Yuliang-Liu
Yes, I noticed recs[ix] = len(cV2).
In my case, len(cV2) == len(CTLABELS) == 4135 and _C.MODEL.BATEXT.VOC_SIZE == 4136, which covers all character classes.
I changed adet/config/defaults.py:
# ---------------------------------------------------------------------------- #
# BAText Options
# ---------------------------------------------------------------------------- #
_C.MODEL.BATEXT.VOC_SIZE = 4136
I also changed CTLABELS in adet/evaluation/text_evaluation.py so that it includes all English and Chinese characters:
filepath = './ch.json'
CTLABELS = []
with open(filepath, 'r') as f:
    data = json.load(f)
    for key, value in data.items():
        CTLABELS.append(value)

def ctc_decode(rec):
    # ctc decoding
    last_char = False
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            if last_char != c:
                s += CTLABELS[c]
                last_char = c
        elif c == 4135:
            s += u'口'
        else:
            last_char = False
    return s

def decode(rec):
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            s += CTLABELS[c]
        elif c == 4135:
            s += u'口'
    # print(s)
    return s
Other parts have not changed.
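A side note on the decoder above, offered as a sketch and not as a diagnosis of the zero E2E score: in Python `False == 0` is true, so a repeat-collapsing decoder that starts with `last_char = False` will silently skip class index 0 when it is the first prediction. The toy version below uses `None` as the sentinel instead. The four-entry `LABELS` list, the `BLANK` index, and the function name are assumptions for illustration only; the unknown-character branch ('口') is left out to keep it short.

```python
LABELS = ["a", "b", "c", "d"]  # toy stand-in for CTLABELS (assumption)
BLANK = len(LABELS)            # any index outside LABELS is treated as blank here


def ctc_greedy_decode(indices, labels=LABELS):
    """Collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None  # None never compares equal to an integer class index
    for idx in indices:
        idx = int(idx)
        if idx < len(labels):
            if idx != prev:
                out.append(labels[idx])
            prev = idx
        else:
            prev = None  # a blank resets the state, so "a, blank, a" gives "aa"
    return "".join(out)


# Class 0 comes first and must be kept.
print(ctc_greedy_decode([0, 0, BLANK, 0, 1, 1]))
# The comparison that makes a False sentinel risky:
print(False != 0)
```

Whether this matters in practice depends on which character sits at index 0 of the label list.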
Finally, the output is:
Calculated!
"E2E_RESULTS: precision: 0.0, recall: 0.0, hmean: 0"
"DETECTION_ONLY_RESULTS: precision: 0.875, recall: 0.3181818181818182, hmean: 0.4666666666666667"
[07/11 05:49:19 d2.engine.defaults]: Evaluation results for test in csv format:
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.0000,0.0000,0.0000
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.8750,0.3182,0.4667
[07/11 05:49:19 d2.utils.events]: eta: 0:00:02 iter: 19999 total_loss: 0.895 rec_loss: 0.289 loss_fcos_cls: 0.000 loss_fcos_loc: 0.009 loss_fcos_ctr: 0.596 loss_fcos_bezier: 0.002 time: 2.0044 data_time: 0.0029 lr: 0.000050 max_mem: 2667M
[07/11 05:49:19 d2.engine.hooks]: Overall training speed: 19997 iterations in 11:08:05 (2.0046 s / it)
[07/11 05:49:19 d2.engine.hooks]: Total training time: 11:08:28 (0:00:23 on hooks)
I noticed that configs/BAText/Pretrain/attn_R_50.yaml uses 260000 iterations. My E2E_RESULTS is 0; is that because there were not enough iterations? Is it normal for it to still be 0 after 20000 iterations?
@Yuliang-Liu After 170000 iterations, the output result is:
Calculated!
"E2E_RESULTS: precision: 0.014097456328532026, recall: 0.008024188859169671, hmean: 0.010227146403824064"
"DETECTION_ONLY_RESULTS: precision: 0.9456532842987027, recall: 0.5382602628212583, hmean: 0.6860340163782561"
[07/15 09:30:10 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/15 09:30:10 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/15 09:30:10 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/15 09:30:10 d2.evaluation.testing]: copypaste: 0.0141,0.0080,0.0102
[07/15 09:30:10 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/15 09:30:10 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/15 09:30:10 d2.evaluation.testing]: copypaste: 0.9457,0.5383,0.6860
It seems that something else is causing such poor E2E_RESULTS. Have I missed anything else?
@Eurus-Holmes
Have you used Chinese synthetic data to pretrain before finetuning on the ReCTS dataset?
What is the batch size, and did you change the training scales?
The iterations are probably not enough because of the large number of classes and the class imbalance of Chinese text.
The result seems normal, but the rec_loss is still high, suggesting it has not converged very well. Also, can you visualize some results and post them here?
@Yuliang-Liu
I have not used Chinese synthetic data to pretrain. Currently, my IMS_PER_BATCH is 16, the dataset has 17k training images and 3k test images, and I train on 8 GPUs.
I changed BASE_LR to 0.005; after 140000 iterations, the output is:
[07/16 06:37:29] d2.engine.defaults INFO: Evaluation results for ReCTS_test in csv format:
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: Task: E2E_RESULTS
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: precision,recall,hmean
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: 0.1141,0.0944,0.1033
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: Task: DETECTION_ONLY_RESULTS
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: precision,recall,hmean
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: 0.8816,0.7293,0.7983
I think more iterations will work, thanks for your help!
@Yuliang-Liu Hi, I have done 300000 iters, but the final result is not ideal:
Calculated!
"E2E_RESULTS: precision: 0.11593286988273295, recall: 0.09599953482963135, hmean: 0.1050287859028595"
"DETECTION_ONLY_RESULTS: precision: 0.8897549329401025, recall: 0.7367717176415862, hmean: 0.806068895321098"
[07/17 06:10:36 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/17 06:10:36 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/17 06:10:36 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/17 06:10:36 d2.evaluation.testing]: copypaste: 0.1159,0.0960,0.1050
[07/17 06:10:36 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/17 06:10:36 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/17 06:10:36 d2.evaluation.testing]: copypaste: 0.8898,0.7368,0.8061
[07/17 06:10:36 d2.utils.events]: eta: 0:00:00 iter: 299999 total_loss: 0.945 rec_loss: 0.282 loss_fcos_cls: 0.009 loss_fcos_loc: 0.044 loss_fcos_ctr: 0.600 loss_fcos_bezier: 0.009 time: 0.5128 data_time: 0.0285 lr: 0.000500 max_mem: 7644M
[07/17 06:10:36 d2.engine.hooks]: Overall training speed: 299997 iterations in 1 day, 18:43:56 (0.5128 s / it)
[07/17 06:10:36 d2.engine.hooks]: Total training time: 1 day, 19:32:42 (0:48:45 on hooks)
I also noticed that rec_loss looks very strange:
It changes periodically and clearly has not converged normally. What happened to it?
What's more, the current code does not seem to set a convergence condition, only max_iter. I think additional convergence conditions would help training.
@Yuliang-Liu
Another question: in the output folder, is model_final.pth the best model or the last model?
If it is the best model, under what condition is it saved: precision, recall, or hmean? There seems to be no code that explains this.
@Eurus-Holmes
Q1: Try using a smaller learning rate? Yes, only max_iter is used.
Q2: The last model, to avoid using the test set for validation.
@Yuliang-Liu
Thanks! I'll try a smaller lr. But if there is only max_iter, how can I know that my model is the best result? In other words, how can I know the algorithm has already converged? That is why I asked whether model_final.pth is the best model or the last model. I have not actually found the code where model_final.pth is saved.
@Eurus-Holmes It's all based on empiricism.
Fine, thanks.
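One way to make the "best or last model" choice above less ad hoc is to evaluate each periodic checkpoint on a held-out validation split (not the test set) and compare the scores afterwards. The sketch below assumes such a list of (iteration, hmean) pairs already exists; the function names, the `patience` and `min_delta` parameters, and the numbers are all hypothetical and not part of the repository.

```python
def best_checkpoint(history):
    """Return the (iteration, hmean) pair with the highest validation hmean."""
    return max(history, key=lambda item: item[1])


def has_plateaued(history, patience=3, min_delta=0.001):
    """True if the last `patience` evaluations failed to beat the earlier best by min_delta."""
    if len(history) <= patience:
        return False
    scores = [score for _, score in history]
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) < best_before + min_delta


# Made-up validation scores, one per saved checkpoint.
history = [(1000, 0.02), (2000, 0.08), (3000, 0.104),
           (4000, 0.103), (5000, 0.104), (6000, 0.1035)]
print(best_checkpoint(history))
print(has_plateaued(history))
```

A plateau check like this only says that the score stopped improving on the validation split; it does not say the score is good.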
@Eurus-Holmes I synthesized 130k bilingual images (Chinese and English) and trained on a single 2080 Ti for one day. Here are the results on the validation set (good on the validation set, but they generalize poorly to real data).
The rec_loss is similar to yours (about 0.3). I will test the quantitative results, use real data, and try training on a longer schedule. I will share what I find here, and it would be much appreciated if you could also share your experience.
@Yuliang-Liu Hi, thanks! That will be very helpful! I'll reopen this issue and share my experiences.
@Yuliang-Liu I made a dataset labeled in Chinese with labelme.
First, I ran python Bezier_generator2.py and got the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 81: invalid start byte, so I changed line 170 of Bezier_generator2.py from
with open(label,"r") as f:
to
with open(label,"r",encoding='gbk') as f:
There is also a problem when I run python generate_abcnet_json.py: every Chinese 'rec' in train.json is shown as 102 (102 = len(cV2), but I have changed cV2 to my own Chinese data). Have I missed anything else?
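If the label files come from mixed sources, hard-coding a single encoding may fix one file and break another. Below is a minimal sketch, assuming every label file is either UTF-8 or GBK; the helper name `read_label_text` and the demo file are hypothetical and not part of the repository. It tries each encoding in order and reports which one worked, so that the same encoding can be used consistently in later steps.

```python
import os
import tempfile


def read_label_text(path, encodings=("utf-8", "gbk")):
    """Try each encoding in order and return (text, encoding_used)."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read(), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo: write a GBK-encoded label file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    label_path = os.path.join(tmp, "label.txt")
    with open(label_path, "w", encoding="gbk") as f:
        f.write("青椒")
    text, used = read_label_text(label_path)
    print(text, used)
```

Whether an encoding mismatch also explains the 'rec' values of 102 is not confirmed here; printing a few decoded transcriptions next to the entries of cV2 would be a quick check.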
@Yuliang-Liu
Hi, I changed my lr to 0.0001; however, the recognition module still does not give a good result.
After 500000 iterations:
Calculated!
"E2E_RESULTS: precision: 0.11384842585755521, recall: 0.09861611815327363, hmean: 0.10568624396323414"
"DETECTION_ONLY_RESULTS: precision: 0.8717862656910788, recall: 0.7551459472031632, hmean: 0.8092849353481851"
[07/20 11:06:57 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/20 11:06:57 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/20 11:06:57 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/20 11:06:57 d2.evaluation.testing]: copypaste: 0.1138,0.0986,0.1057
[07/20 11:06:57 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/20 11:06:57 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/20 11:06:57 d2.evaluation.testing]: copypaste: 0.8718,0.7551,0.8093
At the end of training:
total_loss: 0.891 rec_loss: 0.249 loss_fcos_cls: 0.002 loss_fcos_loc: 0.035 loss_fcos_ctr: 0.603 loss_fcos_bezier: 0.007 time: 0.5092 data_time: 0.0271 lr: 0.000100 max_mem: 7386M
I visualized the E2E_RESULTS hmean curve:
It seems to have converged, but why is the hmean value so poor?
Some test results on real data (from ICDAR 2019 datasets):
Regarding the labelme dataset question above (the UnicodeDecodeError in Bezier_generator2.py and every Chinese 'rec' in train.json showing as 102): it seems that there is a problem with your encoding format.
@Eurus-Holmes I achieved results similar to yours. There are four possible reasons:
- Metric. Chinese recognition performance is usually evaluated with the 1-NED metric, but this implementation uses word accuracy, i.e., a single character error results in a false positive. For example, "百鸭传奇" does not match "匠鸭传奇", so it is a false positive.
- Decoder. The attention mechanism is known to be good at learning semantic context [1]; in our previous experience it is better than CTC on English but worse on Chinese. The ReCTS dataset also provides character-level bounding boxes, which should be useful for Chinese recognition [2].
- Sequence. Have you checked the annotation order for instances like "青椒"? I mean, are the interpolated points on the short side or the long side? ReCTS has many such instances, which might be a problem.
- Data. As Chinese text recognition is much more difficult than English, the amount of training data may not be enough. We usually train an independent recognition model with a significantly larger amount of data.
[1] Wan, Zhaoyi, et al. "On Vocabulary Reliance in Scene Text Recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[2] Xing, Linjie, et al. "Convolutional Character Networks." Proceedings of the IEEE International Conference on Computer Vision. 2019: 9126-9136.
Hi @Yuliang-Liu. I have the same problem as in point 4 (Data): the recognition branch is not robust enough for Chinese recognition. So I want to add an independent recognition model that uses the output of BezierAlign, but my BezierAlign output is as follows:
and the original is:
I want results like those in your paper for handling curved text:
Could you give me some advice? Thanks.
@lzneu We have provided a BezierAlign example there. You can follow the same approach to create one based on this version.
@Yuliang-Liu Thanks so much for your advice! I'll make some improvements based on these possible reasons.
Thanks
@Yuliang-Liu Based on these possible reasons, I made some changes.
1. I added the 1-NED metric:
def cal_sim(str1, str2):
    """
    Normalized Edit Distance metric (1-N.E.D specifically)
    """
    m = len(str1) + 1
    n = len(str2) + 1
    matrix = np.zeros((m, n))
    for i in range(m):
        matrix[i][0] = i
    for j in range(n):
        matrix[0][j] = j
    for i in range(1, m):
        for j in range(1, n):
            if str1[i - 1] == str2[j - 1]:
                matrix[i][j] = matrix[i - 1][j - 1]
            else:
                matrix[i][j] = min(matrix[i - 1][j - 1], min(matrix[i][j - 1], matrix[i - 1][j])) + 1
    lev = matrix[m - 1][n - 1]
    if (max(m - 1, n - 1)) == 0:
        sim = 1.0
    else:
        sim = 1.0 - lev / (max(m - 1, n - 1))
    return sim

def include_in_dictionary_transcription(transcription):
...
...
...
matchedSum = 0
det_only_matchedSum = 0
Rectangle = namedtuple('Rectangle', 'xmin ymin xmax ymax')
gt = rrc_evaluation_funcs.load_zip_file(gtFilePath,evaluationParams['GT_SAMPLE_NAME_2_ID'])
subm = rrc_evaluation_funcs.load_zip_file(submFilePath,evaluationParams['DET_SAMPLE_NAME_2_ID'],True)
numGlobalCareGt = 0;
numGlobalCareDet = 0;
det_only_numGlobalCareGt = 0;
det_only_numGlobalCareDet = 0;
arrGlobalConfidences = [];
arrGlobalMatches = [];
Normalized_ED = 0
total_num = 0

for gtNum in range(len(gtPols)):
    for detNum in range(len(detPols)):
        if gtRectMat[gtNum] == 0 and detRectMat[detNum] == 0 and gtNum not in gtDontCarePolsNum and detNum not in detDontCarePolsNum :
            if iouMat[gtNum,detNum]>evaluationParams['IOU_CONSTRAINT']:
                gtRectMat[gtNum] = 1
                detRectMat[detNum] = 1
                #detection matched only if transcription is equal
                # det_only_correct = True
                # detOnlyCorrect += 1
                if evaluationParams['WORD_SPOTTING']:
                    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
                    if edd<=0:
                        correct = True
                    else:
                        correct = False
                    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
                else:
                    try:
                        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True
                        sim = cal_sim(detTrans[detNum].upper(), gtTrans[gtNum].upper())
                        Normalized_ED += sim
                    except: # empty
                        correct = False
                detCorrect += (1 if correct else 0)
                if correct:
                    detMatchedNums.append(detNum)

Normalized_ED = 0 if Normalized_ED == 0 else float(Normalized_ED) / total_num
However, the 1-NED result is similar to `hmean`:
<img width="679" alt="Screen Shot 2020-07-22 at 3 38 56 PM" src="https://user-images.githubusercontent.com/34226570/88148644-7d8bd000-cc31-11ea-9265-2a965e9fc3e3.png">
2. I changed the decoder to the CTC module, but it does not seem to be better than `attention`:
"E2E_RESULTS: precision: 0.00013580498404291438, recall: 0.00011629259216187928, hmean: 0.00012529365700861392"
"DETECTION_ONLY_RESULTS: precision: 0.8812385414544713, recall: 0.7546226305384347, hmean: 0.8130305403288959"
[07/22 07:43:00 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.0001,0.0001,0.0001
[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.8812,0.7546,0.8130
[07/22 07:43:00 d2.utils.events]: eta: 10:01:18 iter: 190999 total_loss: 1.071 rec_loss: 0.425 loss_fcos_cls: 0.002 loss_fcos_loc: 0.035 loss_fcos_ctr: 0.600 loss_fcos_bezier: 0.006 time: 0.3349 data_time: 0.0297 lr: 0.000010 max_mem: 5325M
@Eurus-Holmes
You can use lstn.distance(str1,str2) to calculate the edit distance.
What does the x-axis represent in your figure, and why is 1-NED lower than word accuracy?
For example, given str1='he' and str2='she', according to the 1-NED metric, 1 - ed(str1,str2) / max[length(str1), length(str2)], the 1-NED score is 0.67. For word accuracy, the hmean is zero. In this case, 1-NED is the final result, so you don't have to calculate another hmean (I guess the hmean you posted is not from word accuracy?).
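The 'he' / 'she' arithmetic can be checked with a few lines of plain Python. This is only a sketch of the formula 1 - ed / max(len) with a hand-written Levenshtein distance; the helper names `edit_distance` and `one_minus_ned` are made up for illustration and are not the functions used by the evaluation script. The second pair is the "百鸭传奇" example from earlier in the thread.

```python
def edit_distance(a, b):
    """Levenshtein distance using a rolling row; insert, delete and substitute cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def one_minus_ned(gt, pred):
    """1 - normalized edit distance; two empty strings count as a perfect match."""
    longest = max(len(gt), len(pred))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(gt, pred) / longest


for gt, pred in [("he", "she"), ("百鸭传奇", "匠鸭传奇")]:
    # Word accuracy only asks whether the strings are identical.
    print(gt, pred, round(one_minus_ned(gt, pred), 2), gt == pred)
```

Both pairs fail exact matching, yet they score 0.67 and 0.75 under 1-NED, which is why the two metrics can look so different on the same predictions.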
There might be some problems with the current ctc module in our implementation; it shouldn't be so bad.
To improve the recognition performance, you can also try using a deeper recognition branch and updating only the recognition branch with the existing Chinese data, similar to the figure below. (I haven't tried it yet.)
@Yuliang-Liu
Thanks for your advice! The x-axis represents each checkpoint result; I set CHECKPOINT_PERIOD: 1000, and the value is output after each inference.
I have not changed the hmean calculation, which is from your implementation, but I had set self._word_spotting = False.
Then I set self._word_spotting = True, which seems to achieve better results.
Calculated!
"E2E_RESULTS: precision: 0.36590131900341966, recall: 0.42971887550200805, hmean: 0.39525065963060685
"DETECTION_ONLY_RESULTS: precision: 0.8821482189163369, recall: 0.7516571694383067, hmean: 0.8116915735275649"
[07/23 06:07:32 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/23 06:07:32 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/23 06:07:32 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/23 06:07:32 d2.evaluation.testing]: copypaste: 0.3659,0.4297,0.3953
[07/23 06:07:32 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/23 06:07:32 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/23 06:07:32 d2.evaluation.testing]: copypaste: 0.8821,0.7517,0.8117
[07/23 06:07:32 d2.utils.events]: eta: 1 day, 2:32:25 iter: 111999 total_loss: 0.876 rec_loss: 0.225 loss_fcos_cls: 0.003 loss_fcos_loc: 0.036 loss_fcos_ctr: 0.602 loss_fcos_bezier: 0.007 time: 0.5125 data_time: 0.0270 lr: 0.000001 max_mem: 7233M
About the recognition branch, a deeper network should work. But I'm a little confused about updating only the recognition branch: do you mean training only the recognition branch and then using the resulting model as WEIGHTS to train the whole network again?
@Eurus-Holmes
"I have not changed the hmean calculation, which is from your implementation."
If you use 1-NED, you don't need to use hmean. That is, the average of the 1-NED scores of all instances is the final result.
"Do you mean training only the recognition branch and then using the resulting model as WEIGHTS to train the whole network again?"
No, what I mean is that you still train the whole model end-to-end as before. The only difference is to freeze the weights of the backbone and the detection branch.
@Yuliang-Liu Thanks for your advice about the recognition branch! But I'm still a little confused about the 1-NED metric. My code is as follows:
if evaluationParams['WORD_SPOTTING']:
    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
    if edd<=0:
        correct = True
    else:
        correct = False
    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
    Normalized_ED += edd / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))
else:
    try:
        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True
        Normalized_ED += edd / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))
    except: # empty
        correct = False
detCorrect += (1 if correct else 0)
...
...
...
Normalized_ED = 0 if Normalized_ED == 0 else 1 - float(Normalized_ED) / total_num
But the final result is:
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: Task: E2E_RESULTS
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: precision,recall,hmean,Normalized_ED
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: 0.2496,0.2903,0.2684,0.9856
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: Task: DETECTION_ONLY_RESULTS
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: precision,recall,hmean
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: 0.8824,0.7510,0.8114
[07/25 16:07:22] d2.utils.events INFO: eta: 0:00:00 iter: 299999 total_loss: 1.088 rec_loss: 0.451 loss_fcos_cls: 0.003 loss_fcos_loc: 0.037 loss_fcos_ctr: 0.600 loss_fcos_bezier: 0.006 time: 0.5275 data_time: 0.0287 lr: 0.000001 max_mem: 8354M
Why does Normalized_ED reach 0.9856 while hmean is only 0.2684? So strange.
@Yuliang-Liu
What's more, what exactly is the difference between True and False for self._word_spotting?
From adet/evaluation/text_eval_script.py:
if evaluationParams['WORD_SPOTTING']:
    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
    if edd<=0:
        correct = True
    else:
        correct = False
    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
else:
    try:
        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True
    except: # empty
        correct = False
detCorrect += (1 if correct else 0)
When edd<=0, doesn't gtTrans[gtNum].upper() match detTrans[detNum].upper() completely? That should be the same as transcription_match, right?
I evaluated the json result file offline following the evaluation_example_scripts, with self._word_spotting = True and self._word_spotting = False separately.
The results are as follows:
self._word_spotting = True
Calculated!
"E2E_RESULTS: precision: 0.3152732368052312, recall: 0.387263339070568, hmean: 0.34757981462409887"
"DETECTION_ONLY_RESULTS: precision: 0.8746584927034051, recall: 0.7632282823584138, hmean: 0.8151529265641981"
OrderedDict([('E2E_RESULTS', {'precision': 0.3152732368052312, 'recall': 0.387263339070568, 'hmean': 0.34757981462409887}), ('DETECTION_ONLY_RESULTS', {'precision': 0.8746584927034051, 'recall': 0.7632282823584138, 'hmean': 0.8151529265641981})])
self._word_spotting = False
Calculated!
"E2E_RESULTS: precision: 0.3386419670820284, recall: 0.29549947668333526, hmean: 0.31560316721006054"
"DETECTION_ONLY_RESULTS: precision: 0.8746584927034051, recall: 0.7632282823584138, hmean: 0.8151529265641981"
OrderedDict([('E2E_RESULTS', {'precision': 0.3386419670820284, 'recall': 0.29549947668333526, 'hmean': 0.31560316721006054}), ('DETECTION_ONLY_RESULTS', {'precision': 0.8746584927034051, 'recall': 0.7632282823584138, 'hmean': 0.8151529265641981})])
@Eurus-Holmes edd is not defined in the else: branch. Also, are you sure the / operator applied to edd outputs a float number, and is not an int divided by an int?
"Why Normalized_ED could achieve 0.9856? But hmean is only 0.2684? So strange."
Please print some examples to see why they are so different.
The implementation of wordspotting is from the ICDAR 2015 official evaluation code. You should be able to figure out the difference from the code.
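The two points about edd can be reproduced in isolation. The sketch below is a toy, not the evaluation script: the function names are hypothetical, and `edd` is deliberately left undefined. It shows that a bare `except:` swallows the resulting NameError and quietly marks an exact match as incorrect, while a narrower `except` lets the error surface. The last line shows that `/` on two ints yields a float under Python 3 (under Python 2 it would floor unless `from __future__ import division` is used).

```python
def match_with_bare_except(gt, det):
    """`edd` is used inside `try` but never assigned, as in the else-branch discussed."""
    try:
        correct = gt == det
        score = edd / max(len(gt), len(det))  # raises NameError: edd is undefined
    except:  # bare except hides the NameError
        correct = False
    return correct


def match_with_narrow_except(gt, det):
    """Same body, but only the expected failure (empty strings) is caught."""
    try:
        correct = gt == det
        score = edd / max(len(gt), len(det))
    except ZeroDivisionError:
        correct = False
    return correct


print(match_with_bare_except("he", "he"))  # an exact match is reported as not correct
try:
    match_with_narrow_except("he", "he")
except NameError as err:
    print("surfaced:", type(err).__name__)

print(1 / 3, 1 // 3)  # true division vs floor division in Python 3
```

Catching only the exception that is actually expected makes mistakes like an unassigned variable fail loudly during evaluation.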
@Yuliang-Liu
About edd, do you mean:
if evaluationParams['WORD_SPOTTING']:
    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
    if edd<=0:
        correct = True
    else:
        correct = False
    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
    Normalized_ED += float(edd) / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))
else:
    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
    try:
        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True
    except: # empty
        correct = False
    Normalized_ED += float(edd) / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))
About wordspotting, could you kindly provide a link to the ICDAR 2015 official evaluation code? I can't find it on the ICDAR 2015 homepage, and from adet/evaluation/text_eval_script.py there seems to be no difference between True and False.
Hi, I use this repo to detect and recognize Chinese words in my own dataset, and I have revised the code following the discussion above. However, I found that it detects the location of the words accurately but returns wrong recognition results (almost all of them are wrong). A sample image is shown below: Do I need to adjust additional configs, such as POOLER_RESOLUTION, POOLER_SCALES, or CANONICAL_SIZE? Looking forward to your reply!
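@youngboy52 Hello, which parts need to be modified to handle Chinese data? Could you explain? Your VOC_SIZE=6804: does it cover 6803 Chinese characters, and did you compile that list yourself?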
@Eurus-Holmes I achieve the similar results to yours. There are four possible reasons:
- Metric. Evaluating Chinese recognition performance usually adopts 1-NED metric, but we use word accuracy in this implementation, i.e., one character error will result in a false positive. For example, “百鸭传奇” is not matched to "匠鸭传奇“ so it is a false positive.
- Decoder. Attention mechanism is known to good at learning the semantic context [1], which is better than CTC on English but worse on Chinese based on our previous experiences. ReCT dataset also provides character-level bounding box which should be useful for Chinese recognition [2].
[1] Wan, Zhaoyi, et al. "On Vocabulary Reliance in Scene Text Recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. [2] Xing L, Tian Z, Huang W, et al. Convolutional character networks[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 9126-9136.
- Sequence. Have you checked the annotating sequence for "青椒" like instance. I mean, are the interpolating points on the short side or long side. ReCT has many such data, which might be a problem.
- Data. As Chinese text recognition problem is much difficult than English one, the number of the training data may not be enough. We usually trained an independent recognition model with a significantly large number of data.
@Yuliang-Liu According to these possible reasons, I made some change.
- I added the 1-NED metric,
def cal_sim(str1, str2): """ Normalized Edit Distance metric (1-N.E.D specifically) """ m = len(str1) + 1 n = len(str2) + 1 matrix = np.zeros((m, n)) for i in range(m): matrix[i][0] = i for j in range(n): matrix[0][j] = j for i in range(1, m): for j in range(1, n): if str1[i - 1] == str2[j - 1]: matrix[i][j] = matrix[i - 1][j - 1] else: matrix[i][j] = min(matrix[i - 1][j - 1], min(matrix[i][j - 1], matrix[i - 1][j])) + 1 lev = matrix[m - 1][n - 1] if (max(m - 1, n - 1)) == 0: sim = 1.0 else: sim = 1.0 - lev / (max(m - 1, n - 1)) return sim
def include_in_dictionary_transcription(transcription): ... ... ... matchedSum = 0 det_only_matchedSum = 0 Rectangle = namedtuple('Rectangle', 'xmin ymin xmax ymax') gt = rrc_evaluation_funcs.load_zip_file(gtFilePath,evaluationParams['GT_SAMPLE_NAME_2_ID']) subm = rrc_evaluation_funcs.load_zip_file(submFilePath,evaluationParams['DET_SAMPLE_NAME_2_ID'],True) numGlobalCareGt = 0; numGlobalCareDet = 0; det_only_numGlobalCareGt = 0; det_only_numGlobalCareDet = 0; arrGlobalConfidences = []; arrGlobalMatches = []; Normalized_ED = 0 total_num = 0
for gtNum in range(len(gtPols)): for detNum in range(len(detPols)): if gtRectMat[gtNum] == 0 and detRectMat[detNum] == 0 and gtNum not in gtDontCarePolsNum and detNum not in detDontCarePolsNum : if iouMat[gtNum,detNum]>evaluationParams['IOU_CONSTRAINT']: gtRectMat[gtNum] = 1 detRectMat[detNum] = 1 #detection matched only if transcription is equal # det_only_correct = True # detOnlyCorrect += 1 if evaluationParams['WORD_SPOTTING']: edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper()) if edd<=0: correct = True else: correct = False # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper() else: try: correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True sim = cal_sim(detTrans[detNum].upper(), gtTrans[gtNum].upper()) Normalized_ED += sim except: # empty correct = False detCorrect += (1 if correct else 0) if correct: detMatchedNums.append(detNum)
Normalized_ED = 0 if Normalized_ED == 0 else float(Normalized_ED) / total_num
However, the 1-NED result is similar to hmean.
- I changed the decoder to the CTC module, but it does not seem to be better than attention:
"E2E_RESULTS: precision: 0.00013580498404291438, recall: 0.00011629259216187928, hmean: 0.00012529365700861392"
"DETECTION_ONLY_RESULTS: precision: 0.8812385414544713, recall: 0.7546226305384347, hmean: 0.8130305403288959"
[07/22 07:43:00 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.0001,0.0001,0.0001
[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.8812,0.7546,0.8130
[07/22 07:43:00 d2.utils.events]: eta: 10:01:18 iter: 190999 total_loss: 1.071 rec_loss: 0.425 loss_fcos_cls: 0.002 loss_fcos_loc: 0.035 loss_fcos_ctr: 0.600 loss_fcos_bezier: 0.006 time: 0.3349 data_time: 0.0297 lr: 0.000010 max_mem: 5325M
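For readers who want to check the 1-N.E.D. numbers outside the evaluation script, the metric computed by cal_sim can be sketched in plain Python as below. This is only an illustration, not the script's own code: the function names one_minus_ned and mean_one_minus_ned are made up here, and because the posted excerpt does not show where total_num is incremented, averaging over the number of compared pairs is an assumption.

```python
def one_minus_ned(pred: str, gt: str) -> float:
    """Return 1 - (Levenshtein distance / longer length); 1.0 for two empty strings."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 1.0
    # Single rolling row of the edit-distance table.
    prev = list(range(len(gt) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, g in enumerate(gt, start=1):
            cost = 0 if p == g else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return 1.0 - prev[-1] / longest


def mean_one_minus_ned(pairs):
    """Average 1-N.E.D. over (prediction, ground truth) pairs; 0.0 if there are none.

    Assumption: the divisor is the number of compared pairs.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(one_minus_ned(p, g) for p, g in pairs) / len(pairs)
```

If the average is taken over matched detections only, it will track the recognition quality of those matches, so a value close to the end-to-end hmean would suggest that most matched transcriptions are almost entirely wrong.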
Hi, I ran into a situation where the CTC decoder always predicts the blank label during training. The rec_loss looks normal for several iterations and then drops sharply within a single iteration. Did you encounter this issue? Looking forward to your reply!
@Yuliang-Liu Yes, I noticed recs[ix] = len(cV2). In my case, I set len(cV2) == len(CTLABELS) == 4135 and _C.MODEL.BATEXT.VOC_SIZE == 4136, which covers all character classes.
I have changed adet/config/defaults.py:
# ---------------------------------------------------------------------------- #
# BAText Options
# ---------------------------------------------------------------------------- #
_C.MODEL.BATEXT.VOC_SIZE = 4136
adet/evaluation/text_evaluation.py, where CTLABELS now includes all English and Chinese characters:
filepath = './ch.json'
CTLABELS = []
with open(filepath, 'r') as f:
    data = json.load(f)
for key, value in data.items():
    CTLABELS.append(value)

def ctc_decode(rec):
    # ctc decoding
    last_char = False
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            if last_char != c:
                s += CTLABELS[c]
                last_char = c
        elif c == 4135:
            s += u'口'
        else:
            last_char = False
    return s

def decode(rec):
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            s += CTLABELS[c]
        elif c == 4135:
            s += u'口'
    # print(s)
    return s
Other parts have not changed.
In the end, the output is:
Calculated!
"E2E_RESULTS: precision: 0.0, recall: 0.0, hmean: 0"
"DETECTION_ONLY_RESULTS: precision: 0.875, recall: 0.3181818181818182, hmean: 0.4666666666666667"
[07/11 05:49:19 d2.engine.defaults]: Evaluation results for test in csv format:
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.0000,0.0000,0.0000
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.8750,0.3182,0.4667
[07/11 05:49:19 d2.utils.events]: eta: 0:00:02 iter: 19999 total_loss: 0.895 rec_loss: 0.289 loss_fcos_cls: 0.000 loss_fcos_loc: 0.009 loss_fcos_ctr: 0.596 loss_fcos_bezier: 0.002 time: 2.0044 data_time: 0.0029 lr: 0.000050 max_mem: 2667M
[07/11 05:49:19 d2.engine.hooks]: Overall training speed: 19997 iterations in 11:08:05 (2.0046 s / it)
[07/11 05:49:19 d2.engine.hooks]: Total training time: 11:08:28 (0:00:23 on hooks)
I noticed that the number of iterations is 260000 in configs/BAText/Pretrain/attn_R_50.yaml. My E2E_RESULTS is 0. Is that because I have not trained for enough iterations? It is still 0 after 20000 iterations. Is that normal?
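A side note on the ctc_decode function quoted above: it initialises last_char to False, and in Python False == 0, so a character at index 0 that appears at the start of a sequence or right after a blank is skipped. The sketch below shows the same greedy decoding with None as the sentinel. The function name ctc_greedy_decode and its parameters are illustrative, and the index layout (characters first, then one "unknown" class, then blank) is taken from the posted code rather than verified against the library.

```python
def ctc_greedy_decode(rec, labels, unknown_char="口"):
    """Collapse repeated indices and drop blanks from a sequence of class indices.

    Assumed index layout (copied from the posted code, not from the library):
      0 .. len(labels) - 1 -> characters in `labels`
      len(labels)          -> the "unknown" class
      anything larger      -> CTC blank
    """
    unknown = len(labels)
    last = None  # None, not False, so that class 0 is never confused with "no previous character"
    out = []
    for c in rec:
        c = int(c)
        if c < unknown:
            if c != last:
                out.append(labels[c])
                last = c
        elif c == unknown:
            out.append(unknown_char)
        else:
            last = None  # a blank resets the repeat filter
    return "".join(out)
```

With a toy list labels = ["a", "b", "c"], the sequence [4, 0] decodes to "a" here, whereas the False-initialised version would return an empty string. This does not by itself explain an E2E score of 0, but it removes one source of silently dropped characters.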
You definitely saved my life! Thank you for clarification on how to train the network on the custom dataset!
@Eurus-Holmes @Yuliang-Liu Hello, sorry to bother you.
When I trained on the ICDAR 2017 RCTW dataset, after some iterations I got this error: "ValueError: cannot convert float NaN to integer".
I ran into the same problem:
File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
data = fetcher.fetch(index)
File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/common.py", line 43, in __getitem__
data = self._map_func(self._dataset[cur_idx])
File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/utils/serialize.py", line 23, in __call__
return self._obj(*args, **kwargs)
File "/home/site-packages/adet/data/dataset_mapper.py", line 133, in __call__
transforms = aug_input.apply_augmentations(self.augmentation)
File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation.py", line 347, in apply_augmentations
return AugmentationList(augmentations)(self)
File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation.py", line 264, in __call__
tfm = x(aug_input)
File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation.py", line 165, in __call__
tfm = self.get_transform(*args)
File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation_impl.py", line 176, in get_transform
neww = int(neww + 0.5)
ValueError: cannot convert float NaN to integer
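@anruirui I found the cause. In my case the images were too small, so cropping could lead to a division by zero. I therefore turned cropping off, i.e. INPUT.CROP.ENABLED: False. You can check whether your problem is the same as mine.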
@NO2-yh Thank you.
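Besides setting INPUT.CROP.ENABLED: False, it may help to find out which images are small enough to trigger the problem. The helper below is a hypothetical sketch: it assumes each dataset record is a dict with "width" and "height" keys, and the 32-pixel threshold is an arbitrary placeholder, not a value taken from this thread or from the library.

```python
def split_by_min_size(dataset_dicts, min_side=32):
    """Separate records whose shorter side is below `min_side` pixels.

    Assumptions: each record is a dict with integer "width" and "height" keys,
    and `min_side` is a placeholder threshold to tune for your own data.
    """
    keep, too_small = [], []
    for record in dataset_dicts:
        shorter = min(record["width"], record["height"])
        (keep if shorter >= min_side else too_small).append(record)
    return keep, too_small
```

Inspecting the too_small list shows whether tiny images are really present before deciding between dropping them and disabling cropping for the whole run.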
@Yuliang-Liu Hi, regarding the ABCNet experimental results on CTW1500 in your paper: "Because the occupation of Chinese text in this dataset is very small, we directly regard all the Chinese text as “unseen” class during training, i.e., the 96-th class." However, if the proportion of Chinese text in a dataset is not negligible, we should enlarge CTLABELS instead of:
In that case, after enlarging CTLABELS, why can I still not recognize Chinese text in the dataset? Have I missed anything else?
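One thing that may be worth checking here is whether the training annotations were regenerated with the enlarged character list. If they were built with the old list, every Chinese character is still stored as the "unseen" index regardless of what CTLABELS contains at evaluation time. The sketch below illustrates the kind of mapping implied by recs[ix] = len(cV2) mentioned earlier in this thread. The function name encode_transcription, the padding value len(vocab) + 1, and the maximum length of 25 are assumptions made for this example.

```python
def encode_transcription(text, vocab, max_len=25):
    """Map a string to a fixed-length list of class indices.

    Layout assumed for illustration:
      known character  -> its position in `vocab`
      unseen character -> len(vocab)      (mirrors recs[ix] = len(cV2))
      padding          -> len(vocab) + 1  (assumption)
    """
    index = {ch: i for i, ch in enumerate(vocab)}
    rec = [len(vocab) + 1] * max_len
    for i, ch in enumerate(text[:max_len]):
        rec[i] = index.get(ch, len(vocab))
    return rec
```

Encoding the same transcription with the old and the new vocabulary makes the difference visible: with an English-only list a Chinese character maps to the unseen index, while with the enlarged list it gets its own class.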