Experimental Results of ABCNet on English and Chinese text datasets

aim-uofa / AdelaiDet

AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.

https://git.io/AdelaiDet


Experimental Results of ABCNet on English and Chinese text datasets #144

Closed Eurus-Holmes closed 4 years ago

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Hi, about the ABCNet experimental results on CTW1500 in your paper: "Because the occupation of Chinese text in this dataset is very small, we directly regard all the Chinese text as “unseen” class during training, i.e., the 96-th class." However, if the proportion of Chinese text in a dataset is not negligible, we should enlarge CTLABELS beyond the default:

CTLABELS = [' ','!','"','#','$','%','&','\'','(',')','*','+',',','-','.','/','0','1','2','3','4','5','6','7','8','9',':',';','<','=','>','?','@','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','[','\\',']','^','_','`','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','{','|','}','~']

After enlarging CTLABELS, why can I still not recognize Chinese text in the dataset? Have I missed anything else?

Yuliang-Liu commented 4 years ago

Please refer to generate_abcnet_json.py :

recs[ix] = len(cV2)

Please post what you have changed; otherwise I cannot tell what you missed.
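For readers following along, the encoding convention behind recs[ix] = len(cV2) can be sketched as below. This is a minimal illustration, not the repository's script: the function name encode_transcription, the toy charset, and max_len are hypothetical, and the padding value len(charset) + 1 is an assumption inferred from the decode functions posted later in this thread.

```python
# Hypothetical helper illustrating the "unseen class" convention from this thread.
def encode_transcription(text, charset, max_len=25):
    """Map each character of `text` to its index in `charset`.

    Characters missing from `charset` get the "unseen" index len(charset);
    unused positions are padded with len(charset) + 1 (assumed padding value).
    """
    unknown = len(charset)
    pad = len(charset) + 1
    index = {ch: i for i, ch in enumerate(charset)}
    rec = [pad] * max_len
    for pos, ch in enumerate(text[:max_len]):
        rec[pos] = index.get(ch, unknown)
    return rec


charset = ['a', 'b', '青', '椒']  # toy stand-in for the real label list
print(encode_transcription('青椒x', charset, max_len=5))  # [2, 3, 4, 5, 5]
```

Under this convention the label list used for decoding must have the same order and length as the one used to build the JSON annotations; otherwise indices point at the wrong characters.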

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Yes, I noticed recs[ix] = len(cV2). In my case, I set len(cV2) == len(CTLABELS) == 4135 and _C.MODEL.BATEXT.VOC_SIZE == 4136, which covers all character classes.

I have changed adet/config/defaults.py

# ---------------------------------------------------------------------------- #
# BAText Options
# ---------------------------------------------------------------------------- #
_C.MODEL.BATEXT.VOC_SIZE = 4136

In adet/evaluation/text_evaluation.py, I changed CTLABELS to include all English and Chinese characters:

import json

filepath = './ch.json'
CTLABELS = []
with open(filepath, 'r') as f:
    data = json.load(f)
    for key, value in data.items():
        CTLABELS.append(value)

def ctc_decode(rec):
    # ctc decoding
    last_char = False
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            if last_char != c:
                s += CTLABELS[c]
                last_char = c
        elif c == 4135:
            s += u'口'
        else:
            last_char = False
    return s

def decode(rec):
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            s += CTLABELS[c]
        elif c == 4135:
            s += u'口'
    # print(s)
    return s

Other parts have not changed.

At last, the output result is:

Calculated!
"E2E_RESULTS: precision: 0.0, recall: 0.0, hmean: 0"
"DETECTION_ONLY_RESULTS: precision: 0.875, recall: 0.3181818181818182, hmean: 0.4666666666666667"
[07/11 05:49:19 d2.engine.defaults]: Evaluation results for test in csv format:
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.0000,0.0000,0.0000
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.8750,0.3182,0.4667
[07/11 05:49:19 d2.utils.events]:  eta: 0:00:02  iter: 19999  total_loss: 0.895  rec_loss: 0.289  loss_fcos_cls: 0.000  loss_fcos_loc: 0.009  loss_fcos_ctr: 0.596  loss_fcos_bezier: 0.002  time: 2.0044  data_time: 0.0029  lr: 0.000050  max_mem: 2667M
[07/11 05:49:19 d2.engine.hooks]: Overall training speed: 19997 iterations in 11:08:05 (2.0046 s / it)
[07/11 05:49:19 d2.engine.hooks]: Total training time: 11:08:28 (0:00:23 on hooks)

I noticed that configs/BAText/Pretrain/attn_R_50.yaml uses 260000 iterations. My E2E_RESULTS is 0; is that because there were not enough iterations? Is it normal for it to still be 0 after 20000 iterations?

Eurus-Holmes commented 4 years ago

@Yuliang-Liu After 170000 iterations, the output result is:

Calculated!
"E2E_RESULTS: precision: 0.014097456328532026, recall: 0.008024188859169671, hmean: 0.010227146403824064"
"DETECTION_ONLY_RESULTS: precision: 0.9456532842987027, recall: 0.5382602628212583, hmean: 0.6860340163782561"
[07/15 09:30:10 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/15 09:30:10 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/15 09:30:10 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/15 09:30:10 d2.evaluation.testing]: copypaste: 0.0141,0.0080,0.0102
[07/15 09:30:10 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/15 09:30:10 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/15 09:30:10 d2.evaluation.testing]: copypaste: 0.9457,0.5383,0.6860

It seems that something else is causing such low E2E_RESULTS. Have I missed anything else?

Yuliang-Liu commented 4 years ago

@Eurus-Holmes

Have you pretrained on Chinese synthetic data before fine-tuning on the ReCTS dataset?

What is your batch size, and did you change the training scales?

The number of iterations is probably insufficient, given the large number of classes and the class imbalance of Chinese text.

The result seems normal, but rec_loss is still high, suggesting that it has not converged well. Also, can you visualize some results and post them here?

Eurus-Holmes commented 4 years ago

@Yuliang-Liu I have not pretrained on Chinese synthetic data. Currently, my IMS_PER_BATCH is 16, the dataset includes 17k training images and 3k test images, and I train on 8 GPUs.

I changed BASE_LR to 0.005. After 140000 iterations, the output is:

[07/16 06:37:29] d2.engine.defaults INFO: Evaluation results for ReCTS_test in csv format:
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: Task: E2E_RESULTS
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: precision,recall,hmean
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: 0.1141,0.0944,0.1033
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: Task: DETECTION_ONLY_RESULTS
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: precision,recall,hmean
[07/16 06:37:29] d2.evaluation.testing INFO: copypaste: 0.8816,0.7293,0.7983

I think more iterations will work, thanks for your help!

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Hi, I have run 300000 iterations, but the final result is still not ideal:

Calculated!
"E2E_RESULTS: precision: 0.11593286988273295, recall: 0.09599953482963135, hmean: 0.1050287859028595"
"DETECTION_ONLY_RESULTS: precision: 0.8897549329401025, recall: 0.7367717176415862, hmean: 0.806068895321098"
[07/17 06:10:36 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:
[07/17 06:10:36 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/17 06:10:36 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/17 06:10:36 d2.evaluation.testing]: copypaste: 0.1159,0.0960,0.1050
[07/17 06:10:36 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/17 06:10:36 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/17 06:10:36 d2.evaluation.testing]: copypaste: 0.8898,0.7368,0.8061
[07/17 06:10:36 d2.utils.events]:  eta: 0:00:00  iter: 299999  total_loss: 0.945  rec_loss: 0.282  loss_fcos_cls: 0.009  loss_fcos_loc: 0.044  loss_fcos_ctr: 0.600  loss_fcos_bezier: 0.009  time: 0.5128  data_time: 0.0285  lr: 0.000500  max_mem: 7644M
[07/17 06:10:36 d2.engine.hooks]: Overall training speed: 299997 iterations in 1 day, 18:43:56 (0.5128 s / it)
[07/17 06:10:36 d2.engine.hooks]: Total training time: 1 day, 19:32:42 (0:48:45 on hooks)

And I noticed that rec_loss behaves very strangely:

It changes periodically and clearly has not converged normally. What happened?

What's more, the current code does not seem to set any convergence condition; it only sets max_iter. I think additional convergence conditions should be added to help training.

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Another question: in the output folder, is model_final.pth the best model or the last model? If it is the best model, under what condition is it saved: precision, recall, or hmean? There seems to be no relevant code that explains this.

Yuliang-Liu commented 4 years ago

@Eurus-Holmes Q1: Try using a smaller learning rate. And yes, training is controlled only by max_iter.

Q2: It is the last model. This avoids using the test set for validation.
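If a separate validation split is available, choosing a checkpoint without touching the test set could be sketched roughly as follows. This is only an illustration under stated assumptions: pick_checkpoint, patience, and min_delta are hypothetical names that do not exist in the repository, and the scores in history are made-up numbers, not results from this thread.

```python
def pick_checkpoint(history, patience=3, min_delta=0.0):
    """history: list of (iteration, validation_hmean) in training order.

    Returns (best_iteration, best_score, stop_iteration). stop_iteration is the
    evaluation at which `patience` consecutive checks failed to improve on the
    best score, or None if that never happened.
    """
    best_iter, best_score = None, float('-inf')
    stale = 0
    for iteration, score in history:
        if score > best_score + min_delta:
            best_iter, best_score = iteration, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return best_iter, best_score, iteration
    return best_iter, best_score, None


# Made-up validation scores, one per saved checkpoint.
history = [(1000, 0.02), (2000, 0.08), (3000, 0.10),
           (4000, 0.10), (5000, 0.09), (6000, 0.10)]
print(pick_checkpoint(history))  # (3000, 0.1, 6000)
```

The same bookkeeping can be applied offline to the per-checkpoint evaluation results, as long as the scores come from held-out validation data rather than the test set.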

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Thanks! I'll try a smaller lr. But if there is only max_iter, how can I know that my model has reached its best result? In other words, how can I know that training has already converged? That is why I asked whether model_final.pth is the best model or the last model. I have not actually found the code that saves model_final.pth.

Yuliang-Liu commented 4 years ago

@Eurus-Holmes It's all based on empiricism.

Eurus-Holmes commented 4 years ago

Fine, thanks.

Yuliang-Liu commented 4 years ago

@Eurus-Holmes I synthesized 130k bilingual images (Chinese and English) and trained on a single 2080 Ti for one day. Here are the results on the validation set. (They are good on the validation set but generalize poorly to real data.)

The rec_loss is similar to yours (about 0.3). I will test the quantitative results, use real data, and try training on a longer schedule. I will share what I find here, and it would be much appreciated if you could also share your experience.

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Hi, thanks! That will be very helpful! I'll reopen this issue and share my experiences.

barry-0214 commented 4 years ago

@Yuliang-Liu I made a dataset labeled in Chinese with labelme.

First, I ran python Bezier_generator2.py and got the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 81: invalid start byte, so I changed line 170 of Bezier_generator2.py from with open(label,"r") as f: to with open(label,"r",encoding='gbk') as f:

Then there is a problem when I run python generate_abcnet_json.py: every Chinese 'rec' in train.json is shown as 102 (102 = len(cV2), but I have changed cV2 to my own Chinese data). Have I missed anything else?

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Hi, I have changed my lr to 0.0001; however, the recognition module still does not give a good result.

After 500000 iterations:

Calculated!

"E2E_RESULTS: precision: 0.11384842585755521, recall: 0.09861611815327363, hmean: 0.10568624396323414"

"DETECTION_ONLY_RESULTS: precision: 0.8717862656910788, recall: 0.7551459472031632, hmean: 0.8092849353481851"

[07/20 11:06:57 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:

[07/20 11:06:57 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS

[07/20 11:06:57 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/20 11:06:57 d2.evaluation.testing]: copypaste: 0.1138,0.0986,0.1057

[07/20 11:06:57 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS

[07/20 11:06:57 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/20 11:06:57 d2.evaluation.testing]: copypaste: 0.8718,0.7551,0.8093

At last,

total_loss: 0.891  rec_loss: 0.249  loss_fcos_cls: 0.002  loss_fcos_loc: 0.035 loss_fcos_ctr: 0.603  loss_fcos_bezier: 0.007  time: 0.5092  data_time: 0.0271  lr: 0.000100  max_mem: 7386M

I visualized the E2E_RESULTS hmean curve:

It seems to have converged, but why is the hmean value so poor?

Some test results on real data (from ICDAR 2019 datasets): train_ReCTS_000001 train_ReCTS_000002

Eurus-Holmes commented 4 years ago

@barry-0214 It seems that there is a problem with your encoding format.
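One way an encoding mismatch could make every Chinese character fall into the len(cV2) class is sketched below. It is a toy reproduction under assumptions, not a diagnosis of the actual files: the charset, the sample text, and the helper to_indices are hypothetical.

```python
charset = ['青', '椒', '百', '鸭']   # toy stand-in for cV2
raw = '青椒'.encode('gbk')          # bytes as a GBK-encoded label file would store them


def to_indices(text, charset):
    # Characters not found in the charset fall back to len(charset).
    return [charset.index(ch) if ch in charset else len(charset) for ch in text]


# Decoding with the wrong codec either fails outright ...
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ... or silently yields characters that are not in the charset.
wrong = raw.decode('latin-1')
right = raw.decode('gbk')

print(utf8_ok)                     # False
print(to_indices(wrong, charset))  # [4, 4, 4, 4]
print(to_indices(right, charset))  # [0, 1]
```

If the label files and the character list are read with different encodings, no character can match, so checking that both are decoded consistently is a reasonable first step.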

Yuliang-Liu commented 4 years ago

@Eurus-Holmes I get results similar to yours. There are four possible reasons:

1. Metric. Chinese recognition performance is usually evaluated with the 1-NED metric, but this implementation uses word accuracy, i.e., a single character error turns the whole instance into a false positive. For example, "百鸭传奇" does not match "匠鸭传奇", so it counts as a false positive.
2. Decoder. The attention mechanism is known to be good at learning semantic context [1]; in our previous experience it is better than CTC on English but worse on Chinese. The ReCTS dataset also provides character-level bounding boxes, which should be useful for Chinese recognition [2].

[1] Wan, Zhaoyi, et al. "On Vocabulary Reliance in Scene Text Recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[2] Xing L, Tian Z, Huang W, et al. "Convolutional Character Networks." Proceedings of the IEEE International Conference on Computer Vision. 2019: 9126-9136.

3. Sequence. Have you checked the annotation order for instances like "青椒"? I mean, are the interpolated points on the short side or the long side? ReCTS has many such instances, which might be a problem.
4. Data. Because Chinese text recognition is much more difficult than English, the amount of training data may not be enough. We usually train an independent recognition model on a significantly larger amount of data.
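The gap between word accuracy and 1-NED described in the Metric point can be illustrated with a small, self-contained sketch. The helper names edit_distance and one_minus_ned are hypothetical, and the second pair is added only to make the average non-trivial.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_minus_ned(pred, gt):
    longest = max(len(pred), len(gt))
    return 1.0 if longest == 0 else 1.0 - edit_distance(pred, gt) / longest


pairs = [('匠鸭传奇', '百鸭传奇'), ('青椒', '青椒')]  # (prediction, ground truth)
word_acc = sum(p == g for p, g in pairs) / len(pairs)
mean_1ned = sum(one_minus_ned(p, g) for p, g in pairs) / len(pairs)
print(word_acc, mean_1ned)  # 0.5 0.875
```

The first pair scores 0 under word accuracy but 0.75 under 1-NED, which is why the two metrics can diverge sharply on long Chinese strings.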

lzneu commented 4 years ago

Hi @Yuliang-Liu. I have the same problem as in point 4 (Data): the recognition branch is not robust enough for Chinese recognition. So I want to add an independent recognition model that uses the output of BezierAlign, but my BezierAlign output is as follows:

and the original is:

I would like results like those in your paper for curved text:

Could you give me some advice? Thanks.

Yuliang-Liu commented 4 years ago

@lzneu We have provided a BezierAlign example there. You can follow the same approach to create one based on this version.

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Thank you very much for your advice! I'll make some improvements based on these possible reasons.

lzneu commented 4 years ago

Thanks

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Based on these possible reasons, I made some changes.

1. I added the 1-NED metric:


import numpy as np

def cal_sim(str1, str2):
    """
    Normalized Edit Distance metric (1-N.E.D specifically)
    """
    m = len(str1) + 1
    n = len(str2) + 1
    matrix = np.zeros((m, n))
    for i in range(m):
        matrix[i][0] = i

    for j in range(n):
        matrix[0][j] = j

    for i in range(1, m):
        for j in range(1, n):
            if str1[i - 1] == str2[j - 1]:
                matrix[i][j] = matrix[i - 1][j - 1]
            else:
                matrix[i][j] = min(matrix[i - 1][j - 1], min(matrix[i][j - 1], matrix[i - 1][j])) + 1

    lev = matrix[m - 1][n - 1]
    if (max(m - 1, n - 1)) == 0:
        sim = 1.0
    else:
        sim = 1.0 - lev / (max(m - 1, n - 1))
    return sim

def include_in_dictionary_transcription(transcription):
...
...
...
matchedSum = 0
det_only_matchedSum = 0

Rectangle = namedtuple('Rectangle', 'xmin ymin xmax ymax')

gt = rrc_evaluation_funcs.load_zip_file(gtFilePath,evaluationParams['GT_SAMPLE_NAME_2_ID'])
subm = rrc_evaluation_funcs.load_zip_file(submFilePath,evaluationParams['DET_SAMPLE_NAME_2_ID'],True)

numGlobalCareGt = 0;
numGlobalCareDet = 0;
det_only_numGlobalCareGt = 0;
det_only_numGlobalCareDet = 0;

arrGlobalConfidences = [];
arrGlobalMatches = [];

Normalized_ED = 0
total_num = 0

            for gtNum in range(len(gtPols)):
                for detNum in range(len(detPols)):
                    if gtRectMat[gtNum] == 0 and detRectMat[detNum] == 0 and gtNum not in gtDontCarePolsNum and detNum not in detDontCarePolsNum :
                        if iouMat[gtNum,detNum]>evaluationParams['IOU_CONSTRAINT']:
                            gtRectMat[gtNum] = 1
                            detRectMat[detNum] = 1
                            #detection matched only if transcription is equal
                            # det_only_correct = True
                            # detOnlyCorrect += 1
                            if evaluationParams['WORD_SPOTTING']:
                                edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
                                if edd<=0: 
                                    correct = True
                                else:
                                    correct = False
                                # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
                            else:
                                try:
                                    correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True
                                    sim = cal_sim(detTrans[detNum].upper(), gtTrans[gtNum].upper())
                                    Normalized_ED += sim
                                except: # empty
                                    correct = False
                            detCorrect += (1 if correct else 0)
                            if correct:
                                detMatchedNums.append(detNum)

Normalized_ED = 0 if Normalized_ED == 0 else float(Normalized_ED) / total_num


However, the 1-NED result is similar to hmean:

2. I changed the decoder to the CTC module, but it does not seem to be better than attention:

"E2E_RESULTS: precision: 0.00013580498404291438, recall: 0.00011629259216187928, hmean: 0.00012529365700861392"

"DETECTION_ONLY_RESULTS: precision: 0.8812385414544713, recall: 0.7546226305384347, hmean: 0.8130305403288959"

[07/22 07:43:00 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:

[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS

[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.0001,0.0001,0.0001

[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS

[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.8812,0.7546,0.8130

[07/22 07:43:00 d2.utils.events]: eta: 10:01:18 iter: 190999 total_loss: 1.071 rec_loss: 0.425 loss_fcos_cls: 0.002 loss_fcos_loc: 0.035 loss_fcos_ctr: 0.600 loss_fcos_bezier: 0.006 time: 0.3349 data_time: 0.0297 lr: 0.000010 max_mem: 5325M

Yuliang-Liu commented 4 years ago

@Eurus-Holmes You can use lstn.distance(str1,str2) to calculate the edit distance.

What does the x-axis represent in your figure, and why is 1-NED lower than word accuracy?

For example, given str1='he' and str2='she', the 1-NED metric, 1 - ed(str1,str2) / max[length(str1), length(str2)], gives a score of 0.67, whereas under word accuracy the pair counts as a miss, so the hmean is zero. In this case, 1-NED is the final result, so you don't have to calculate another hmean (I guess the hmean you posted is not from word accuracy?).

There might be some problems with the current CTC module in our implementation; the result shouldn't be this bad.

To improve recognition performance, you can also try a deeper recognition branch and update only the recognition branch, using existing Chinese data similar to the figure below. (I haven't tried it yet.)

0000778

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Thanks for your advice! The x-axis represents the result at each checkpoint; I set CHECKPOINT_PERIOD: 1000, and a result is output after each inference. I have not changed the hmean calculation, which comes from your implementation, but I set self._word_spotting = False.

Then I set self._word_spotting = True, which seems to give better results.

Calculated!

"E2E_RESULTS: precision: 0.36590131900341966, recall: 0.42971887550200805, hmean: 0.39525065963060685

"DETECTION_ONLY_RESULTS: precision: 0.8821482189163369, recall: 0.7516571694383067, hmean: 0.8116915735275649"

[07/23 06:07:32 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:

[07/23 06:07:32 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS

[07/23 06:07:32 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/23 06:07:32 d2.evaluation.testing]: copypaste: 0.3659,0.4297,0.3953

[07/23 06:07:32 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS

[07/23 06:07:32 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/23 06:07:32 d2.evaluation.testing]: copypaste: 0.8821,0.7517,0.8117

[07/23 06:07:32 d2.utils.events]:  eta: 1 day, 2:32:25  iter: 111999  total_loss: 0.876  rec_loss: 0.225  loss_fcos_cls: 0.003  loss_fcos_loc: 0.036  loss_fcos_ctr: 0.602  loss_fcos_bezier: 0.007  time: 0.5125  data_time: 0.0270  lr: 0.000001  max_mem: 7233M

As for the recognition branch, a deeper network should work. But I'm a little confused about updating only the recognition branch: do you mean training only the recognition branch and then using the resulting model as WEIGHTS to train the whole network again?

Yuliang-Liu commented 4 years ago

@Eurus-Holmes

I have not changed hmean calculate method, which is from your implementation,

If you use 1-NED, you don't need Hmean. That is, the average of the 1-NED scores over all instances is the final result.

are you mean only using the recognition branch to train and then get a model as WEIGHTS to train again the whole network?

No, what I mean is that you still train the whole model end-to-end as you did before. The only difference is that you freeze the weights of the backbone and the detection branch.
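The freezing idea can be sketched as follows. This is a hedged illustration rather than AdelaiDet code: freeze_except is a hypothetical helper, the parameter names are invented, and the FakeParam/FakeModel stand-ins exist only so the snippet runs without PyTorch. A torch.nn.Module exposes the same named_parameters() interface, and setting requires_grad to False is the usual way to stop a parameter from being updated.

```python
def freeze_except(model, trainable_prefixes):
    """Disable gradients for every parameter whose name does not start with
    one of `trainable_prefixes`; return the names that stay trainable."""
    prefixes = tuple(trainable_prefixes)
    kept = []
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
        if param.requires_grad:
            kept.append(name)
    return kept


# Tiny stand-ins so the sketch runs without PyTorch installed.
class FakeParam:
    def __init__(self):
        self.requires_grad = True


class FakeModel:
    def __init__(self, names):
        self._params = {name: FakeParam() for name in names}

    def named_parameters(self):
        return self._params.items()


# Invented parameter names; inspect the real model to find the actual prefixes.
model = FakeModel(['backbone.conv1.weight',
                   'proposal_generator.cls.weight',
                   'roi_heads.recognizer.rnn.weight'])
print(freeze_except(model, ['roi_heads.recognizer']))
# ['roi_heads.recognizer.rnn.weight']
```

With a real model, the optimizer should then only be given parameters that still require gradients, and printing the names returned by the helper is a quick check that only the recognition branch is being trained.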

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Thanks for your advice about the recognition branch! But I'm still a little confused about the 1-NED metric. My code is as follows:


                                if evaluationParams['WORD_SPOTTING']:
                                    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
                                    if edd<=0:
                                        correct = True
                                    else:
                                        correct = False
                                    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
                                    Normalized_ED += edd / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))
                                else:
                                    try:
                                        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True

                                        Normalized_ED += edd / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))
                                    except: # empty
                                        correct = False
                                detCorrect += (1 if correct else 0)

...
...
...

Normalized_ED = 0 if Normalized_ED == 0 else 1 - float(Normalized_ED) / total_num

But the final result is:

[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: Task: E2E_RESULTS
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: precision,recall,hmean,Normalized_ED
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: 0.2496,0.2903,0.2684,0.9856
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: Task: DETECTION_ONLY_RESULTS
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: precision,recall,hmean
[07/25 16:07:22] d2.evaluation.testing INFO: copypaste: 0.8824,0.7510,0.8114
[07/25 16:07:22] d2.utils.events INFO:  eta: 0:00:00  iter: 299999  total_loss: 1.088  rec_loss: 0.451  loss_fcos_cls: 0.003  loss_fcos_loc: 0.037  loss_fcos_ctr: 0.600  loss_fcos_bezier: 0.006  time: 0.5275  data_time: 0.0287  lr: 0.000001  max_mem: 8354M

Why does Normalized_ED reach 0.9856 while hmean is only 0.2684? That is strange.

Eurus-Holmes commented 4 years ago

@Yuliang-Liu

What's more, what exactly is the difference between self._word_spotting = True and False?

From adet/evaluation/text_eval_script.py:

                                if evaluationParams['WORD_SPOTTING']:
                                    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
                                    if edd<=0: 
                                        correct = True
                                    else:
                                        correct = False
                                    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
                                else:
                                    try:
                                        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True
                                    except: # empty
                                        correct = False
                                detCorrect += (1 if correct else 0)

When edd<=0, doesn't gtTrans[gtNum].upper() match detTrans[detNum].upper() exactly? That should be the same as transcription_match, right?

I evaluated the JSON result file offline following the evaluation_example_scripts, with self._word_spotting = True and self._word_spotting = False separately.

The results are as follows:

self._word_spotting = True

Calculated!
"E2E_RESULTS: precision: 0.3152732368052312, recall: 0.387263339070568, hmean: 0.34757981462409887"
"DETECTION_ONLY_RESULTS: precision: 0.8746584927034051, recall: 0.7632282823584138, hmean: 0.8151529265641981"
OrderedDict([('E2E_RESULTS', {'precision': 0.3152732368052312, 'recall': 0.387263339070568, 'hmean': 0.34757981462409887}), ('DETECTION_ONLY_RESULTS', {'precision': 0.8746584927034051, 'recall': 0.7632282823584138, 'hmean': 0.8151529265641981})])

self._word_spotting = False

Calculated!
"E2E_RESULTS: precision: 0.3386419670820284, recall: 0.29549947668333526, hmean: 0.31560316721006054"
"DETECTION_ONLY_RESULTS: precision: 0.8746584927034051, recall: 0.7632282823584138, hmean: 0.8151529265641981"
OrderedDict([('E2E_RESULTS', {'precision': 0.3386419670820284, 'recall': 0.29549947668333526, 'hmean': 0.31560316721006054}), ('DETECTION_ONLY_RESULTS', {'precision': 0.8746584927034051, 'recall': 0.7632282823584138, 'hmean': 0.8151529265641981})])

Yuliang-Liu commented 4 years ago

@Eurus-Holmes edd is not defined in the else: branch. Also, are you sure that the / operator applied to edd produces a float? Is it not an int divided by an int?

Regarding "Why Normalized_ED could achieve 0.9856? But hmean is only 0.2684?": please print some examples to see why they are so different.

The word-spotting implementation comes from the official ICDAR 2015 evaluation code. You should be able to figure out the difference from the code.

Eurus-Holmes commented 4 years ago

@Yuliang-Liu

About edd, do you mean:

                                if evaluationParams['WORD_SPOTTING']:
                                    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
                                    if edd<=0:
                                        correct = True
                                    else:
                                        correct = False
                                    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
                                    Normalized_ED += float(edd) / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))
                                else:
                                    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
                                    try:
                                        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True

                                    except: # empty
                                        correct = False
                                    Normalized_ED += float(edd) / max(len(gtTrans[gtNum].upper()), len(detTrans[detNum].upper()))

About word spotting, could you kindly provide a link to the ICDAR 2015 official evaluation code? I could not find it on the ICDAR 2015 homepage, and in adet/evaluation/text_eval_script.py there indeed seems to be no difference between True and False.

youngboy52 commented 4 years ago

Hi, I used this repo to detect and recognize Chinese words in my own dataset, and I revised the code following the discussion above. It locates the words accurately, but almost all of the recognition results are wrong. A sample image: [image: session6589_0f1840347a019d09c6d22fa78b156975_thumb]. Do I need to adjust additional configs, such as POOLER_RESOLUTION, POOLER_SCALES, or CANONICAL_SIZE? Looking forward to your reply!

Gavin666Github commented 3 years ago

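@youngboy52 Hello, which parts need to be modified to handle Chinese data? Could you explain? Your VOC_SIZE=6804: does it cover 6803 Chinese characters, and did you compile that list yourself?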

youngboy52 commented 3 years ago

@Eurus-Holmes I achieved results similar to yours. There are four possible reasons:

Metric. Chinese recognition performance is usually evaluated with the 1-NED metric, but this implementation uses word accuracy, i.e., a single wrong character turns the whole instance into a false positive. For example, "百鸭传奇" does not match "匠鸭传奇", so it counts as a false positive.

Decoder. The attention mechanism is known to be good at learning semantic context [1]; in our previous experience it is better than CTC on English but worse on Chinese. The ReCT dataset also provides character-level bounding boxes, which should be useful for Chinese recognition [2].

[1] Wan, Zhaoyi, et al. "On Vocabulary Reliance in Scene Text Recognition." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.

[2] Xing, L., Tian, Z., Huang, W., et al. "Convolutional Character Networks." Proceedings of the IEEE International Conference on Computer Vision. 2019: 9126-9136.

Sequence. Have you checked the annotation order for instances like "青椒"? I mean, are the interpolated points on the short side or the long side? ReCT has many such instances, which might be a problem.

Data. Because Chinese text recognition is much more difficult than English, the amount of training data may not be enough. We usually train an independent recognition model with a very large amount of data.
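
To make the metric point concrete, here is a minimal sketch comparing word accuracy with 1-NED on the "百鸭传奇" example. The helper names `edit_distance` and `one_minus_ned` are assumptions for illustration and are not part of the evaluation script:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def one_minus_ned(gt: str, det: str) -> float:
    """1 - normalized edit distance; two empty strings count as a perfect match."""
    longest = max(len(gt), len(det))
    return 1.0 if longest == 0 else 1.0 - edit_distance(gt, det) / longest


gt, det = "百鸭传奇", "匠鸭传奇"
word_correct = gt == det              # False: one wrong character fails the whole word
similarity = one_minus_ned(gt, det)   # 0.75: three of four characters are right
print(word_correct, similarity)
```

Under word accuracy this prediction contributes nothing to precision or recall, while under 1-NED it still contributes 0.75, which is one reason the two numbers can diverge on long Chinese text lines.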

@Yuliang-Liu Based on these possible reasons, I made some changes.

I added the 1-NED metric:

    def cal_sim(str1, str2):
        """
        Normalized Edit Distance metric (1-N.E.D specifically)
        """
        m = len(str1) + 1
        n = len(str2) + 1
        matrix = np.zeros((m, n))
        for i in range(m):
            matrix[i][0] = i

        for j in range(n):
            matrix[0][j] = j

        for i in range(1, m):
            for j in range(1, n):
                if str1[i - 1] == str2[j - 1]:
                    matrix[i][j] = matrix[i - 1][j - 1]
                else:
                    matrix[i][j] = min(matrix[i - 1][j - 1], min(matrix[i][j - 1], matrix[i - 1][j])) + 1

        lev = matrix[m - 1][n - 1]
        if (max(m - 1, n - 1)) == 0:
            sim = 1.0
        else:
            sim = 1.0 - lev / (max(m - 1, n - 1))
        return sim

def include_in_dictionary_transcription(transcription):
...
...
...
    matchedSum = 0
    det_only_matchedSum = 0

    Rectangle = namedtuple('Rectangle', 'xmin ymin xmax ymax')

    gt = rrc_evaluation_funcs.load_zip_file(gtFilePath,evaluationParams['GT_SAMPLE_NAME_2_ID'])
    subm = rrc_evaluation_funcs.load_zip_file(submFilePath,evaluationParams['DET_SAMPLE_NAME_2_ID'],True)

    numGlobalCareGt = 0;
    numGlobalCareDet = 0;
    det_only_numGlobalCareGt = 0;
    det_only_numGlobalCareDet = 0;

    arrGlobalConfidences = [];
    arrGlobalMatches = [];

    Normalized_ED = 0
    total_num = 0

                for gtNum in range(len(gtPols)):
                    for detNum in range(len(detPols)):
                        if gtRectMat[gtNum] == 0 and detRectMat[detNum] == 0 and gtNum not in gtDontCarePolsNum and detNum not in detDontCarePolsNum :
                            if iouMat[gtNum,detNum]>evaluationParams['IOU_CONSTRAINT']:
                                gtRectMat[gtNum] = 1
                                detRectMat[detNum] = 1
                                #detection matched only if transcription is equal
                                # det_only_correct = True
                                # detOnlyCorrect += 1
                                if evaluationParams['WORD_SPOTTING']:
                                    edd = lstn.distance(gtTrans[gtNum].upper(), detTrans[detNum].upper())
                                    if edd<=0: 
                                        correct = True
                                    else:
                                        correct = False
                                    # correct = gtTrans[gtNum].upper() == detTrans[detNum].upper()
                                else:
                                    try:
                                        correct = transcription_match(gtTrans[gtNum].upper(),detTrans[detNum].upper(),evaluationParams['SPECIAL_CHARACTERS'],evaluationParams['ONLY_REMOVE_FIRST_LAST_CHARACTER'])==True
                                        sim = cal_sim(detTrans[detNum].upper(), gtTrans[gtNum].upper())
                                        Normalized_ED += sim
                                    except: # empty
                                        correct = False
                                detCorrect += (1 if correct else 0)
                                if correct:
                                    detMatchedNums.append(detNum)

Normalized_ED = 0 if Normalized_ED == 0 else float(Normalized_ED) / total_num
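
One thing that may be worth checking in the excerpt above: total_num is set to 0 and no increment is shown, and the similarity is only accumulated in the non-word-spotting branch, so the final average depends on code outside the excerpt. A minimal sketch of accumulating the score and the count together; `mean_similarity` is a hypothetical helper, and `sim_fn` could be the cal_sim function posted above:

```python
from typing import Callable, Iterable, Tuple


def mean_similarity(pairs: Iterable[Tuple[str, str]],
                    sim_fn: Callable[[str, str], float]) -> float:
    """Average sim_fn over (ground truth, detection) pairs, case-insensitively."""
    total = 0.0
    count = 0
    for gt, det in pairs:
        total += sim_fn(gt.upper(), det.upper())
        count += 1  # counted for every matched pair, regardless of correctness
    # Avoid a division by zero when no pair was matched at all.
    return total / count if count else 0.0
```

Keeping the sum and the count in the same loop makes it harder for the two to drift apart when the matching logic has several branches.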

However, the 1-NED result is similar to hmean.

I changed the decoder to the CTC module, but it does not seem to be better than attention:

"E2E_RESULTS: precision: 0.00013580498404291438, recall: 0.00011629259216187928, hmean: 0.00012529365700861392"

"DETECTION_ONLY_RESULTS: precision: 0.8812385414544713, recall: 0.7546226305384347, hmean: 0.8130305403288959"

[07/22 07:43:00 d2.engine.defaults]: Evaluation results for ReCTS_test in csv format:

[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS

[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.0001,0.0001,0.0001

[07/22 07:43:00 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS

[07/22 07:43:00 d2.evaluation.testing]: copypaste: precision,recall,hmean

[07/22 07:43:00 d2.evaluation.testing]: copypaste: 0.8812,0.7546,0.8130

[07/22 07:43:00 d2.utils.events]:  eta: 10:01:18  iter: 190999  total_loss: 1.071  rec_loss: 0.425  loss_fcos_cls: 0.002  loss_fcos_loc: 0.035  loss_fcos_ctr: 0.600  loss_fcos_bezier: 0.006  time: 0.3349  data_time: 0.0297  lr: 0.000010  max_mem: 5325M

Hi, I ran into a situation where the CTC decoder always predicts the blank label during training. The rec_loss looks normal for several iterations and then drops very quickly within a single iteration. Did you encounter this issue? Looking forward to your reply!

yustiks commented 3 years ago

@Yuliang-Liu Yes, I noticed recs[ix] = len(cV2). As for me, I set len(cV2) == len(CTLABELS) == 4135, _C.MODEL.BATEXT.VOC_SIZE == 4136, which including all character classes.

I have changed adet/config/defaults.py

# ---------------------------------------------------------------------------- #
# BAText Options
# ---------------------------------------------------------------------------- #
_C.MODEL.BATEXT.VOC_SIZE = 4136

adet/evaluation/text_evaluation.py, CTLABELS, which including all English and Chinese characters.

filepath = './ch.json'
CTLABELS = []
with open(filepath, 'r') as f:
    data = json.load(f)
    for key, value in data.items():
        CTLABELS.append(value)

def ctc_decode(rec):
    # ctc decoding
    last_char = False
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            if last_char != c:
                s += CTLABELS[c]
                last_char = c
        elif c == 4135:
            s += u'口'
        else:
            last_char = False
    return s

def decode(rec):
    s = ''
    for c in rec:
        c = int(c)
        if c < 4135:
            s += CTLABELS[c]
        elif c == 4135:
            s += u'口'
    # print(s)
    return s

Other parts have not changed.
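
Two details in decoders like the one above may be worth double-checking. The hard-coded 4135 has to agree with len(CTLABELS), and because `False != 0` evaluates to False in Python, a sequence that starts with index 0 would lose its first character. The sketch below is one possible variant under stated assumptions: it derives the indices from the label list, uses None as the "no previous character" marker, and (unlike the snippet above) also resets the repeat tracker after the unseen class. The argument names are hypothetical:

```python
from typing import List, Sequence


def ctc_decode(rec: Sequence[int], labels: List[str], unknown: str = "口") -> str:
    """Greedy CTC-style decoding: collapse repeats and skip blanks.

    Assumed index layout, mirroring the snippet above:
      0 .. len(labels) - 1 -> known characters
      len(labels)          -> "unseen" class
      anything larger      -> blank
    """
    unknown_idx = len(labels)
    last = None  # None, not False, so that index 0 is never treated as "already seen"
    out = []
    for c in rec:
        c = int(c)
        if c < unknown_idx:
            if c != last:
                out.append(labels[c])
            last = c
        elif c == unknown_idx:
            out.append(unknown)
            last = None
        else:
            last = None  # blank: the next character may repeat the previous one
    return "".join(out)
```

Note that a prediction made only of blank indices decodes to an empty string, which can never match a ground-truth transcription.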

At last, the output result is:

Calculated!
"E2E_RESULTS: precision: 0.0, recall: 0.0, hmean: 0"
"DETECTION_ONLY_RESULTS: precision: 0.875, recall: 0.3181818181818182, hmean: 0.4666666666666667"
[07/11 05:49:19 d2.engine.defaults]: Evaluation results for test in csv format:
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: E2E_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.0000,0.0000,0.0000
[07/11 05:49:19 d2.evaluation.testing]: copypaste: Task: DETECTION_ONLY_RESULTS
[07/11 05:49:19 d2.evaluation.testing]: copypaste: precision,recall,hmean
[07/11 05:49:19 d2.evaluation.testing]: copypaste: 0.8750,0.3182,0.4667
[07/11 05:49:19 d2.utils.events]:  eta: 0:00:02  iter: 19999  total_loss: 0.895  rec_loss: 0.289  loss_fcos_cls: 0.000  loss_fcos_loc: 0.009  loss_fcos_ctr: 0.596  loss_fcos_bezier: 0.002  time: 2.0044  data_time: 0.0029  lr: 0.000050  max_mem: 2667M
[07/11 05:49:19 d2.engine.hooks]: Overall training speed: 19997 iterations in 11:08:05 (2.0046 s / it)
[07/11 05:49:19 d2.engine.hooks]: Total training time: 11:08:28 (0:00:23 on hooks)

I noticed the iterations is 260000 in the configs/BAText/Pretrain/attn_R_50.yaml, my E2E_RESULTS is 0, is that because iters not enough? 20000 iters still is 0, is that normal?

You definitely saved my life! Thank you for the clarification on how to train the network on a custom dataset!

anruirui commented 3 years ago

@Eurus-Holmes @Yuliang-Liu Hello, sorry to bother you.

When I trained on the ICDAR 2017 RCTW dataset, an error appeared after some iterations: "ValueError: cannot convert float NaN to integer".

NO2-yh commented 3 years ago

@Eurus-Holmes @Yuliang-Liu Hello, sorry to bother you.

When I trained on the ICDAR 2017 RCTW dataset, an error appeared after some iterations: "ValueError: cannot convert float NaN to integer".

I ran into the same problem.


  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)

  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]

  File "/home/appuser/.local/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]

  File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/common.py", line 43, in __getitem__
    data = self._map_func(self._dataset[cur_idx])

  File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/utils/serialize.py", line 23, in __call__
    return self._obj(*args, **kwargs)

  File "/home/site-packages/adet/data/dataset_mapper.py", line 133, in __call__
    transforms = aug_input.apply_augmentations(self.augmentation)

  File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation.py", line 347, in apply_augmentations
    return AugmentationList(augmentations)(self)

  File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation.py", line 264, in __call__
    tfm = x(aug_input)

  File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation.py", line 165, in __call__
    tfm = self.get_transform(*args)

  File "/home/appuser/.local/lib/python3.6/site-packages/detectron2/data/transforms/augmentation_impl.py", line 176, in get_transform
    neww = int(neww + 0.5)

ValueError: cannot convert float NaN to integer

NO2-yh commented 3 years ago

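@anruirui I found the cause. In my case the images were too small, so cropping led to a division by zero. I turned cropping off, i.e. INPUT.CROP.ENABLED: False. You can check whether your problem is the same as mine.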
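
If cropping needs to stay enabled, another option might be to drop very small images before training. A minimal sketch, assuming the dataset records follow the detectron2 dict format with "height" and "width" keys; the helper name and the `min_side` threshold are hypothetical and would need tuning for the dataset:

```python
from typing import Dict, List


def drop_small_images(records: List[Dict], min_side: int = 32) -> List[Dict]:
    """Keep only records whose shorter side is at least min_side pixels."""
    return [r for r in records if min(r["height"], r["width"]) >= min_side]


records = [
    {"file_name": "a.jpg", "height": 480, "width": 640},
    {"file_name": "b.jpg", "height": 6, "width": 300},   # too small, would be dropped
]
kept = drop_small_images(records)
print([r["file_name"] for r in kept])
```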

anruirui commented 3 years ago

@NO2-yh Thank you.