awasthiabhijeet / PIE

Fast + Non-Autoregressive Grammatical Error Correction using BERT. Code and Pre-trained models for paper "Parallel Iterative Edit Models for Local Sequence Transduction": www.aclweb.org/anthology/D19-1435.pdf (EMNLP-IJCNLP 2019)
MIT License

Bad results trying to train using end_to_end.sh #18

Closed bexxxxx closed 4 years ago

bexxxxx commented 4 years ago

Hi,

I am trying to run end_to_end.sh, following the instructions in the example_scripts README; however, I am getting bad predictions and therefore cannot reproduce anywhere near the 26.6 F-score mentioned.

I have corrected the typo in end_to_end.sh (./m2eval.sh --> ./m2_eval.sh) and am using https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip on my end, so the code runs, but I see results like this, for example:

Test data: Keeping the Secret of Genetic Testing

Prediction round 1: . . . . . . . keeping on a on for the Good , and Secret . . So . . , and of , the , and the , and genetic , and , and for the Testing . . . . . . . . [SEP] . . . . . . . [SEP] .

Any tips on where I'm going wrong?

awasthiabhijeet commented 4 years ago

Hi @bexxxxx, apologies for the late response. Are you still facing the same problem?

Since you are using end_to_end.sh, I assume you wish to train your model from scratch (instead of fine-tuning the existing PIE model). Did you check the outputs obtained after preprocess.sh completed? Please check whether the pickle files containing word pieces are generated as you expect them to be. Also, check whether your parallel corpus is properly converted into the form of incorrect tokens and edit ids. You may also want your model to be initialized with the pre-trained BERT checkpoint. Please double-check all the paths you provide in the scripts.
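
For a quick sanity check, something like the following should work (a rough sketch; the pickle file name and record layout below are placeholders, so substitute whatever paths your preprocess.sh run actually wrote):

```python
# Sketch: inspect the preprocessing outputs before training.
# The path below is a placeholder -- use the pickle your run produced.
import pickle

with open("pickles/incorr_word_pieces.p", "rb") as f:
    records = pickle.load(f)

print(len(records), "examples")
# Each record should be a sequence of BERT word-piece ids; in the cased
# vocab, 101 is [CLS] and 102 is [SEP].
for rec in records[:5]:
    print(rec)
```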

bexxxxx commented 4 years ago

Hi, thank you for your response, I am still facing this issue.

My initial goal was simply to clone the repo and reproduce your results in the example_scripts directory. However, that leads to the bad results above, despite leaving the clone untouched except for fixing the typo, downloading the BERT model, and setting the TPU flag to false.

I have double-checked the paths and the pickle file generation; I am using bert.ckpt to initialize, and the incorrect tokens and edit ids appear to be okay. Here are the first 5 of each:

Incorrect tokens


101 1422 1411 1110 170 5143 2060 1331 1114 14654 4032 4131 119 102
101 1135 1144 170 1344 3476 1416 1272 1157 1353 3441 119 102
101 2711 1104 1122 1110 1126 3924 1331 117 1175 1132 1242 7116 1105 2853 4822 119 102
101 146 18029 5807 1103 8246 3521 1107 1103 172 7340 1200 1104 1103 1331 1134 1110 4405 1118 170 2493 119 102
101 11415 15297 1132 1304 1887 1105 1211 1104 1172 2906 1103 1957 20392 1121 1103 1331 119 102

Edit Ids

3 3 3 3 3 25 1851 3 3 3 3 3 3 3
3 3 3 3 25 3 3 14 3 3 3 3 3
3 1171 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 4 1172 4 3 3 3 3 3 3 3 3 3 3 3
3 3 3 3 3 3 3 3 3 3 3 3 3 3 936 3 3 3 3
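
To eyeball these, the ids can be decoded back into word pieces with the BERT tokenizer (a quick sketch; tokenization.py is the module from the BERT codebase, and the vocab path is wherever the downloaded model was unzipped):

```python
# Sketch: map BERT word-piece ids back to readable tokens.
import tokenization  # tokenization.py from the BERT codebase

tokenizer = tokenization.FullTokenizer(
    vocab_file="cased_L-24_H-1024_A-16/vocab.txt", do_lower_case=False)

ids = [101, 1422, 1411, 1110, 170, 5143, 2060, 1331, 1114,
       14654, 4032, 4131, 119, 102]
# 101 and 102 are [CLS] and [SEP] in the cased vocab; 119 is ".".
print(tokenizer.convert_ids_to_tokens(ids))
```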

I have also attempted fine-tuning, after reading https://github.com/awasthiabhijeet/PIE/issues/10. I assumed I needed to change the init checkpoint in pie_train.sh from the BERT checkpoint to pie_model.ckpt and then run end_to_end.sh with the new init checkpoint. Using your provided test data, my predictions again came out poorly.
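
As a sanity check on the checkpoint path itself, something like this (TF 1.x, which BERT uses) can list the variables stored in pie_model.ckpt to confirm it is loadable:

```python
# Sketch: verify the checkpoint resolves and contains BERT-shaped variables.
import tensorflow as tf

for name, shape in tf.train.list_variables("pie_model.ckpt"):
    print(name, shape)
```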

Can you confirm if that was the correct process for fine-tuning?

awasthiabhijeet commented 4 years ago

Hi @bexxxxx, yes, the process you are following looks correct. Did you check your M2 score on the test data after fine-tuning the BERT checkpoint on 1000 sentences? By the way, training on just 1000 sentences is only for demonstration; ideally, you would want to extract pickle files from much larger training data. So, in addition to initializing your model with pie_model.ckpt, are you also using the pickle files provided here? These pickles were extracted from the original training set.

awasthiabhijeet commented 4 years ago

Ideally, the [SEP] token should not appear in your output. I am not sure what is going wrong there.

bexxxxx commented 4 years ago

@awasthiabhijeet thank you for your continued help. I'll focus on fine-tuning PIE, to keep things to one problem at a time.

I am using the provided pickle files: I have commented out the preprocessing code in preprocess.sh and replaced the relevant paths (in multi_round_infer.sh, pie_infer.sh, pie_train.sh, and preprocess.sh) with pickles/conll (as mentioned in https://github.com/awasthiabhijeet/PIE/issues/10). I then run end_to_end.sh, and these are my results:

Splitting and converting to integers
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1312/1312 [00:00<00:00, 11391.05it/s]
Applying opcodes
 25%|██████████████████████████████████████                                                                                                                | 333/1312 [00:00<00:02, 346.00it/s]transform
**** WARNING: SUFFIX : e NOT PRESENT in WORD: occur ****
transform
**** WARNING: SUFFIX : e NOT PRESENT in WORD: occur ****
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1312/1312 [00:02<00:00, 466.16it/s]
Wrote scratch/multi_round_3_test_predictions.txt
Precision   : 0.6303
Recall      : 0.4603
F_0.5       : 0.5869
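
As a sanity check, these three numbers are internally consistent: F_0.5 = 1.25 * P * R / (0.25 * P + R) = 1.25 * 0.6303 * 0.4603 / (0.25 * 0.6303 + 0.4603) ≈ 0.5869.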

The F score is quite high; does this look right to you?

awasthiabhijeet commented 4 years ago

Hi @bexxxxx, indeed, this looks right to me.

bexxxxx commented 4 years ago

Should I be concerned about the warning, or about "Applying opcodes" not reaching 100%?

awasthiabhijeet commented 4 years ago

"Applying opcodes" does reach 100% in the output you have shared. At 25% there was a warning: "SUFFIX : e NOT PRESENT in WORD: occur". This simply means a transform was predicted that expected an "e" suffix in the word to be edited; since "occur" does not end in "e", the transform is not applied. A well-trained model should ideally produce fewer such warnings.
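
In code, the guard behaves roughly like this (a sketch of the behaviour just described, not the repository's exact implementation; the function name and the example edit are made up for illustration):

```python
def apply_suffix_transform(word, suffix, edit):
    """Apply a suffix-conditioned edit only if the word actually ends
    with the expected suffix; otherwise warn and leave it unchanged.

    Sketch only: `edit` stands in for whatever transform the model
    predicted (e.g. removing or replacing the suffix).
    """
    if word.endswith(suffix):
        return edit(word)
    print("**** WARNING: SUFFIX : %s NOT PRESENT in WORD: %s ****"
          % (suffix, word))
    return word

# "occur" does not end in "e", so the edit is skipped with a warning:
print(apply_suffix_transform("occur", "e", lambda w: w[:-1]))  # -> "occur"
```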

Also, since you are fine-tuning the pre-trained model on just 1000 sentences, performance dropped from F_0.5 = 60.xx to F_0.5 = 58.69 (most likely due to some overfitting).

bexxxxx commented 4 years ago

Sorry for the late reply -- thank you for all of your help in understanding this!

awasthiabhijeet commented 4 years ago

Thanks for your interest in our work. Grammarly has recently built upon PIE: https://arxiv.org/pdf/2005.12592.pdf

I believe the key differences in their work relate mainly to the fine-tuning strategy and to trying different encoders. They also augment the word-level transformations with new operations.

Closing this issue.