NLP2CT / MultimodalGEC


Evaluating mT5-Large Text Model on German dataset #3

Open gsmoon97 opened 1 year ago

gsmoon97 commented 1 year ago

Hi, I am trying to reproduce the results for the mT5-Large text model (F0.5 score of 70.1) presented in Table 3 of your paper.

Would it be possible to share the script you have used to train the mT5-Large text model on CLang8 German data and evaluate the performance on the Falko-MERLIN German dataset? I have tried to find the settings of the hyper-parameters used for this experiment, but I could only find the hyper-parameters for evaluating multimodal GEC models in Appendix A.2 of your paper.

Thank you!

fangtao-123 commented 1 year ago

Hi Moon,

The experiment results for the mT5-Large text model (F0.5 score of 70.1) were copied directly from the original paper (Rothe et al.).

If you are interested in reproducing these experiment results, there are two methods you can consider:

(1) Use the Hugging Face Transformers library. You may also find Table 10 of our other paper (Fang et al.) useful as a reference. To facilitate replication, we have provided the specific parameters in this link; see the sketch after this list.

(2) Alternatively, modify our code: change line 285 so that only the text hidden state is passed, and line 542 so that the loss is updated as `loss = loss`.
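For reference, here is a minimal sketch of what method (1) could look like with the Hugging Face Transformers library. The file name `clang8_de.tsv`, the `source`/`target` column names, and all hyper-parameter values below are placeholders, not the settings used in the paper; the actual parameters are in the link above and in Table 10 of Fang et al.

```python
# Minimal sketch: fine-tune mT5-Large on German GEC pairs with Transformers.
# Paths, column names, and hyper-parameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

model_name = "google/mt5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Assumes a TSV with "source" (erroneous) and "target" (corrected) columns
# built from the CLang8 German data; adapt to however you store the pairs.
raw = load_dataset("csv", data_files={"train": "clang8_de.tsv"}, delimiter="\t")

def preprocess(batch):
    enc = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc

train_ds = raw["train"].map(
    preprocess, batched=True, remove_columns=raw["train"].column_names
)

args = Seq2SeqTrainingArguments(
    output_dir="mt5_large_gec_de",
    learning_rate=1e-4,               # placeholder; use the released settings
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    predict_with_generate=True,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```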

Best regards, Tao

gsmoon97 commented 1 year ago

Hi,

Thank you so much for the reply. I am currently trying out method (2) to reproduce the experiment results. Apart from the code modifications you mentioned, are there any other changes I would need to make, or can I just run the provided train_and_eval_German.sh? For instance, I initially thought I would have to modify a few parameters in train_and_eval_German.sh, such as toggling multimodal from True to False.

Thank you for your help!

fangtao-123 commented 1 year ago

You may need to debug the code, but the shell script itself should run without adjustments, although it might still be necessary to change some hyper-parameters. I recommend focusing on the first approach, as the model produced that way can also serve as the initialization for subsequent multimodal text models (if needed).

gsmoon97 commented 1 year ago

Thank you for the prompt reply! I have several follow-up questions.

  1. May I confirm that the method below is expected to reproduce the ablation results presented in Table 4 of your paper? More specifically, by following this method, should I expect results that correspond to the "OURS((M)T5) - Speech Enc." row after testing on the Falko-MERLIN dataset?

> (2) Another option is to modify our code by changing line 285 of the code to pass only the hidden state of the text and line 542 to update the loss as follows: `loss = loss`.

  2. My understanding of the provided script is that it should reproduce the results for the mT5-Large model, not the PMT5-Large model (from Table 3), since it does not pre-train on the 10M German synthetic data. May I confirm if this is the case?

  3. May I ask how you fine-tune on the official Falko-MERLIN training data after training on the CLang8 training data? The provided code seems to take only one training file, but the CLang8 and Falko-MERLIN training data were released as two separate files. Would it be possible to reproduce the results by simply combining these files, or would the source code need to be modified to first train on CLang8 data and then run a subsequent round of training on Falko-MERLIN data?

Thank you so much and I appreciate your help.

fangtao-123 commented 1 year ago
  1. Yes, following the provided method should produce results that correspond to the "OURS((M)T5) - Speech Enc." row in Table 4 of the paper.

  2. Yes, your understanding is correct. The provided script is expected to reproduce experiment results for the mT5-Large model, not the PMT5-Large model from Table 3, as it does not pre-train on the 10M German synthetic data.

  3. The fine-tuning process involves training the model first on the CLang8 training data and then fine-tuning it on the Falko-MERLIN training data (as described in Section 4.1 of the paper); a rough sketch of this two-stage schedule is included below. However, you could also try combining the two datasets and using the combined set for fine-tuning. The provided source code may need some modifications to accommodate the combined dataset, but it is worth trying to see whether it produces comparable results.
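For concreteness, here is a rough sketch of the two-stage schedule (not the repository's actual training code), again using the Hugging Face Transformers library. File names, column names, and hyper-parameter values are placeholders.

```python
# Sketch of two-stage fine-tuning: first on CLang8 German pairs, then
# continuing from that checkpoint on the Falko-MERLIN training split.
# All paths and hyper-parameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")

def make_dataset(path):
    # Assumes tab-separated "source"/"target" pairs, as in the earlier sketch.
    raw = load_dataset("csv", data_files={"train": path}, delimiter="\t")["train"]
    def preprocess(batch):
        enc = tokenizer(batch["source"], max_length=128, truncation=True)
        enc["labels"] = tokenizer(
            text_target=batch["target"], max_length=128, truncation=True
        )["input_ids"]
        return enc
    return raw.map(preprocess, batched=True, remove_columns=raw.column_names)

def run_stage(init_checkpoint, train_file, output_dir, lr):
    # Load either the base model or the previous stage's checkpoint.
    model = AutoModelForSeq2SeqLM.from_pretrained(init_checkpoint)
    trainer = Seq2SeqTrainer(
        model=model,
        args=Seq2SeqTrainingArguments(
            output_dir=output_dir,
            learning_rate=lr,                 # placeholder values
            per_device_train_batch_size=4,
            num_train_epochs=3,
            save_strategy="epoch",
        ),
        train_dataset=make_dataset(train_file),
        data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    )
    trainer.train()
    trainer.save_model(output_dir)
    return output_dir

# Stage 1: CLang8 German; Stage 2: continue from the stage-1 checkpoint on Falko-MERLIN.
stage1 = run_stage("google/mt5-large", "clang8_de.tsv", "ckpt_clang8", lr=1e-4)
run_stage(stage1, "falko_merlin_train.tsv", "ckpt_falko_merlin", lr=5e-5)
```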