dataset - Githubissues

Lj4040 commented 1 year ago

Hello, after reading your paper, I think your work is very good, so I have been trying to reproduce it. Generating the English speech files for the CLang8 dataset is slow because the dataset is large: on a single RTX 3090 GPU, generation ran for two days and still had not finished a single pass. I see that the script you provided has to be run for 00/01/02, i.e. 3 times. Does that mean I have to generate the same speech files 3 times? Is there any way to speed it up? Also, I assume the English CLang8 data needs to be split across several directories because there is so much of it; I see that the German CL8 data is divided into many directories, each containing many audio files. Thank you for your reply!

fangtao-123 commented 1 year ago

Hi. Yes, the English TTS model is relatively slow at generation. You only need to generate the data once: 00, 01, 02, ... are folder numbers, since the dataset is large and is split into many subfolders. You could also consider using Google's Text-to-Speech (gTTS) API to generate MP3 files, which may be much faster and use less memory.
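For readers who want to try the gTTS route, here is a minimal sketch, not the repository's script. It assumes the third-party `gtts` package (a Python library that calls Google Translate's text-to-speech and therefore needs network access). The helper names `shard_path` and `synthesize_all`, the `per_dir` shard size, and the file naming are hypothetical choices made for illustration. Each sentence is written exactly once into a numbered subfolder (00, 01, 02, ...), and files that already exist are skipped so an interrupted run can be resumed.

```python
from pathlib import Path


def shard_path(root, index, per_dir=10000, ext=".mp3"):
    """Return the output path for sentence `index`, grouped into numbered subfolders.

    The layout (root/00/0.mp3, root/00/1.mp3, ...) is an assumed convention,
    not necessarily the one used by the repository.
    """
    shard = index // per_dir
    return Path(root) / f"{shard:02d}" / f"{index}{ext}"


def synthesize_all(sentences, root, lang="en", per_dir=10000):
    """Generate one MP3 per sentence with gTTS, skipping files that already exist."""
    from gtts import gTTS  # third-party package; imported lazily, requires network access

    written = []
    for i, text in enumerate(sentences):
        out = shard_path(root, i, per_dir)
        if out.exists() or not text.strip():
            continue  # already generated on a previous run, or nothing to speak
        out.parent.mkdir(parents=True, exist_ok=True)
        gTTS(text=text, lang=lang).save(str(out))
        written.append(out)
    return written
```

Calling `synthesize_all(sentences, "audio_en")` on the list of source sentences would fill `audio_en/00`, `audio_en/01`, and so on in a single pass. Whether the resulting MP3 files can be fed to the model without conversion depends on the audio loader the code uses.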

Lj4040 commented 1 year ago

Thank you very much for your reply! I have another question that I hope you can help with. Looking at the code, I found that no model is saved during validation; instead, the corrections for the ungrammatical sentences are generated directly, as on the test set. So how do you choose the best file? Is each generated candidate file (the .candidate file) fed to the M2 scorer in turn to select the best result? I looked at the parameters and found that German training runs for 21 epochs and that test-set generation happens every 3 epochs. That is to say, it will generate 7 times, and you then compare the 7 generated candidates one by one to select the best one? Or do you choose the best one based on the ROUGE value recorded in the log?

fangtao-123 commented 1 year ago

You should first evaluate the development set with the M2 scorer; there is no need to choose based on the ROUGE value recorded in the log. Then select the best development result and take the test-set output from the same epoch.
Yes, it will generate 7 times. You can also try adjusting these parameters.

NLP2CT / MultimodalGEC

dataset #4