Training with our own data

mattian7 commented 1 year ago

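Hello! If I want to train SynGEC on my own data, do I need to modify these four scripts in ./bash/*_exp: pipeline_gopar.sh, preprocess_syngec_*.sh, train_syngec_*.sh, and generate_syngec_*.sh? Is there anything else that needs to change? Could you please describe the exact workflow?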

HillZhang1999 commented 1 year ago

Thanks for your interest! I will write up the overall workflow in more detail soon.

HillZhang1999 commented 1 year ago

Hi, here is the detailed process for training SynGEC on your own data: 1) Prepare your dataset: first, pre-process your sentence-pair data into src.txt and tgt.txt (for test data, only src.txt is required). Each line contains one source/target sentence (do not perform word segmentation).

2) Train a GOPar and predict: change the data_dir variable in pipeline_gopar.sh to the directory that holds your data from step 1. Then run pipeline_gopar.sh to train a GOPar and use it to predict the syntactic information for all of your data.

Please note: to alleviate overfitting, you should use jack-knifing if you plan to train the GEC model on the same data that was used to build the GOPar. Jack-knifing means splitting your data into n folds, training a GOPar on n-1 folds, and predicting on the single held-out fold. Repeat this with each fold held out in turn (n rounds in total) so that syntactic information is predicted for the whole dataset. In our paper, we only use clang8 to train the GOPar for English and HSK to train the GOPar for Chinese; both are only parts of the full GEC training data.

3) Pre-process data for GEC training: follow the instructions in preprocess_syngec_*.sh to binarize all sentence-pair files and syntax-related files for GEC training. You only need to change the paths TRAIN_SRC_FILE, TRAIN_TGT_FILE, VALID_SRC_FILE, and VALID_TGT_FILE.

4) Train SynGEC: in train_syngec_*.sh, change the training data directories (e.g., PROCESSED_DIR_STAGE1), rename the model saving directory, and run the script.

5) Make predictions with SynGEC: see generate_syngec_*.sh.
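As a rough illustration of step 1, here is a minimal Python sketch that writes sentence pairs to src.txt and tgt.txt with one sentence per line. It is not part of the SynGEC repository: the function name `write_parallel_files` and the in-memory list of pairs are assumptions made for this example, and your own data loading will likely differ.

```python
from pathlib import Path


def write_parallel_files(pairs, out_dir):
    """Write (source, target) pairs to src.txt / tgt.txt, one sentence per line.

    Hypothetical helper for illustration only; not a SynGEC script.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    src_lines, tgt_lines = [], []
    for src, tgt in pairs:
        # Join embedded line breaks with a space so each sentence stays on one line.
        # No tokenization or word segmentation is applied.
        src_lines.append(" ".join(src.splitlines()).strip())
        tgt_lines.append(" ".join(tgt.splitlines()).strip())

    (out_dir / "src.txt").write_text("\n".join(src_lines) + "\n", encoding="utf-8")
    (out_dir / "tgt.txt").write_text("\n".join(tgt_lines) + "\n", encoding="utf-8")
    return len(src_lines)
```

Because both files are written from the same loop, they always have the same number of lines, which makes a quick line-count check a useful sanity test before running the later steps.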

If you follow the steps above, you should get the final GEC results. Please try it and let me know if you run into any problems.
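The jack-knifing note in step 2 can be sketched as follows. This is only an illustration of the fold bookkeeping, under the assumption that sentences are assigned to folds by index (round-robin); the name `jackknife_splits` is hypothetical, and training the GOPar and predicting on each held-out fold would still be done with pipeline_gopar.sh.

```python
def jackknife_splits(num_sentences, n_folds):
    """Yield (train_indices, heldout_indices) once per fold.

    Hypothetical helper: sentence i is assigned to fold i % n_folds.
    """
    if not 2 <= n_folds <= num_sentences:
        raise ValueError("n_folds must be between 2 and num_sentences")

    for fold in range(n_folds):
        # The parser for this round is trained on every fold except `fold` ...
        train = [i for i in range(num_sentences) if i % n_folds != fold]
        # ... and then predicts syntax only for the sentences it has not seen.
        heldout = [i for i in range(num_sentences) if i % n_folds == fold]
        yield train, heldout


if __name__ == "__main__":
    for round_id, (train, heldout) in enumerate(jackknife_splits(10, 5)):
        print(f"round {round_id}: train on {len(train)} sentences, predict on {heldout}")
```

Across all rounds, every sentence appears in exactly one held-out set, so concatenating the per-round predictions (restored to the original order) covers the whole dataset without any sentence being parsed by a model that was trained on it.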

mattian7 commented 1 year ago

Thanks a lot! I will try it as soon as possible.

HillZhang1999 / SynGEC

Training with our own data #4