HillZhang1999 / SynGEC

Code & data for our EMNLP2022 paper "SynGEC: Syntax-Enhanced Grammatical Error Correction with a Tailored GEC-Oriented Parser"
https://arxiv.org/abs/2210.12484
MIT License

Problem encountered during the preprocess stage #17

Closed lyh124 closed 1 year ago

lyh124 commented 1 year ago

Hello, I was following preprocess_syngec.sh to process the nlpcc2018 test data and hit an error while running preprocess.py. The error output is as follows:

2022-12-27 11:43:33 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', conll_suffix=['conll'], cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='../../preprocess/chinese_nlpcc2018_with_syntax_transformer/bin', dpd_suffix=['dpd'], empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, labeldict=['../../data/dicts/syntax_label_gec.dict'], log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, probs_suffix=['probs'], profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='src', source_lang_with_nt=None, srcdict='../../data/dicts/chinese_vocab.count.txt', swm_suffix='swm', target_lang='tgt', task='syntax-enhanced-translation', tensorboard_logdir=None, testpref='../../preprocess/chinese_nlpcc2018_with_syntax_transformer/test.char', tgtdict='../../data/dicts/chinese_vocab.count.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref=None, user_dir='../../src/src_syngec/syngec_model', validpref=None, workers=32)
2022-12-27 11:43:33 | INFO | fairseq_cli.preprocess | [src] Dictionary: 21132 types
2022-12-27 11:43:33 | INFO | fairseq_cli.preprocess | [src] ../../preprocess/chinese_nlpcc2018_with_syntax_transformer/test.char.src: 2000 sents, 60976 tokens, 0.0% replaced by
正在处理: ../../preprocess/chinese_nlpcc2018_with_syntax_transformer/test.char.src
Traceback (most recent call last):
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 551, in <module>
    cli_main()
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 547, in cli_main
    main(args)
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 421, in main
    make_all_matrix("conll")
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 408, in make_all_matrix
    lang=args.source_lang
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 269, in make_binary_matrix_dataset
    input_list = pickle.load(open(input_file, "rb"))  # input_matrix_list是预先处理好的包含矩阵的list文件,
_pickle.UnpicklingError: invalid load key, '\xe5'.

Part of the data in test.char.src looks like this:

[image: screenshot of sample lines from test.char.src]

Do you have any idea what might be causing this?

HillZhang1999 commented 1 year ago

Please check whether the syntax matrix file (conll) is a binary numpy file.
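One way to run that check is sketched below. The traceback shows preprocess.py calling pickle.load on the file, so the sketch simply tries to unpickle it and, if that fails, tests whether the bytes are plain UTF-8 text instead. The helper name describe_matrix_file and the path in the usage comment are hypothetical and not part of SynGEC. As a side note, 0xE5 is the first byte of the UTF-8 encoding of every character in the range U+5000 to U+5FFF, which covers many common Chinese characters, so an "invalid load key, '\xe5'" error is consistent with a plain-text file being passed where a pickled one was expected.

```python
import pickle
from pathlib import Path


def describe_matrix_file(path):
    """Classify a file as 'pickle', 'text', or 'unknown'.

    Hypothetical helper for illustration; it is not part of SynGEC.
    Only unpickle files you created yourself: pickle can execute
    arbitrary code while loading untrusted data.
    """
    raw = Path(path).read_bytes()
    try:
        # Same kind of call as in the traceback: the file must unpickle.
        pickle.loads(raw)
        return "pickle"
    except Exception:
        # Unpickling can fail with several exception types
        # (UnpicklingError, EOFError, ...); any failure means "not a pickle".
        pass
    try:
        # If the bytes decode cleanly as UTF-8, the file is probably
        # a plain-text file rather than a binary matrix file.
        raw.decode("utf-8")
        return "text"
    except UnicodeDecodeError:
        return "unknown"


# Usage (the path is a placeholder, substitute your own file):
# print(describe_matrix_file("path/to/your/file.conll"))
```

If the result is "text", the file is most likely the raw parser output rather than the binary matrix file that preprocess.py expects.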

lyh124 commented 1 year ago

OK, thank you for your reply.

GMago-LeWay commented 1 year ago

OK, thank you for your reply.

So how was this resolved in the end? I am running into the same problem.