Closed lyh124 closed 1 year ago
Hello, while following preprocess_syngec.sh to process the NLPCC 2018 test data, I hit an error when running preprocess.py. The error output is as follows:
2022-12-27 11:43:33 | INFO | fairseq_cli.preprocess | Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_shard_count=1, checkpoint_suffix='', conll_suffix=['conll'], cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='../../preprocess/chinese_nlpcc2018_with_syntax_transformer/bin', dpd_suffix=['dpd'], empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, labeldict=['../../data/dicts/syntax_label_gec.dict'], log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=True, optimizer=None, padding_factor=8, probs_suffix=['probs'], profile=False, quantization_config_path=None, scoring='bleu', seed=1, source_lang='src', source_lang_with_nt=None, srcdict='../../data/dicts/chinese_vocab.count.txt', swm_suffix='swm', target_lang='tgt', task='syntax-enhanced-translation', tensorboard_logdir=None, testpref='../../preprocess/chinese_nlpcc2018_with_syntax_transformer/test.char', tgtdict='../../data/dicts/chinese_vocab.count.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref=None, user_dir='../../src/src_syngec/syngec_model', validpref=None, workers=32)
2022-12-27 11:43:33 | INFO | fairseq_cli.preprocess | [src] Dictionary: 21132 types
2022-12-27 11:43:33 | INFO | fairseq_cli.preprocess | [src] ../../preprocess/chinese_nlpcc2018_with_syntax_transformer/test.char.src: 2000 sents, 60976 tokens, 0.0% replaced by
正在处理: ../../preprocess/chinese_nlpcc2018_with_syntax_transformer/test.char.src
Traceback (most recent call last):
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 551, in <module>
    cli_main()
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 547, in cli_main
    main(args)
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 421, in main
    make_all_matrix("conll")
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 408, in make_all_matrix
    lang=args.source_lang
  File "../../src/src_syngec/fairseq-0.10.2/fairseq_cli/preprocess.py", line 269, in make_binary_matrix_dataset
    input_list = pickle.load(open(input_file, "rb")) # input_matrix_list是预先处理好的包含矩阵的list文件,
_pickle.UnpicklingError: invalid load key, '\xe5'.
Part of the data in test.char.src is shown below:
Do you have any idea what might be causing this?
Please check whether the syntax matrix file (conll) is a binary numpy file.
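One way to run that check is sketched below. It is only a rough illustration: the helper name `sniff_format` is hypothetical and not part of SynGEC or fairseq, and it only inspects the first bytes of whatever path you pass in. The traceback shows the file being read with `pickle.load`, and the rejected byte `'\xe5'` is a common first byte of UTF-8 encoded Chinese characters, so a result of "text" would suggest that a plain-text file is being passed where a pickled matrix list is expected.

```python
import codecs


def sniff_format(path):
    """Guess whether a file is a pickle, a .npy array, or UTF-8 text.

    Hypothetical helper for illustration; it looks only at the first 64 bytes.
    """
    with open(path, "rb") as f:
        head = f.read(64)

    # .npy files written by numpy.save start with this magic string.
    if head[:6] == b"\x93NUMPY":
        return "npy"

    # Pickle protocols 2 to 5 start with the PROTO opcode 0x80 followed by
    # the protocol number. Protocols 0 and 1 are not detected by this check.
    if len(head) >= 2 and head[0] == 0x80 and 2 <= head[1] <= 5:
        return "pickle"

    # An incremental decoder tolerates a multi-byte character that was cut
    # off at the end of the 64-byte window.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
        return "text"
    except UnicodeDecodeError:
        return "unknown"
```

Calling `sniff_format` on the file that `make_binary_matrix_dataset` tries to open (the exact filename depends on your setup) should return "pickle" if the syntax matrices were generated correctly.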
OK, thank you for your reply.
So how was this resolved in the end? I am running into the same problem.