What should I do when classical poetry generated by a model trained with the UER-py framework looks like gibberish?

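Hello, I trained a classical poetry model with UER's GPT-2 pre-training method, but at prediction time the generated text looks essentially random, and it sometimes contains a lot of [UNK] tokens. Could you tell me why this happens? My input was “床前明月光，”.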

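My data is the Chinese poetry set from the nlp_chinese_corpus repository. Following the format of book_review.txt, I put one poem on each line, then ran preprocessing and pre-training with the commands from your GPT-2 pre-training example. My data text looks roughly like this: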

The pre-training command is as follows:

CUDA_VISIBLE_DEVICES=1 python3 pretrain.py --dataset_path datasets/poems.pt --vocab_path models/google_zh_vocab.txt --output_model_path models/poems_model.bin --config_path models/gpt2/config.json --learning_rate 1e-4 --world_size 1 --gpu_ranks 0 --tie_weight --embedding word_pos --remove_embedding_layernorm --encoder transformer --mask causal --layernorm_positioning pre --target lm

The test command is as follows:

python generate_lm.py --load_model_path ../models/poems_model.bin-100000 --vocab_path ../models/google_zh_vocab.txt --test_path ../corpora/test_poems.txt --prediction_path ../corpora/predicted.txt --config_path ../models/gpt2/config.json
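One diagnostic that might help narrow down the [UNK] tokens is to measure how many characters in the poetry corpus are missing from the vocabulary file. The sketch below is only a rough check, not part of UER-py: the helper names `load_vocab` and `oov_report` are hypothetical, and it assumes every non-whitespace character is treated as one token, which only approximates what the real tokenizer does.

```python
from collections import Counter


def load_vocab(path):
    """Read a vocabulary file with one token per line into a set."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}


def oov_report(lines, vocab, top_n=20):
    """Return (out-of-vocabulary rate, most frequent missing characters).

    Assumption: each non-whitespace character counts as one token.
    """
    counts = Counter(ch for line in lines for ch in line if not ch.isspace())
    total = sum(counts.values())
    missing = Counter({ch: n for ch, n in counts.items() if ch not in vocab})
    rate = sum(missing.values()) / total if total else 0.0
    return rate, missing.most_common(top_n)


# Toy in-memory example; with real files one would instead do something like
#   vocab = load_vocab("models/google_zh_vocab.txt")
#   lines = open("corpora/poems.txt", encoding="utf-8")  # hypothetical corpus path
toy_vocab = {"床", "前", "明", "月", "光", "，"}
rate, top_missing = oov_report(["床前明月光，疑"], toy_vocab)
print(f"OOV rate: {rate:.2%}")
print("Most frequent missing characters:", top_missing)
```

A high rate here would suggest that many characters in the corpus (for example traditional or rare characters) are mapped to [UNK] during preprocessing. A low rate would point away from the vocabulary and toward other causes, such as the training setup or the generation settings.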

Morizeyao / GPT2-Chinese, issue #201