brightmart / albert_zh

A LITE BERT FOR SELF-SUPERVISED LEARNING OF LANGUAGE REPRESENTATIONS, large-scale Chinese pretrained ALBERT models
https://arxiv.org/pdf/1909.11942.pdf

Question about pretraining data construction #161

Open mwei314 opened 3 years ago

mwei314 commented 3 years ago

At https://github.com/brightmart/albert_zh/blob/652faed6b362c730eb046e9a2e5620d898736a01/create_pretraining_data.py#L567, after jieba word segmentation the Chinese `tokens` carry `##` prefixes, while `output_tokens` have the `##` stripped. Will this inconsistency between the two affect pretraining quality?
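
To make the question concrete, here is a minimal sketch of the pattern being described (this is not the repository's exact code; `mark_whole_words` and `strip_marks` are illustrative names). The `##` marks appear to act only as a temporary signal so that whole-word masking can treat a jieba word as one unit, and are stripped again when `output_tokens` are written:

```python
import jieba


def mark_whole_words(chars):
    """Prefix non-initial characters of each jieba word with '##' (illustrative only)."""
    text = "".join(chars)
    marked = []
    for word in jieba.cut(text):
        for i, ch in enumerate(word):
            marked.append(ch if i == 0 else "##" + ch)
    return marked


def strip_marks(tokens):
    """Remove the temporary '##' marks before emitting output_tokens."""
    return [t[2:] if t.startswith("##") else t for t in tokens]


tokens = mark_whole_words(list("今天天气很好"))
print(tokens)         # e.g. ['今', '##天', '天', '##气', ...] (exact split depends on jieba's dictionary)
output_tokens = strip_marks(tokens)
print(output_tokens)  # plain characters, matching what the Chinese vocab actually contains
```

If that reading is right, the mismatch is intentional: the marked `tokens` only steer which positions get masked together, while `output_tokens` must contain plain characters that exist in the vocabulary. It would still be good to have the maintainer confirm this.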