brightmart / bert_language_understanding

Pre-training of Deep Bidirectional Transformers for Language Understanding: pre-train TextCNN

Problem with tokenize_style=char #10

Closed godfatherzzx closed 6 years ago

godfatherzzx commented 6 years ago

I downloaded the Chinese corpus from the cloud drive and set tokenize_style=char. Lines 71 and 232 of pretrain_task.py read:

string_list=[x for x in jieba.lcut(sentence.strip()) if x and x not in ["\"",":","、",",",")","("]]
string_list = [x for x in jieba.lcut(sentence.strip()) if x and x not in ["\"", ":", "、", ",", ")", "("]]

These lines may also need to handle the input differently depending on the switch, for example:

string_list = [x for x in sentence.strip() if x and x not in ["\"", ":", "、", ",", ")", "("]]
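The switch described above could be sketched roughly as follows. This is only an illustration, not the repository's actual code: the function name `tokenize_sentence`, the `word_segmenter` parameter, and the `"word"` value for the non-character mode are assumptions made for this example. Only `tokenize_style=char`, `jieba.lcut`, and the punctuation filter come from the thread.

```python
# Punctuation filtered out in the lines quoted from pretrain_task.py.
SKIP_TOKENS = {"\"", ":", "、", ",", ")", "("}


def tokenize_sentence(sentence, tokenize_style="word", word_segmenter=None):
    """Split a sentence into tokens according to tokenize_style.

    Hypothetical helper: "char" splits into single characters; any other
    value (assumed here to be "word") uses a word segmenter, jieba by default.
    """
    text = sentence.strip()
    if tokenize_style == "char":
        # Character-level: iterate over the string directly, no jieba call.
        tokens = list(text)
    else:
        if word_segmenter is None:
            # Imported lazily so that char mode does not require jieba.
            import jieba
            word_segmenter = jieba.lcut
        tokens = word_segmenter(text)
    # Same filter as in the quoted lines: drop empty tokens and punctuation.
    return [x for x in tokens if x and x not in SKIP_TOKENS]


print(tokenize_sentence("我爱(北京)", tokenize_style="char"))
# ['我', '爱', '北', '京']
```

With a helper of this kind, both call sites (lines 71 and 232) would share one code path, so the character-level and word-level modes could not drift apart.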

Thank you very much for your work.

brightmart commented 6 years ago

Yeah, you are right.