codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0
6.11k stars, 1.29k forks

Vocab Replace \t to blank issue #33

Open NiHaoUCAS opened 5 years ago

NiHaoUCAS commented 5 years ago

When the corpus is: how are you \t nice to meet you and I run the bert-vocab command, the resulting vocab is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'you', 'are', 'how', 'meet', 'nice', 'to'].
But when I change the corpus to how are you\tnice to meet you, the result is ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'are', 'how', 'meet', 'to', 'you', 'younice']; the last token becomes younice. A blank is needed on both sides of '\t'. It may not be a bug.

jiqiujia commented 5 years ago

I think this is a bug. The problem is on line 127 of vocab.py: words = line.replace("\n", "").replace("\t", "").split(). Here \t is replaced by "" (an empty string), so the words on either side of the tab are joined into one token. I think it should be replaced by a space.
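
To make the difference concrete, here is a minimal sketch that isolates the quoted expression. The function names tokenize_buggy and tokenize_fixed are hypothetical, introduced only for illustration; they are not part of the repository, and the real vocab builder does more than this.

```python
def tokenize_buggy(line):
    # Same expression as quoted from vocab.py: the tab is deleted outright,
    # so the words on either side of it are glued together before split().
    return line.replace("\n", "").replace("\t", "").split()


def tokenize_fixed(line):
    # Suggested change: turn the tab into a space so split() still sees
    # a word boundary there.
    return line.replace("\n", "").replace("\t", " ").split()


line = "how are you\tnice to meet you\n"

print(tokenize_buggy(line))
# ['how', 'are', 'younice', 'to', 'meet', 'you']

print(tokenize_fixed(line))
# ['how', 'are', 'you', 'nice', 'to', 'meet', 'you']
```

Since str.split() with no arguments already splits on any run of whitespace, including tabs and newlines, a plain line.split() would probably give the same tokens as the fixed version. Whether the tab should instead be kept as a sentence-pair separator is a separate question for the maintainer.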

codertimo commented 5 years ago

I'll update the vocab builder ASAP! thanx