Open NiHaoUCAS opened 5 years ago
I think this is a bug. The problem is on line 127 of vocab.py:
words = line.replace("\n", "").replace("\t", "").split()
Here \t is replaced by "" (an empty string). I think it should be replaced by a space.
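A minimal sketch of the difference, using only the line quoted above. The function names are hypothetical and chosen for illustration; `tokenize_plain_split` is an alternative that is not proposed in this thread, included because `str.split()` with no argument already splits on any whitespace, tabs included.

```python
def tokenize_original(line):
    # Behaviour of the quoted line: the tab is deleted, so the words
    # on either side of it are fused into one token.
    return line.replace("\n", "").replace("\t", "").split()


def tokenize_with_space(line):
    # Suggested change: turn the tab into a space before splitting.
    return line.replace("\n", "").replace("\t", " ").split()


def tokenize_plain_split(line):
    # Alternative (assumption, not from the thread): split() without
    # arguments treats tabs and newlines as separators on its own.
    return line.split()


line = "how are you\tnice to meet you\n"
print(tokenize_original(line))     # contains the fused token 'younice'
print(tokenize_with_space(line))   # 'you' and 'nice' stay separate
print(tokenize_plain_split(line))  # same tokens as tokenize_with_space
```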
I'll update the vocab builder ASAP! Thanks.
When the corpus is:
how are you \t nice to meet you
and I run the bert-vocab command, the resulting vocab is:
['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'you', 'are', 'how', 'meet', 'nice', 'to']
But when I change the corpus to:
how are you\tnice to meet you
the result is:
['<pad>', '<unk>', '<eos>', '<sos>', '<mask>', 'are', 'how', 'meet', 'to', 'you', 'younice']
The last token becomes younice. A space is needed on both sides of \t, so this may not be a bug.
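Both results can be reproduced with a small sketch. `build_vocab` and `SPECIALS` are hypothetical names, and the ordering rule (descending frequency, ties broken alphabetically) is an assumption inferred from the two lists in this comment, not taken from the project's code. Only the tokenizing expression comes from the issue.

```python
from collections import Counter

# Special tokens in the order they appear in the reported output.
SPECIALS = ['<pad>', '<unk>', '<eos>', '<sos>', '<mask>']


def build_vocab(lines):
    """Hypothetical stand-in for the vocab builder, for illustration only."""
    counter = Counter()
    for line in lines:
        # Tokenizing expression quoted in the issue: tabs are removed.
        counter.update(line.replace("\n", "").replace("\t", "").split())
    # Assumed ordering: alphabetical, then a stable sort by descending count.
    items = sorted(counter.items(), key=lambda kv: kv[0])
    items.sort(key=lambda kv: kv[1], reverse=True)
    return SPECIALS + [word for word, _ in items]


# Tab with a space on each side: removing the tab leaves the spaces.
print(build_vocab(["how are you \t nice to meet you\n"]))
# Tab with no surrounding spaces: 'you' and 'nice' fuse into 'younice'.
print(build_vocab(["how are you\tnice to meet you\n"]))
```

In the first corpus `you` is counted twice and sorts first; in the second every token occurs once, so the list is purely alphabetical and ends with `younice`.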