fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
https://gitee.com/fastnlp/fastNLP
Apache License 2.0

Bug in the Vocabulary class during build_vocab #252

Closed keezen closed 4 years ago

keezen commented 4 years ago

The Vocabulary class has a bug in build_vocab: when padding and unknown are the same token, the indices are assigned incorrectly. For example,

tag_vocab = Vocabulary(min_freq=None, padding="o", unknown="o").from_dataset(train_set, test_set, field_name="tags")

produces the following word2idx, in which two different tags share index 1:

print(tag_vocab.word2idx)
# {'o': 1, 'b':1, ...}

The buggy code is at https://github.com/fastnlp/fastNLP/blob/980aba9898d2c33689b88ad41f9cf173ef9e2e31/fastNLP/core/vocabulary.py#L212

I suggest changing the condition to:

if self.unknown is not None and self.unknown not in self._word2idx:
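
To illustrate why the extra check matters, here is a minimal sketch of the index-assignment step. It is a simplified stand-in written for this thread, not fastNLP's actual code: the function `build_word2idx` and its `guard_duplicate` flag are hypothetical, and it assumes each special token receives `len(word2idx)` as its index before the remaining words are numbered.

```python
from collections import Counter


def build_word2idx(tokens, padding, unknown, guard_duplicate):
    """Hypothetical, simplified index assignment (not fastNLP's real implementation)."""
    word2idx = {}
    if padding is not None:
        word2idx[padding] = len(word2idx)
    # With guard_duplicate=False, an unknown token equal to padding is
    # re-inserted and its index is overwritten with 1, while the dict
    # still holds only one entry.
    if unknown is not None and (not guard_duplicate or unknown not in word2idx):
        word2idx[unknown] = len(word2idx)
    # Remaining words are numbered from the current dict size, most frequent first.
    start_idx = len(word2idx)
    remaining = [w for w, _ in Counter(tokens).most_common() if w not in word2idx]
    word2idx.update({w: i + start_idx for i, w in enumerate(remaining)})
    return word2idx


tags = ["o", "o", "o", "b", "b", "i"]
buggy = build_word2idx(tags, padding="o", unknown="o", guard_duplicate=False)
fixed = build_word2idx(tags, padding="o", unknown="o", guard_duplicate=True)
print(buggy)  # {'o': 1, 'b': 1, 'i': 2}
print(fixed)  # {'o': 0, 'b': 1, 'i': 2}
```

Without the guard, 'o' and 'b' collide on index 1 because the dict size is still 1 after 'o' is written twice. With the guard, every tag gets a distinct index.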
keezen commented 4 years ago

@xpqiu @choosewhatulike

xuyige commented 4 years ago

OK, I'll put together a patch.