Closed keezen closed 4 years ago
The Vocabulary class has a bug in build_vocab: when padding and unknown are the same token, the indices come out wrong. For example,
tag_vocab = Vocabulary(min_freq=None, padding="o", unknown="o").from_dataset(train_set, test_set, field_name="tags")
produces the following word2idx, in which two different tokens share index 1 and index 0 is never assigned:
print(tag_vocab.word2idx) # {'o': 1, 'b':1, ...}
The buggy code is at https://github.com/fastnlp/fastNLP/blob/980aba9898d2c33689b88ad41f9cf173ef9e2e31/fastNLP/core/vocabulary.py#L212
I suggest changing the condition to:
if self.unknown is not None and self.unknown not in self._word2idx:
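To make the failure mode concrete, here is a minimal sketch of how the duplicate index can arise and how the suggested guard avoids it. This is not fastNLP's actual code: `build_index` and its `guard` flag are hypothetical names for illustration, and the sketch assumes the index is built by assigning `len(word2idx)` to each new token, which matches the output reported above.

```python
def build_index(words, padding=None, unknown=None, guard=False):
    """Toy stand-in for the index-building step (hypothetical, not fastNLP's code)."""
    word2idx = {}
    if padding is not None:
        word2idx[padding] = len(word2idx)
    # Without the guard, an unknown token equal to the padding token is
    # assigned a second time and its index is overwritten with 1.
    if unknown is not None and (not guard or unknown not in word2idx):
        word2idx[unknown] = len(word2idx)
    # Remaining tokens take the next free index, computed from the dict size.
    for w in words:
        if w not in word2idx:
            word2idx[w] = len(word2idx)
    return word2idx


buggy = build_index(["o", "b", "i"], padding="o", unknown="o")
fixed = build_index(["o", "b", "i"], padding="o", unknown="o", guard=True)
print(buggy)  # {'o': 1, 'b': 1, 'i': 2}
print(fixed)  # {'o': 0, 'b': 1, 'i': 2}
```

In the unguarded case the dict still holds only one entry after the overwrite, so the next token also receives index 1. With the membership check, the shared token keeps index 0 and later tokens are numbered without collisions.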
@xpqiu @choosewhatulike
OK, I'll put together a patch.