fxsjy / jieba

Jieba Chinese word segmentation
MIT License
33.41k stars 6.73k forks

Tokenizer.gen_pfdict method does not guard against duplicate entries. #977

Open ericlingit opened 2 years ago

ericlingit commented 2 years ago

When generating a prefix dictionary from dict.txt, the term frequencies of duplicate entries are still added to the running total, ltotal.

[Image: dup-in-dict-txt]

The entry B超 3 n appears twice in dict.txt, so its term frequency is added twice in the gen_pfdict() method. As a result, the returned total is too high by 3.

The sum of term frequencies should be 60,101,964, but 60,101,967 is returned.
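One possible guard is sketched below. It is a simplified, standalone illustration and not jieba's actual code: the function name gen_pfdict_dedup is hypothetical, it takes an iterable of already-decoded text lines instead of a file object, and it assumes that when a word is repeated the last entry should win. The idea is to subtract whatever was previously counted for a word before adding its new frequency, so the total always equals the sum of the stored frequencies.

```python
def gen_pfdict_dedup(lines):
    """Build a prefix dictionary from 'word freq [tag]' lines.

    Hypothetical helper for illustration only; not part of jieba's API.
    """
    lfreq = {}   # word or prefix -> frequency (prefix-only entries stay 0)
    ltotal = 0   # running sum of term frequencies
    for lineno, line in enumerate(lines, 1):
        line = line.strip()
        if not line:
            continue
        try:
            word, freq = line.split(" ")[:2]
            freq = int(freq)
        except ValueError:
            raise ValueError(
                "invalid dictionary entry at line %d: %s" % (lineno, line))
        # Guard: remove any frequency already counted for this word,
        # so a repeated entry does not inflate the total.
        ltotal += freq - lfreq.get(word, 0)
        lfreq[word] = freq
        # Register every proper prefix with frequency 0 if it is unseen.
        for end in range(1, len(word)):
            lfreq.setdefault(word[:end], 0)
    return lfreq, ltotal


# Made-up sample with the duplicated entry from this report.
sample = ["AT&T 3 nz", "B超 3 n", "B超 3 n"]
lfreq, ltotal = gen_pfdict_dedup(sample)
print(ltotal)  # 6, not 9
```

With this guard, ltotal stays equal to sum(lfreq.values()) even when dict.txt contains repeated lines. Whether the last or the first occurrence of a duplicate should win is a design choice this sketch does not settle for jieba itself.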