PyThaiNLP / pythainlp

Thai Natural Language Processing in Python.
https://pythainlp.org/
Apache License 2.0

bug: Warning: Duplicate word in word2vec file #887

Open bact opened 6 months ago

bact commented 6 months ago

Description

Running the unit tests emits hundreds of warnings like this one:

2023-12-11:03:40:47 WARNING  [gensim.models.keyedvectors:1909] duplicate word 'ต่าง' in word2vec file, ignoring all but first

Expected results

No warning.
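
Until the duplicate entries are removed from the model file itself, one stopgap is to filter these messages out of the test log. The sketch below is only an illustration, not a proposed fix: it uses the standard `logging` module, the logger name `gensim.models.keyedvectors` is taken from the warning lines in this report, and the class name `DuplicateWordFilter` is hypothetical.

```python
import logging


class DuplicateWordFilter(logging.Filter):
    """Drop "duplicate word ... in word2vec file" warnings (hypothetical helper)."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()
        is_duplicate_warning = (
            message.startswith("duplicate word") and "word2vec file" in message
        )
        # Returning False discards the record; all other records pass through.
        return not is_duplicate_warning


# Logger name as it appears in the warning lines of this report.
logging.getLogger("gensim.models.keyedvectors").addFilter(DuplicateWordFilter())
```

This only hides the symptom; the duplicate entries would still be present in the word2vec file and gensim would still ignore all but the first.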

Current results

(partial)

2023-12-11:03:40:47 WARNING  [gensim.models.keyedvectors:1909] duplicate word 'ต่าง' in word2vec file, ignoring all but first
2023-12-11:03:40:47 WARNING  [gensim.models.keyedvectors:1909] duplicate word ' ' in word2vec file, ignoring all but first
...
2023-12-11:03:40:57 WARNING  [gensim.models.keyedvectors:1909] duplicate word '' in word2vec file, ignoring all but first
2023-12-11:03:40:58 WARNING  [gensim.models.keyedvectors:1909] duplicate word 'หยับ' in word2vec file, ignoring all but first
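
To see which entries are duplicated without scrolling through the log, the vocabulary can be counted directly. This is a minimal sketch that assumes the model is available in word2vec text format (a `<vocab_size> <vector_size>` header line followed by one `<word> <v1> <v2> ...` line per entry); a binary model would need a different reader. The function name `find_duplicate_words` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def find_duplicate_words(lines: Iterable[str]) -> Dict[str, int]:
    """Return the words that occur more than once, with their counts.

    Assumes word2vec text format: the first line is a header and every
    following line starts with the word, followed by a space.
    """
    iterator = iter(lines)
    next(iterator, None)  # skip the "<vocab_size> <vector_size>" header
    counts: Counter = Counter()
    for line in iterator:
        # The word is everything before the first space on the line.
        word = line.rstrip("\n").split(" ", 1)[0]
        counts[word] += 1
    return {word: n for word, n in counts.items() if n > 1}
```

It can be fed an open file, for example `find_duplicate_words(open(path, encoding="utf-8"))`, where `path` points to the text-format model file.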

Steps to reproduce

Run the unit tests.

PyThaiNLP version

dev

Python version

3.8

Operating system and version

n/a

More info

No response

Possible solution

No response

Files

No response