ddangelov / Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.
BSD 3-Clause "New" or "Revised" License

`ngram_vocab=True` ignores single words in vocabulary #364

Open: mirorac opened this issue 17 hours ago

mirorac commented 17 hours ago

When `ngram_vocab=True` is used, single words appear to be dropped from the vocabulary. Earlier versions did not behave this way, so I wanted to check whether the change was intentional or an unintended regression.

Here’s the relevant line in the code:
https://github.com/ddangelov/Top2Vec/blob/2435731bc834f49aa22b38d46102bc37b960dffc/top2vec/top2vec.py#L890

Suggested fix:

`vocab += phrases`

Could merging the phrases with the previously built vocabulary resolve the issue, or is this the expected behavior in the latest version?

ddangelov commented 17 hours ago

It was intentional: when single words were kept in the vocabulary, they would often end up as the top topic words instead of the n-grams.
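
For readers comparing the two behaviors discussed in this thread, here is a minimal sketch in plain Python. It is not Top2Vec's actual code: `build_vocab`, `keep_single_words`, and the sample word lists are hypothetical names made up for illustration. It only shows the difference between using the phrases as the whole vocabulary and merging them with the single words.

```python
def build_vocab(single_words, phrases, keep_single_words):
    """Hypothetical helper (not part of Top2Vec): build the topic-word vocabulary."""
    if keep_single_words:
        # Merge single words and phrases, keeping first-seen order
        # and dropping any duplicates.
        return list(dict.fromkeys(single_words + phrases))
    # Phrase-only vocabulary: single words are left out entirely.
    return list(phrases)


# Made-up sample data, for illustration only.
single_words = ["machine", "learning", "topic", "model"]
phrases = ["machine learning", "topic model"]

phrases_only = build_vocab(single_words, phrases, keep_single_words=False)
merged = build_vocab(single_words, phrases, keep_single_words=True)

print(phrases_only)
print(merged)
```

With a merged vocabulary, every single word remains a candidate topic word alongside the n-grams, which is the situation the maintainer describes wanting to avoid.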