ddangelov / Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.
BSD 3-Clause "New" or "Revised" License

Topics collapsed to small numbers with more documents #6

Closed Benja1972 closed 4 years ago

Benja1972 commented 4 years ago

Hello. With the new version, which adds the option of building a topic hierarchy, the number of topics can unexpectedly collapse to just a few.

Here is an example. I created a corpus with two types of documents from different domains, 'computer vision' and 'genomics'; each set contains around 3000 documents. The code is below.

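# Imports were not shown in the original post; these are the assumed ones.
from os.path import join

from nltk.stem import WordNetLemmatizer
from top2vec import Top2Vec


def lemma(docs):
    """Lemmatize every token of every tokenized document."""
    lemmatizer = WordNetLemmatizer()
    return [[lemmatizer.lemmatize(token) for token in doc] for doc in docs]


path = "../data"

f_in = join(path, "cv.txt")  # computer vision documents, one per line
f_ig = join(path, "gn.txt")  # genomics documents, one per line

# Corpus =================
docs = []
print('Loading docs ...')
for fname in (f_in, f_ig):
    with open(fname, 'r') as fin:
        for doc in fin:
            # Split on single spaces and drop line-ending characters.
            docs.append([dc.strip('\r\n') for dc in doc.split(' ')])

# Lemmatize the documents.
docs = lemma(docs)

# Top2Vec expects plain strings, so join the tokens back together.
texts = [' '.join(doc) for doc in docs]

# Model ======================
model = Top2Vec(documents=texts, speed='deep-learn', workers=8)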

With the previous version it finds 59 topics, which is what I expect, since a single-domain set gives around 30 topics. With the new version with hierarchy, it gives only 2 topics: all topics collapsed into high-level clusters.

ddangelov commented 4 years ago

With a small number of documents (<10,000), especially if they are short, this may happen. This is due to the stochastic nature of doc2vec and UMAP. Have you tried running it multiple times? This shouldn't be due to the new version. Also, did you recently use version 1.0.8, or was it a different version?
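
One way to check the run-to-run variation is to train several models on the same texts and compare the topic counts. The following is only a minimal sketch: the helper names `topic_counts_over_runs`, `fit_top2vec` and `spread` are made up for illustration, the parameters mirror the call in the original post, and `get_num_topics()` is used to read the topic count from each model.

```python
from typing import Callable, List, Sequence, Tuple


def topic_counts_over_runs(
    fit: Callable[[Sequence[str]], int],
    texts: Sequence[str],
    n_runs: int = 5,
) -> List[int]:
    """Call `fit` n_runs times on the same texts and collect the topic counts."""
    return [fit(texts) for _ in range(n_runs)]


def fit_top2vec(texts: Sequence[str]) -> int:
    """Train one Top2Vec model and return how many topics it found."""
    # Imported lazily so the helpers can be used without top2vec installed.
    from top2vec import Top2Vec

    # Same settings as in the original post (assumed, not a recommendation).
    model = Top2Vec(documents=list(texts), speed="deep-learn", workers=8)
    return model.get_num_topics()


def spread(counts: Sequence[int]) -> Tuple[int, int]:
    """Return the smallest and largest topic count seen across runs."""
    return min(counts), max(counts)
```

With `texts` built as in the post, `counts = topic_counts_over_runs(fit_top2vec, texts)` followed by `print(counts, spread(counts))` would show whether 2 topics is a one-off or happens on every run. Each run trains a full model, so this can take a while.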

Benja1972 commented 4 years ago

I will check the version, but I noticed this behavior as soon as I started using the version from GitHub with hierarchy. Before that, I experimented with different sets of documents and the number of topics varied only a little, for example from 30 to 34. Now I see a huge difference: the same set of documents produces 60 topics with the old version and 2 with the new one. I will try running the new version several times to see whether any run gives around 60 topics.
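
For the version check, a small sketch using the standard library (Python 3.8+) is shown here. It assumes the package was installed under the distribution name `top2vec`; `installed_version` is a hypothetical helper name. An install from the GitHub source may report the same version string as a release, so the result is only a hint.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed version string, or None if top2vec is not installed.
print(installed_version("top2vec"))
```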

Benja1972 commented 4 years ago

I will close the issue; I need to do more testing and understand this better. Thank you for the feedback.