Benja1972 closed this issue 4 years ago
With a small number of documents (<10,000), especially if they are short, this may happen. It is due to the stochastic nature of doc2vec and UMAP. Have you tried running it multiple times? This shouldn't be caused by the new version. Also, did you recently use version 1.0.8, or was it a different version?
I will check the version, but I have noticed this type of behavior ever since I started using the version from GitHub with hierarchy. Before that, I played with different sets of documents and the number of topics varied only a little, e.g. from 30 to 34. Now I see a huge difference: the same set of documents produces 60 topics with the old version and 2 with the new one. I will run the new version several times to see whether any run gives around 60 topics.
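A repeated-run check like the one described above could be sketched roughly as follows. This is only an illustration, not code from the issue: `topic_count_spread` and `top2vec_topic_count` are hypothetical helper names, and the Top2Vec call assumes the `top2vec` package is installed and is run with its default settings.

```python
from collections import Counter
from typing import Callable, Sequence


def topic_count_spread(count_topics: Callable[[Sequence[str]], int],
                       documents: Sequence[str],
                       n_runs: int = 5) -> Counter:
    """Fit a model n_runs times and tally how often each topic count occurs.

    count_topics is any callable that trains a fresh model on the
    documents and returns the number of topics it found.
    """
    counts: Counter = Counter()
    for _ in range(n_runs):
        counts[count_topics(documents)] += 1
    return counts


def top2vec_topic_count(documents: Sequence[str]) -> int:
    """Train one Top2Vec model and return its number of topics.

    Assumption: top2vec is installed. It is imported lazily so the
    helper above stays usable without it.
    """
    from top2vec import Top2Vec

    model = Top2Vec(documents=list(documents))
    return model.get_num_topics()
```

Calling `topic_count_spread(top2vec_topic_count, docs)` on the same corpus would then show how much the topic count moves between runs. If every run lands on 2 and none comes near 60, run-to-run randomness alone is an unlikely explanation.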
I will close the issue; I need to do more testing and understand this better. Thank you for the feedback.
Hello. With the new version, where you added the possibility of building a hierarchy, the number of topics can strangely collapse to only a few.
Here is an example. I create a corpus with two types of documents from different domains, 'computer vision' and 'genomics'; each set contains around 3000 documents. The code is below.
With the previous version it finds 59 topics, as expected, since a single-domain set gives around 30 topics. With the new version with hierarchy, it gives only 2 topics. All topics have collapsed into high-level clusters.