-
The goal is to auto-detect a coherent cluster of “hard” examples (i.e., a data slice) where a model's predictions are poor. Cf.:
https://dcai.csail.mit.edu/lectures/data-centric-evaluation/
This should be a […
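One way to picture the slice-ranking step is the minimal sketch below. It assumes each example already has a cluster id (from any clustering of embeddings) and a flag saying whether the model predicted it correctly. The function name `rank_slices`, the `min_size` threshold, and the toy data are hypothetical and do not come from the linked lecture.

```python
from collections import defaultdict


def rank_slices(cluster_ids, is_correct, min_size=2):
    """Group examples by cluster id and rank clusters by error rate, worst first.

    Returns a list of (cluster_id, error_rate, size) tuples.
    Clusters smaller than `min_size` are skipped because their error
    rates are too noisy to be meaningful.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for cid, ok in zip(cluster_ids, is_correct):
        totals[cid] += 1
        if not ok:
            errors[cid] += 1

    ranked = [
        (cid, errors[cid] / totals[cid], totals[cid])
        for cid in totals
        if totals[cid] >= min_size
    ]
    # Highest error rate first: these are the candidate "hard" slices.
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Toy data (hypothetical): cluster assignments and per-example correctness.
cluster_ids = [0, 0, 0, 1, 1, 1, 1, 2]
is_correct = [True, True, False, False, False, False, True, False]
slices = rank_slices(cluster_ids, is_correct)
```

Here cluster 1 comes out on top because three of its four examples are wrong, while the singleton cluster 2 is dropped by the size threshold.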
-
When using the embedding_output.sav file to do exploratory analysis on clusters found from unsupervised learning, I tried to open the file via SPSS. How are others extracting information from this fil…
pozel updated
9 months ago
-
I have trained a BERTopic model in the following way, given a vocabulary of keywords:
```
vectorizer_model = CountVectorizer(vocabulary=vocabulary)
sentence_model = SentenceTransformer("distiluse…
```
-
Selected Weibo content
-
Hello, @MaartenGr
I have been using the BERTopic algorithm and have noticed that the number of documents classified as topic -1 is quite high, ranging from 30% to 50% of the total documents. …
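To make the outlier share concrete, here is a small sketch that measures it from a list of topic assignments. It assumes, as in the question, that the label -1 marks outlier documents. The helper `outlier_share` and the toy `topics` list are hypothetical and not part of any library.

```python
from collections import Counter


def outlier_share(topics, outlier_label=-1):
    """Return the fraction of documents assigned to the outlier label."""
    if not topics:
        return 0.0
    counts = Counter(topics)
    return counts[outlier_label] / len(topics)


# Toy topic assignments (hypothetical), one entry per document.
topics = [-1, 0, 0, 1, -1, 2, -1, 0, 1, -1]
share = outlier_share(topics)
```

Tracking this number before and after a parameter change gives a quick check on whether the change actually reduced the outliers.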
-
## QUESTION:
I want to create a BERTopic model architecture that will be able to extract topics from any list of documents and still give reasonable results when fitted to said documents. Is it eve…
-
Hello MaartenGr, I did not set the parameter nr_topics when using BERTopic to process my data (30000 entries). In the end, 512 topics were obtained, but a lot of data (10000 items) were classified as…
-
1. I want to know why I get different results (topics, etc.) when I run BERTopic multiple times. I am also interested in the theoretical point of view; I guess it has something to do with random p…
-
Hello,
I am working with a very large corpus of around 3M documents. Thus, I wanted to increase the min_cluster_size in HDBSCAN to 500 to decrease the number of topics. Moreover, small topics with …
-
Hi,
I'm using cuML since I have a large dataset, around 1 million Reddit posts.
When I use standard methods and parameters as below, I have kind of ok results, but with too many outliers (aroun…