Open morrisseyj opened 4 months ago
Hmmm, it seems the topics are correctly assigned but re-ordered internally so that the more frequent topics get lower values (they are sorted, I think). So no worries: they are not actually different classifications, merely different IDs for the same topics. This means you can do one of two things. First, you can map `training_topic_number` to `model_topic_number`, since, for instance, 4 always seems to map to 5. Second, I am not certain, but I recall that manual BERTopic does not perform this sorting step.
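If the re-ordering is indeed by descending topic frequency, the mapping can be derived directly from the training labels. A minimal sketch (the label values below are made up for illustration; `-1` is assumed to be BERTopic's outlier ID and is left untouched):

```python
from collections import Counter

# Hypothetical training labels; the assumption is that topics are renumbered
# so that more frequent topics receive lower IDs, with -1 reserved for outliers.
training_topics = [4, 4, 4, 2, 2, 7, 7, 7, 7, -1]

# Rank topics by frequency (most frequent first), excluding outliers.
counts = Counter(t for t in training_topics if t != -1)
ranked = [t for t, _ in counts.most_common()]

# Map each original topic ID to its frequency rank.
mapping = {old: new for new, old in enumerate(ranked)}
mapping[-1] = -1  # outliers keep their ID

model_topics = [mapping[t] for t in training_topics]
print(mapping)       # {7: 0, 4: 1, 2: 2, -1: -1}
print(model_topics)  # [1, 1, 1, 2, 2, 0, 0, 0, 0, -1]
```

Applying the inverse of `mapping` to the model's output would then recover the original training numbering.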
@MaartenGr thanks for the response. I had already sorted the data by count when assigning the topic numbers, so that couldn't have been the problem. Going back through and checking, it turns out there was an error on my side in the way I had amalgamated topics earlier on. Your explanation made me check more closely and catch this. Apologies for missing it the first time around and wasting your time. I really appreciate the library you have put together here.
No problem! I'm just glad that your problem was resolved and thanks for the kind words 😄
I am trying to run a supervised topic model, but when I look at the results, the model produces topic numbers that are different from those I trained it on. Am I misunderstanding something? I thought the supervised model would produce exactly the same results as the training data; I appreciate that results for test data will depend on the accuracy of the model.
Here is some sample code to show the problem.
Now I look to compare the model generated topic numbers with the original topic numbers:
Gives:
The reason for doing all this is that I am analyzing social media (Reddit) data (on Amazon). The data is full of re-posts that distort my clusters, so I generate unique posts before modelling. However, I also want to look at (and topic model) the comments that flow from posts in each post-cluster. Some of those comments sit in re-posts that were initially excluded. So what I am trying to do here is generate the topic numbers for the full data (including the re-posts). The steps are essentially:

1. Clean the data to get unique documents.
2. Model the topics (unsupervised).
3. Use the derived topics to train a classifier (supervised).
4. Run the classifier on the whole dataset (i.e. including re-posts).

My assumption was that all the training data would be correctly categorized, as would any "test data" identical to the training data.
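Since the re-posts are exact duplicates of documents that were already modelled, one way to sketch steps 1 and 4 is a plain text-to-topic lookup; a trained classifier is only strictly needed for documents that are *not* identical to a training document. A minimal sketch with made-up posts and placeholder topic IDs (in practice these would come from the unsupervised model):

```python
# Hypothetical sketch of the dedupe-then-propagate workflow described above.
posts = [
    "great deal on amazon",
    "great deal on amazon",   # re-post of the first document
    "delivery was late",
    "prime is worth it",
]

# Step 1: keep unique documents only (dict.fromkeys preserves first-seen order).
unique_posts = list(dict.fromkeys(posts))

# Step 2: topics for the unique posts (placeholder for the unsupervised model's
# output; real IDs would come from fitting on unique_posts).
unique_topics = {"great deal on amazon": 0,
                 "delivery was late": 1,
                 "prime is worth it": 2}

# Step 4: assign topics to the full dataset; identical re-posts inherit the
# topic of the unique document they duplicate.
full_topics = [unique_topics[p] for p in posts]
print(full_topics)  # [0, 0, 1, 2]
```

Under this lookup, training documents (and their verbatim re-posts) are guaranteed to keep their training topic numbers, which is the behavior the question expects from the classifier.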
What the above shows me, however, is that the model generates topic classifications different from those in the data it was trained on. This means the "test" data won't be classified correctly. Is this expected behavior?