MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

Supervised topic model generating different topics to training data #1983

Open morrisseyj opened 4 months ago

morrisseyj commented 4 months ago

I am trying to run a supervised topic model, but when I look at the results, the model produces topic numbers that are different from those I trained it on. Am I misunderstanding something? I thought the supervised model would reproduce the training labels exactly; I appreciate that results for test data will depend on the accuracy of the model.

Here is some sample code to show the problem.

# Note: I have a dataframe "combined_df_clean_doc_info" that contains the training docs and target topic numbers.

# Import the relevant libraries

import pandas as pd

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.dimensionality import BaseDimensionalityReduction
from sklearn.linear_model import LogisticRegression

# Get the data for training the supervised model - the documents and the topic numbers

training_titles = combined_df_clean_doc_info["Document"].to_list()
training_topic_numbers = combined_df_clean_doc_info["Topic"].to_list()

# Skip over dimensionality reduction, replace cluster model with classifier,
# and reduce frequent words while we are at it.

empty_dimensionality_model = BaseDimensionalityReduction()
clf = LogisticRegression()
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Create a fully supervised BERTopic instance
manual_topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=clf,
    ctfidf_model=ctfidf_model,
)

topics, probs = manual_topic_model.fit_transform(training_titles, y=training_topic_numbers)

Now I look to compare the model generated topic numbers with the original topic numbers:

pd.DataFrame({
    "training_title": training_titles,                # i.e. the training titles
    "training_topic_number": training_topic_numbers,  # i.e. the training topics
    "model_topic_title": manual_topic_model.get_document_info(training_titles)["Document"],
    "model_topic_number": manual_topic_model.get_document_info(training_titles)["Topic"],
})

Gives:


training_title              training_topic_number   model_topic_title           model_topic_number
0 !!CALL OSHA!! Oregon Amazon warehouse workers     4   !!CALL OSHA!! Oregon Amazon warehouse workers       5
1 " She described physical “misery” from walking... 4   " She described physical “misery” from walking...   5
2 "#PrimeDay makes this one of the most dangerou... 4   "#PrimeDay makes this one of the most dangerou...   5
3 "...Amazon workers say intense focus on speed ... 4   "...Amazon workers say intense focus on speed ...   5
4 "50 to 100" Amazon workers are trapped under r... 4   "50 to 100" Amazon workers are trapped under r...   5
... ... ... ... ...
8490 “It’s sheer slavery” Amazon warehouse worker i...  2   “It’s sheer slavery” Amazon warehouse worker i...   3
8491 “I’m an Amazon Warehouse Worker. This Crisis I...  2   “I’m an Amazon Warehouse Worker. This Crisis I...   3
8492 “The Only Amazon Prime Day Guide You’ll Need” ...  2   “The Only Amazon Prime Day Guide You’ll Need” ...   3
8493 “Why don’t you get a job at Amazon instead?”   -1  “Why don’t you get a job at Amazon instead?”        -1
8494 米Amazonの倉庫作業員8人が新型コロナで死亡       2   米Amazonの倉庫作業員8人が新型コロナで死亡            3

The reason behind all this is that I am analyzing social media (Reddit) data about Amazon. The data is full of re-posts that distort my clusters, so I generate unique posts before modelling. However, I also want to look at (and topic-model) the comments that flow from posts in each post-cluster, and some of those comments sit under re-posts that were initially excluded. So what I am trying to do here is generate topic numbers for the full data, including the re-posts. The steps are essentially: clean the data to get unique documents; model the topics (unsupervised); use the derived topics to train a classifier (supervised); run the classifier on the whole dataset (i.e. including re-posts). My assumption was that all the training data would be correctly categorized, as would any "test" data that is identical to the training data.

What the above shows me, however, is that the model generates topic classifications different from the data it was trained on, which means the "test" data won't be classified correctly. Is this expected behavior?
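That workflow can be sketched with a simplified stand-in: TF-IDF plus LogisticRegression in place of BERTopic, and made-up titles and topic IDs. The point is only the structure of the pipeline, since a deterministic classifier trained on the unique posts must give an identical re-post the same label as its unique counterpart:

```python
# Sketch of the dedupe -> train -> classify-everything pipeline.
# TfidfVectorizer + LogisticRegression stand in for BERTopic here;
# the titles and topic IDs are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

unique_posts = [
    "warehouse injury report",
    "prime day deals roundup",
    "union vote news update",
]
topic_ids = [0, 1, 2]  # topic IDs from the unsupervised pass

# Train the supervised classifier on the unique posts only
vec = TfidfVectorizer().fit(unique_posts)
clf = LogisticRegression().fit(vec.transform(unique_posts), topic_ids)

# The full dataset includes re-posts that were excluded from training
all_posts = unique_posts + ["prime day deals roundup", "warehouse injury report"]
all_topics = clf.predict(vec.transform(all_posts))

# Identical inputs produce identical predictions, so each re-post
# receives the same label as the unique post it duplicates.
```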

MaartenGr commented 4 months ago

Hmmm, it seems that the topics are correctly assigned but re-ordered internally so that more frequent topics get lower values (they are sorted, I think). So no worries: they are not actually different classifications, merely different IDs for the same topics. This means you can do one of two things. First, you can map training_topic_number to model_topic_number since, for instance, 4 always seems to map to 5. Second, I am not sure, but I recall that manual BERTopic does not perform this sorting step.
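If the IDs really were only re-ordered, the first suggestion amounts to building a lookup from the paired assignments. A sketch with toy lists (in practice these would come from the two topic-number columns of the comparison DataFrame in the question):

```python
# Recover the old-ID -> new-ID mapping from paired topic assignments.
# Toy lists here; the real values would be the "training_topic_number"
# and "model_topic_number" columns from the comparison DataFrame.
training_ids = [4, 4, 2, -1, 2]
model_ids = [5, 5, 3, -1, 3]

# Since each old ID always pairs with the same new ID, zip gives the map
old_to_new = dict(zip(training_ids, model_ids))  # {4: 5, 2: 3, -1: -1}

# Remap any list of original topic IDs into the model's numbering
remapped = [old_to_new[t] for t in training_ids]
```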

morrisseyj commented 4 months ago

@MaartenGr thanks for the response. I had already sorted the data by count when assigning the topic numbers, so that couldn't have been the problem. Going back through and checking, it turns out there was an error on my side in the way I had amalgamated topics earlier on. Your explanation made me check more closely and catch this. Apologies for missing it the first time around and wasting your time. I really appreciate the library you have put together here.

MaartenGr commented 4 months ago

No problem! I'm just glad that your problem was resolved and thanks for the kind words 😄