MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License

How to retrieve indexes of representative docs? #1576

Open clstaudt opened 11 months ago

clstaudt commented 11 months ago

The get_topic_info() method returns a dataframe with the column Representative_Docs, in which we find the content of documents as strings. How can I link them back to the training set? Can I retrieve their index in the list of training documents?

MaartenGr commented 11 months ago

Unfortunately, that is not easily possible without matching the documents themselves against the representative documents. Other than that, you could take a look at the internal _extract_representative_docs function that creates the representative documents. It returns several things, including, I believe, the indices.
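The string-matching route mentioned above can be sketched in plain Python. This is only a minimal sketch and not part of BERTopic: build_index_lookup and find_doc_indices are hypothetical helper names, and the sketch assumes the representative documents come back as exact, unmodified copies of the training strings. A document that occurs more than once maps to every position at which it occurs.

```python
from collections import defaultdict


def build_index_lookup(docs):
    """Map each document string to the list of positions where it occurs."""
    lookup = defaultdict(list)
    for position, text in enumerate(docs):
        lookup[text].append(position)
    return lookup


def find_doc_indices(representative_docs, docs):
    """Return, for each representative document, its positions in docs.

    Hypothetical helper: it relies on exact string equality, so it returns
    an empty list for any string that does not appear in docs.
    """
    lookup = build_index_lookup(docs)
    return [lookup.get(text, []) for text in representative_docs]


# Toy data standing in for the training documents and for one topic's
# Representative_Docs entry.
docs = ["cats purr", "dogs bark", "cats purr", "birds sing"]
representative_docs = ["dogs bark", "cats purr", "fish swim"]
indices = find_doc_indices(representative_docs, docs)
```

Building the lookup once keeps the matching linear in the number of documents, instead of calling docs.index() for every representative document, which rescans the list each time and only reports the first match.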

clstaudt commented 11 months ago

Are the representative docs special (e.g. cluster centers or similar) or just random samples of documents from that topic?

MaartenGr commented 11 months ago

They are calculated by taking a random subset (500 documents) from each cluster and calculating their c-TF-IDF representations. Then, the cosine similarity between each of those documents and the c-TF-IDF representation of its topic is calculated. The most similar documents are selected and a small amount of diversity is applied to prevent duplicates. You can find the full code here:

https://github.com/MaartenGr/BERTopic/blob/62e97ddea6cdcf9e4da25f9eaed478b22a9f9e20/bertopic/_bertopic.py#L3441
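The selection procedure described above can be approximated in a few lines of plain Python. This is a rough sketch under stated assumptions, not BERTopic's implementation: raw term counts stand in for c-TF-IDF, the diversity step is left out, and the names top_representative_indices, nr_samples, nr_repr_docs and seed are chosen for illustration only. It returns positions in the original document list, which is the piece the question is after.

```python
import math
import random
from collections import Counter


def term_counts(text):
    """Bag-of-words counts; a simplified stand-in for a c-TF-IDF vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_representative_indices(docs, topics, topic,
                               nr_samples=500, nr_repr_docs=3, seed=0):
    """Return positions in docs of the documents most similar to the topic."""
    # Positions of every document assigned to this topic.
    members = [i for i, t in enumerate(topics) if t == topic]

    # Topic-level vector: term counts pooled over all member documents.
    topic_vector = Counter()
    for i in members:
        topic_vector.update(term_counts(docs[i]))

    # Score only a random subset when the topic is large.
    if len(members) > nr_samples:
        members = random.Random(seed).sample(members, nr_samples)

    ranked = sorted(
        members,
        key=lambda i: cosine_similarity(term_counts(docs[i]), topic_vector),
        reverse=True,
    )
    return ranked[:nr_repr_docs]


docs = ["cats purr and cats nap", "dogs bark", "cats purr",
        "stock prices fell", "cats chase mice"]
topics = [0, 1, 0, 2, 0]
best = top_representative_indices(docs, topics, topic=0, nr_repr_docs=2)
```

Because the function carries positions through the sampling and ranking steps, no string matching is needed afterwards.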

taylorshobe commented 5 months ago

What's wrong with this approach?:

# Create a single-column pandas DataFrame of the unique document IDs, which were assigned upstream, before BERTopic.
# Each document (row) has its own unique Doc_ID key.
document_ids = document_df[['Doc_ID']]

# Reset the pandas index to ensure it starts at zero.
document_ids = document_ids.reset_index(drop=True)

# Calculate the embeddings.
# Run BERTopic on the documents.

# Append the resulting topics (cluster IDs) back to the document IDs.
topics_df = pd.DataFrame(topics, columns=['Topics'])
document_ids['Topics'] = topics_df['Topics']

# Or, as an alternative method:
new_df = pd.merge(document_ids, topics_df, left_index=True, right_index=True, how='inner')

# LEFT JOIN new_df back to the original document_df on key = Doc_ID ... etc.

I'm proposing this approach because, in my case, my documents go through a data-prepping and filtering process in which some document rows (sentences) don't survive to be processed downstream in BERTopic. This explains the reset_index() step: the original sequential index becomes disjoint along the way, whereas the BERTopic cluster index is not disrupted.

That being said, I am curious whether the docs.index() function can also be used to append BERTopic results back to the original documents dataframe, for each separate document (row).

MaartenGr commented 5 months ago

> What's wrong with this approach?:

I'm not seeing any error, so this might work. Do note that I purposefully showcased the _extract_representative_docs method since it does not need strings to be matched. Preprocessing should not be relevant here, since we are merely interested in the indices that are returned, which you can then match to the original documents that you used before processing.
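To illustrate matching positional indices back to documents that were filtered beforehand, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: records, kept_ids, topics and representative_positions are toy values, and the positions stand in for whatever indices the topic model reports rather than being the output of any BERTopic call. The idea is simply to keep a list of original IDs in the same order as the documents passed to the model.

```python
# Original records, with IDs assigned upstream, before any filtering.
records = [
    {"Doc_ID": "a1", "text": "cats purr"},
    {"Doc_ID": "a2", "text": ""},  # dropped by preprocessing
    {"Doc_ID": "a3", "text": "dogs bark"},
    {"Doc_ID": "a4", "text": "cats nap"},
]

# Filtering produces two parallel lists: the texts given to the model and
# the original IDs, in the same order.
kept = [r for r in records if r["text"].strip()]
kept_ids = [r["Doc_ID"] for r in kept]
docs = [r["text"] for r in kept]

# Toy stand-ins for model output (assumed values, not real results):
topics = [0, 1, 0]                 # one topic per entry of docs
representative_positions = [2, 0]  # positions within docs

# Positions refer to docs, so the parallel ID list translates them back.
topic_by_id = dict(zip(kept_ids, topics))
representative_ids = [kept_ids[p] for p in representative_positions]

# Equivalent of a left join onto the original records; rows that were
# filtered out get no topic.
joined = [{**r, "Topic": topic_by_id.get(r["Doc_ID"])} for r in records]
```

The same alignment is what reset_index(drop=True) achieves in pandas: once the filtered rows are renumbered from zero, row position and model output position coincide.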