Open clstaudt opened 11 months ago
Unfortunately, that is not easily possible without having to match the documents themselves with the representative documents. Other than that, you could take a look at the internal _extract_representative_docs function that creates the representative documents. It returns a number of things, among which the indices, I believe.
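For the string-matching route mentioned above, a minimal sketch could look like the following. It uses only plain Python; the helper names build_index_lookup and locate and the sample documents are hypothetical and not part of BERTopic. It assumes the representative documents are returned as exact copies of the training strings.

```python
from collections import defaultdict


def build_index_lookup(docs):
    """Map each document string to every position where it occurs in docs."""
    lookup = defaultdict(list)
    for position, text in enumerate(docs):
        lookup[text].append(position)
    return lookup


def locate(representative_docs, docs):
    """Return, for each representative document, its positions in docs."""
    lookup = build_index_lookup(docs)
    # .get() avoids creating empty entries for strings that are not found.
    return {text: lookup.get(text, []) for text in representative_docs}


# Hypothetical training documents; note the duplicate at positions 0 and 3.
docs = ["cats purr", "dogs bark", "birds sing", "cats purr"]
matches = locate(["cats purr", "birds sing"], docs)
print(matches)
```

Unlike list.index(), which returns only the first match, this keeps every position, so duplicated documents remain visible and can be resolved by hand.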
Are the representative docs special (e.g. cluster centers or similar) or just random samples of documents from that topic?
They are calculated by taking a random subset (500 documents) from each cluster and calculating their c-TF-IDF representations. Then, their cosine similarity is calculated with respect to the topic c-TF-IDF matrices. The most similar documents are selected and a small diversity is applied to prevent duplicates. You can find the full code here:
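The ranking step described above can be illustrated with a simplified sketch. This is not BERTopic's implementation: raw term counts stand in for c-TF-IDF, the diversity step is left out, and the function name representative_indices and its parameters are assumptions made for this example. What it does show is that the selection can be carried out on indices, so the result points straight back into the training list.

```python
import math
import random
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def representative_indices(docs, topics, topic, nr_samples=500, nr_repr_docs=3, seed=0):
    """Return indices into docs of the sampled documents most similar to the topic."""
    members = [i for i, t in enumerate(topics) if t == topic]
    # Random subset of the cluster (at most nr_samples documents).
    sample = random.Random(seed).sample(members, min(nr_samples, len(members)))
    # Topic vector: term counts over the whole cluster (stand-in for c-TF-IDF).
    topic_vec = Counter(w for i in members for w in docs[i].lower().split())

    def score(i):
        return cosine(Counter(docs[i].lower().split()), topic_vec)

    return sorted(sample, key=score, reverse=True)[:nr_repr_docs]


# Hypothetical documents and topic assignments.
docs = ["cats purr softly", "cats purr", "dogs bark loudly", "cats and dogs"]
topics = [0, 0, 1, 0]
top = representative_indices(docs, topics, topic=0, nr_repr_docs=2)
print(top, [docs[i] for i in top])
```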
What's wrong with this approach?:
# Create a single-column pandas DataFrame; unique document IDs are assigned upstream, before BERTopic
# Each document (row) has its own unique Doc_ID key
document_ids = document_df[['Doc_ID']]
# Reset the pandas index to ensure the DataFrame index starts at zero
document_ids = document_ids.reset_index(drop=True)
# Calculate embeddings, then run BERTopic on the documents
# Append the resulting topics (cluster IDs) back to the original documents
topics_df = pd.DataFrame(topics, columns=['Topics'])
document_ids['Topics'] = topics_df['Topics']
# --- or alternative method ---
new_df = pd.merge(document_ids, topics_df, left_index=True, right_index=True, how='inner')
# LEFT JOIN 'new_df' back to the original document_df, on key = Doc_ID .... etc
I'm proposing this approach because, in my case, my documents go through a data-prepping and filtering process in which some document rows (sentences) don't survive to be processed downstream in BERTopic. This explains the reset_index() step: the original sequential indexing gets disrupted and becomes disjoint along the way, whereas the BERTopic cluster index is not disrupted.
That being said, I am curious whether docs.index() can also be used to append BERTopic results back to the original documents dataframe, for each separate document (row).
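The positional alignment described in this comment can be sketched without pandas as follows. The Doc_ID values, the filtering rule (dropping empty texts), and the hard-coded topics list are all hypothetical; topics stands in for the model output and is assumed to contain one entry per surviving document, in the same order.

```python
# Hypothetical original table: Doc_ID -> text. Some rows are dropped by filtering.
original = {"A1": "cats purr", "A2": "", "A3": "dogs bark", "A4": "birds sing"}

# Filtering step (assumed here: drop empty texts). Keep the IDs alongside the texts
# so that position i in kept_docs always corresponds to kept_ids[i].
kept = [(doc_id, text) for doc_id, text in original.items() if text.strip()]
kept_ids = [doc_id for doc_id, _ in kept]
kept_docs = [text for _, text in kept]

# Stand-in for the topics returned by the model for kept_docs (one per document).
topics = [0, 1, 0]
if len(topics) != len(kept_docs):
    raise ValueError("topics and documents are not aligned")

# Pair each surviving Doc_ID with its topic by position.
topic_by_id = dict(zip(kept_ids, topics))

# "Left join" back to the original rows; filtered-out rows get None.
joined = {doc_id: topic_by_id.get(doc_id) for doc_id in original}
print(joined)
```

Joining on the Doc_ID key rather than on the row index means that gaps left by filtering do not matter, which is the same purpose the reset_index() step serves in the pandas version.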
What's wrong with this approach?:
I'm not seeing any error, so this might work. Do note that I purposefully showcased the _extract_representative_docs method since that does not need strings to be matched. Preprocessing here should not be relevant since we are merely interested in the indices that are returned, which you can then match to the original documents that you used before processing.
The get_topic_info() method returns a dataframe with the column Representative_Docs, in which we find the content of documents as strings. How can I link them back to the training set? Can I retrieve their index in the list of training documents?