MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.79k stars 721 forks

get_representative_docs #245

Closed doubianimehdi closed 2 years ago

doubianimehdi commented 2 years ago

Hi,

Here's the error I have :

TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_13832/1322936418.py in <module>
----> 1 representative_docs = topic_model.get_representative_docs()

TypeError: get_representative_docs() missing 1 required positional argument: 'topic'

Moreover, when I specify a topic, I get:

["Gram-Scale Syntheses and Conductivities of [10]Cycloparaphenylene and Its Tetraalkoxy Derivatives [10]Cycloparaphenylene ([10]CPP) and its tetraalkoxy derivatives were synthesized on the gram scale in 7 steps starting from 1,4-benzoquinone or 2,5-dialkoxy-1,4-benzoquinone. The key steps involve the highly cis-selective bis-addition of 4-bromo-4'-lithiobiphenyl to the quinones to produce a five-ring unit containing cyclohexa-1,4-diene-3,6-diol moiety, the platinum-mediated dimerization of the five-ring unit, and the H2SnCl4-mediated reductive aromatization of cyclohexadienediol. The tetraalkoxy substituents increased the solubility of [10]CPP in common organic solvents. The carrier transport properties of thin films of [10]CPP and its derivatives were measured for the first time and indicated that [10]CPP derivatives could rival phenyl-C-61-butyric acid methyl ester, which is used widely as an n-type active layer in bulk heterojunction photovoltaics.",
 "Synthesis and Structures of Zigzag Shaped [12]Cyclo-p-phenylene Composed of Dinaphthofuran Units and Biphenyl Units A [12]Cyclo-p-phenylene 9 composed of dinaphthofuran units and biphenyl units was synthesized through reductive elimination of the corresponding trinuclear complex by applying Yamago's method. The X-ray crystallographic analyses of 9 revealed that it adopts a zigzag conformation in the solid state. The UV-vis and fluorescence measurements of compound 9 indicated that it also preferentially took a zigzag conformation in the solution state.",
 'Synthesis and Characterization of [5]Cycloparaphenylene The synthesis of highly strained [5]cycloparaphenylene ([5]CPP), a structural unit of the periphery of C-60 and the shortest possible structural constituent of the sidewall of a (5,5) carbon nanotube, was achieved in nine steps in 17% overall yield. The synthesis relied on metal-mediated ring closure of a triethylsilyl (TES)-protected masked precursor 1c followed by removal of the TES groups and subsequent reductive aromatization. UV-vis and electrochemical studies revealed that the HOMO-LUMO gap of [5]CPP is narrow and is comparable to that of C-60, as predicted by theoretical calculations. The results suggest that [5]CPP should be an excellent lead compound for molecular electronics.']

From what I understand, these are the top 3 (if I go by the semicolons), but there's no clear separation between the documents, and I would like to have a score (preferably as a percentage) of similarity to the topic.

Thanks for your amazing work !

nilsblessing commented 2 years ago

Regarding the TypeError: you need to pass the id (an integer) of the topic you are interested in. For example, to get the representative documents for topic 1, you can run topic_model.get_representative_docs(1), which returns the top 3 representative documents for topic id 1.

best Nils

doubianimehdi commented 2 years ago

Hi Nilsblessing,

I understand that, but what I want, as per the documentation, is the same for all the topics at once, together with the corresponding metric.

Thanks !

nilsblessing commented 2 years ago

If I understood correctly, you would like to receive the representative documents for each of your topics as well as the metrics for the documents or how close they are to the topic.
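
If that is the goal, one way to get the documents for every topic is to loop over the topic ids and call get_representative_docs(topic) once per id. Below is a minimal sketch of that idea, not an official BERTopic recipe. The helper name collect_representative_docs and the _StubModel class are hypothetical and only exist so the snippet runs on its own. Skipping topic -1 (the outlier topic) is an assumption that it has no representative documents.

```python
def collect_representative_docs(model, topic_ids):
    """Call model.get_representative_docs(topic) once per topic id.

    `model` is anything exposing get_representative_docs(topic),
    for example a fitted BERTopic model.
    """
    docs_per_topic = {}
    for topic in topic_ids:
        if topic == -1:
            # Assumption: the outlier topic has no representative documents.
            continue
        docs_per_topic[topic] = model.get_representative_docs(topic)
    return docs_per_topic


class _StubModel:
    """Hypothetical stand-in for a fitted model so the sketch is runnable."""

    def __init__(self, docs):
        self._docs = docs

    def get_representative_docs(self, topic):
        return self._docs[topic]


stub = _StubModel({0: ["doc a", "doc b"], 1: ["doc c"]})
all_docs = collect_representative_docs(stub, [-1, 0, 1])

# Print one document per line so the separation between documents is clear.
for topic, docs in all_docs.items():
    for rank, doc in enumerate(docs, start=1):
        print(f"topic {topic} | doc {rank}: {doc}")
```

With a real model, the topic ids could be taken from the Topic column of topic_model.get_topic_info().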

best Nils

MaartenGr commented 2 years ago

From what I understand, these are the top 3 (if I go by the semicolons), but there's no clear separation between the documents, and I would like to have a score (preferably as a percentage) of similarity to the topic.

There currently is not a specific score that relates to the similarity of the document to the topic. The reasoning for this is that the representative_docs function is based upon the exemplars function of HDBSCAN. Defining distance (or similarity) is quite difficult and interpreting them even more so, especially when you have clusters of strange shapes. Imagine an S-shaped cluster, what would distance/similarity actually mean? For those reasons, I did not add scores to the output.
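
As a toy illustration of that point (plain Python, not BERTopic or HDBSCAN code), consider an arc-shaped cluster instead of an S-shaped one. The sketch below places points on a half circle and measures their distance to the cluster centroid. The centroid falls in the empty space inside the arc, so a point that is not on the arc at all can look "closer" to the cluster than any real member. The choice of a half circle, 50 points, and Euclidean distance are all assumptions made for the example.

```python
import math

# 50 points on a half circle: a simple non-convex, arc-shaped cluster.
n_points = 50
cluster = [
    (
        math.cos(math.pi * i / (n_points - 1)),
        math.sin(math.pi * i / (n_points - 1)),
    )
    for i in range(n_points)
]

# Centroid of the cluster: the mean of the coordinates.
centroid = (
    sum(x for x, _ in cluster) / n_points,
    sum(y for _, y in cluster) / n_points,
)


def distance(p, q):
    """Euclidean distance between two 2D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


# Distance of every real cluster member to the centroid.
member_distances = [distance(p, centroid) for p in cluster]

# A point sitting exactly on the centroid, which lies off the arc.
outsider = centroid
outsider_distance = distance(outsider, centroid)

print(f"closest member to centroid:  {min(member_distances):.3f}")
print(f"farthest member to centroid: {max(member_distances):.3f}")
print(f"outsider to centroid:        {outsider_distance:.3f}")
```

A centroid-based score would rank the outsider above every actual member of the cluster, which is the kind of result that is hard to interpret.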

gsalfourn commented 2 years ago

In addition to the previous responses: I'm not sure whether this will be helpful to you, but you could try the following series of steps after saving the probabilities.

from bertopic import BERTopic

topic_model = BERTopic(calculate_probabilities=True, verbose=False)

# train the model, then extract topics and probabilities
# (docs is your list of document strings)
topics, probabilities = topic_model.fit_transform(docs)

You could then use the following steps to get the information you want

# topic info for top 20 topics
topic_model.get_topic_info().head(20)

# extract the topic names
top_names = topic_model.topic_names

- convert topic names from dict to df

# extract representative docs for all topics
rep_docs = topic_model.representative_docs

- convert rep docs from dict to df

# get the top words per topic with their c-TF-IDF scores
# (these are word weights, not document probabilities)
top_probs = topic_model.get_topics()

- convert topic words from dict to df

You can then use pandas to merge the three DataFrames on topic number, since each of them contains it (topic_num):

import pandas as pd

# top_names and rep_docs are the DataFrames built above, each with a topic_num column
output = pd.merge(top_names,
                  rep_docs,
                  how='left',
                  on='topic_num')
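
The "convert from dict to df, then merge" steps can also be sketched without pandas, which may make the shape of the result easier to see. This is only an illustration: the three dicts below are made-up data shaped like the attributes mentioned above, and build_rows is a hypothetical helper, not part of BERTopic. It behaves like a left join on topic number, so a topic without representative documents still gets one row.

```python
# Hypothetical data, keyed by topic number like the dicts described above.
topic_names = {0: "0_cpp_synthesis", 1: "1_solar_cells", 2: "2_misc"}
rep_docs = {0: ["doc a", "doc b"], 1: ["doc c"]}
topic_words = {
    0: [("cpp", 0.12), ("synthesis", 0.08)],
    1: [("solar", 0.10)],
    2: [("misc", 0.05)],
}


def build_rows(topic_names, rep_docs, topic_words):
    """Join the three dicts on topic number, one row per representative doc."""
    rows = []
    for topic_num, name in sorted(topic_names.items()):
        words = ", ".join(word for word, _ in topic_words.get(topic_num, []))
        # Left-join behaviour: keep the topic even if it has no documents.
        docs = rep_docs.get(topic_num) or [None]
        for doc in docs:
            rows.append(
                {
                    "topic_num": topic_num,
                    "topic_name": name,
                    "top_words": words,
                    "representative_doc": doc,
                }
            )
    return rows


rows = build_rows(topic_names, rep_docs, topic_words)
for row in rows:
    print(row)
```

If a table is preferred, the resulting list of dicts can be passed to pd.DataFrame(rows).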

doubianimehdi commented 2 years ago

Thanks everyone for your useful answers, but I think there's a misunderstanding somewhere. What I'm trying to say is that rep_docs = topic_model.representative_docs() doesn't work!

gsalfourn commented 2 years ago

try this

rep_docs = topic_model.representative_docs

or that

rep_docs = topic_model.get_representative_docs()

MaartenGr commented 2 years ago

Due to inactivity, I'll be closing this issue. If you are still experiencing this issue, let me know and I'll reopen it!