MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/
MIT License
5.99k stars, 751 forks

Topic modelling for corpus comparison ('topics per class' on steroids?) #1786

Open sdspieg opened 7 months ago

sdspieg commented 7 months ago

Has anybody ever looked into this? If so, would you be willing to share your experience?

Let me share a concrete use case of my own. I am currently playing with Parlamint 4.0, a dataset with the parliamentary debates of 26 European parliaments since ~1996 (coverage varies by country). It includes the full text of all contributions as well as a bunch of metadata, including, for instance, the speakers, their political parties and their political orientation (e.g. left, centre, right; or even 'right-to-far-right' vs 'far right'). One version of the dataset even contains English translations of everything.

I first filtered out all paragraphs that are relevant for my research topic (Russia) with a regex-based keyword search. I then extracted key noun phrases from the paragraph text, using spaCy's 'en_core_web_lg' language model for the PoS parsing in KeyphraseCountVectorizer, the 'all-mpnet-base-v2' language model for the SentenceTransformer, and then just matplotlib and seaborn for the vizzes.

Here's an example viz. It shows the normalized salience of these keyphrases per year for one random country (in this case the UK) ("what percentage of all paragraphs are Russia-relevant for that country and how has that changed over time?").

[Image: Austria_Top_Keywords_Heatmap]
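To make the filtering and normalization step concrete, here is a minimal sketch of a regex keyword filter followed by a per-year share of relevant paragraphs. The pattern `RUSSIA_PATTERN`, the function name and the sample data are all hypothetical; the actual regex and data layout used on Parlamint are not given in this thread.

```python
import re
from collections import Counter

# Hypothetical keyword pattern; the real search terms are not stated in the thread.
RUSSIA_PATTERN = re.compile(r"\b(russia\w*|kremlin|moscow|putin)\b", re.IGNORECASE)

def relevant_share_per_year(paragraphs):
    """Return {year: percentage of that year's paragraphs matching the pattern}.

    `paragraphs` is an iterable of (year, text) pairs.
    """
    totals = Counter()  # all paragraphs per year
    hits = Counter()    # keyword-matching paragraphs per year
    for year, text in paragraphs:
        totals[year] += 1
        if RUSSIA_PATTERN.search(text):
            hits[year] += 1
    return {year: 100.0 * hits[year] / totals[year] for year in sorted(totals)}

# Tiny made-up sample, for illustration only.
sample = [
    (2014, "Sanctions against Russia were debated."),
    (2014, "The budget for schools was approved."),
    (2015, "Moscow responded to the statement."),
    (2015, "The Kremlin denied involvement."),
]
shares = relevant_share_per_year(sample)
```

Dividing by the yearly total rather than plotting raw counts keeps years with more parliamentary activity from dominating the picture.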

So yes, this is a 'dense' viz, BUT it is (IMO) still visually more 'chewable' than a line graph once you have more than just a few items to show...

I am currently running BERTopic (using these KeyBERT key phrases) on my entire 'Russia'-corpus with the usual visualization options. I will then use some LLM to label these topics more 'intelligently' and to write a few paragraphs on them (for a write-up). I will definitely also (try to) make a landscape view like in this (CiteSpace) viz:

[Image: inflation+recession]

But my question concerns comparing topics across subcorpora. I'll definitely generate a heatmap like the one I inserted here. [And btw, if Maarten reads this, I'd be happy to share my code with him.] The heatmap I shared here had keyphrases on the y-axis; this one will have topic labels, and you would see the relative salience of these topics over time. I then also plan to run it on each individual country ("what are the most salient Russia-related topics that are debated in country a?") and on the 'far right' and 'far left' political parties, both overall and for selected countries. I might think of more.
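The data behind such a topic-by-year heatmap is just a matrix of per-year topic shares. A minimal sketch, assuming each document already carries a topic label and a year (the function name `salience_matrix` and the sample labels are made up for illustration):

```python
from collections import Counter

def salience_matrix(assignments):
    """Build a topic x year matrix of relative salience.

    `assignments` is an iterable of (topic_label, year) pairs, one per document.
    Returns (topics, years, matrix), where matrix[i][j] is the share of
    year j's documents assigned to topic i, so each column sums to 1.
    """
    assignments = list(assignments)
    counts = Counter(assignments)                        # (topic, year) -> n docs
    year_totals = Counter(year for _, year in assignments)
    topics = sorted({topic for topic, _ in assignments})
    years = sorted(year_totals)
    matrix = [
        [counts[(topic, year)] / year_totals[year] for year in years]
        for topic in topics
    ]
    return topics, years, matrix

# Made-up topic assignments, for illustration only.
docs = [("energy", 2014), ("sanctions", 2014), ("sanctions", 2014), ("energy", 2015)]
topics, years, matrix = salience_matrix(docs)
```

The result can be handed to a plotting call such as `seaborn.heatmap(matrix, xticklabels=years, yticklabels=topics)`. Running the same function on a subcorpus (one country, one party family) gives a directly comparable matrix, provided the topic labels come from one shared model.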

So the way I see it, for comparing subcorpora, we currently have the interactive horizontal 'Topics per class' bar chart viz that's already part of BERTopic (and it's great!), and we have the heatmap viz I proposed here (it's not so great, but it DOES 'show' more of the big picture). But can anybody think of other ways of comparing the results of topic modelling across classes AND across time? Sorry for being so long-winded, but I'd be really grateful if somebody were willing to share some ideas.

MaartenGr commented 7 months ago

Thanks for sharing this! Definitely some interesting ideas here. I think it all boils down to how many additional dimensions you want to add to a single-topic representation. For instance, dynamic and class-based topic modeling each add a single additional dimension to visualize and analyze. The more dimensions you add, the harder the resulting visualization becomes to read, and the more it depends on the type of dimension (discrete vs. continuous).
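One way to picture two added dimensions at once is to treat the discrete one (class) as a facet and the continuous one (time) as an axis inside each facet. The sketch below is only an illustration of that idea in plain Python; it is not part of BERTopic, and `faceted_salience` and the sample records are hypothetical.

```python
from collections import Counter, defaultdict

def faceted_salience(records):
    """Group topic shares by class first, then by year.

    `records` is an iterable of (class_label, year, topic) triples.
    Returns {class_label: {year: {topic: share}}}, where shares are
    normalized within each (class, year) cell.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for class_label, year, topic in records:
        counts[class_label][year][topic] += 1

    result = {}
    for class_label, by_year in counts.items():
        result[class_label] = {}
        for year, topic_counts in by_year.items():
            total = sum(topic_counts.values())
            result[class_label][year] = {
                topic: n / total for topic, n in topic_counts.items()
            }
    return result

# Made-up records, for illustration only.
records = [
    ("UK", 2014, "sanctions"),
    ("UK", 2014, "energy"),
    ("UK", 2015, "sanctions"),
    ("Austria", 2014, "energy"),
]
panels = faceted_salience(records)
```

Each top-level key then maps to one panel (for example one heatmap per country), which keeps every individual plot at a single added dimension.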

By the way, from a statistical perspective, this also relates to #360 where similar forms of visualizations and statistical representation were explored.

All in all, I am definitely open to a more dynamic way of representing more dimensions following the methods in class-based and dynamic topic modeling. However, since I do not work on BERTopic full-time but merely in the late hours, I have limited time to approach these kinds of visualizations. If you have any concrete suggestions, I'm all ears.