MaartenGr / BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.
https://maartengr.github.io/BERTopic/

KeyBERTInspired issue running #1344

Closed: andysingal closed this issue 9 months ago

andysingal commented 1 year ago

Hi Maarten, I was working on:

from umap import UMAP
from bertopic import BERTopic

# Using a custom UMAP model
umap_model = UMAP(n_neighbors=15, n_components=2, min_dist=0.0, metric='cosine', random_state=42)

# Train our model
topic_model = BERTopic(umap_model=umap_model)

I am trying the following: topics generated with c-TF-IDF serve as a good first ranking of words with respect to their topic. In this section, these initial rankings of words can be considered candidate keywords for a topic, as we might change their rankings based on any representation model.

# Save original representations
from copy import deepcopy
original_topics = deepcopy(topic_model.topic_representations_)

def topic_differences(model, original_topics, max_length=75, nr_topics=10):
  """ For the first 10 topics, show the differences in 
  topic representations between two models """
  for topic in range(nr_topics):

    # Extract top 5 words per topic per model
    og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
    new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])

    # Print a 'before' and 'after'
    whitespaces = " " * (max_length - len(og_words))
    print(f"Topic: {topic}    {og_words}{whitespaces}-->     {new_words}")

Further, I tried:
# KeyBERTInspired
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
representation_model = KeyBERTInspired()

# Update our topic representations
new_topic_model = BERTopic(representation_model=representation_model).fit(sentences)
(I got the idea from: https://zenodo.org/record/7987071)

# Show topic differences
topic_differences(topic_model, new_topic_model)

but I am getting this error:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <cell line: 10>:10                                                                            │
│ in topic_differences:7                                                                           │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: 'BERTopic' object is not subscriptable

Here are my questions:

--- Please advise on how to fix it. Additionally, what best practices should I pay attention to for topics?

--- Do you prefer cleaning and removing stopwords? I hope you can add a page with best practices.

--- Additionally, can you share the dataset maartengr/arxiv_nlp on Hugging Face?

Thanks Again!!

MaartenGr commented 1 year ago

--- Please advise on how to fix it. Additionally, what best practices should I pay attention to for topics?

The topic_differences function is not used correctly: you should supply it with the topics themselves, not the topic model. Where exactly did you find that code?
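To make the distinction concrete, here is a minimal sketch, reusing topic_model and new_topic_model from your snippets and assuming both have been fitted:

from copy import deepcopy

# original_topics is a plain dict of {topic_id: [(word, score), ...]}, so it supports indexing
original_topics = deepcopy(topic_model.topic_representations_)
first_topic_words = original_topics[0]   # works: a list of (word, score) tuples

# A fitted BERTopic instance does not support indexing, which is what raises the error
# new_topic_model[0]                     # TypeError: 'BERTopic' object is not subscriptable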

--- Do you prefer cleaning and removing stopwords? I hope you can add a page with best practices.

Generally, I would not clean or remove stopwords beforehand but use the CountVectorizer to remove them instead.
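To illustrate, here is a minimal sketch of removing English stop words through the CountVectorizer instead of preprocessing the documents yourself (the stop_words value is just an example):

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

# Stop words are removed when the topic representations are created,
# not from the documents that are embedded and clustered
vectorizer_model = CountVectorizer(stop_words="english")
topic_model = BERTopic(vectorizer_model=vectorizer_model)

You can also pass the same vectorizer_model to update_topics after fitting if you prefer to keep the initial model untouched.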

--- Additionally, can you share the dataset maartengr/arxiv_nlp on Hugging Face?

Which dataset are you exactly referring to? I believe you can already find it here.

andysingal commented 1 year ago

Thanks for your reply. I found the code in your book, Hands-On Large Language Models, but it is missing the code that defines new_topic_model. The function itself is:

def topic_differences(model, original_topics, max_length=75, nr_topics=10):
  """ For the first 10 topics, show the differences in 
  topic representations between two models """
  for topic in range(nr_topics):

    # Extract top 5 words per topic per model
    og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
    new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])

    # Print a 'before' and 'after'
    whitespaces = " " * (max_length - len(og_words))
    print(f"Topic: {topic}    {og_words}{whitespaces}-->     {new_words}")

But your code does not define new_topic_model:

# KeyBERTInspired
from bertopic.representation import KeyBERTInspired
representation_model = KeyBERTInspired()

# Update our topic representations
new_topic_model.update_topics(abstracts, representation_model=representation_model)

# Show topic differences (this is where topic_differences is used)
topic_differences(topic_model, new_topic_model)

Regarding:

--- Additionally, can you share the dataset maartengr/arxiv_nlp on Hugging Face?

--- Which dataset are you exactly referring to? I believe you can already find it here.

In the book you mentioned a dataset, but it is not available on Hugging Face.

Looking forward to hearing from you.

MaartenGr commented 1 year ago

Aaah, that makes sense! Keep in mind that it is still a very early release, and as you might have noticed, there are still things that need to be fixed!

Having said that, you should run it as follows:

topic_differences(new_topic_model, original_topics)

That way, it will compare the topic model you created (new_topic_model) with the original topics (original_topics).

The dataset I used there will be updated quite frequently, so I hadn't uploaded it yet. I definitely should fix that! For now, you can use the ArXiv dataset on Kaggle, and if you want to filter out the NLP papers, you can run the following:

import json
from tqdm import tqdm
import re
# https://arxiv.org/help/api/user-manual
category_map = {
# 'cs.AI': 'Artificial Intelligence',
'cs.CL': 'Computation and Language',
# 'cs.CV': 'Computer Vision and Pattern Recognition',
# 'cs.LG': 'Machine Learning',
# 'stat.ML': 'Machine Learning'
}
year_pattern = r'([1-2][0-9]{3})'  # matches a four-digit year

data_file = '../input/arxiv/arxiv-metadata-oai-snapshot.json'

def get_metadata():
    with open(data_file, 'r') as f:
        for line in f:
            yield line

titles = []
abstracts = []
years = []
categories = []
refs = []
metadata = get_metadata()
for index, paper in enumerate(tqdm(metadata)):
    paper = json.loads(paper)
    ref = paper.get('journal-ref')

    if not ref:
        ref = paper.get('update_date')

    # try to extract year
    if ref:
        year = re.findall(year_pattern, ref)
        if year:
            year = [int(i) for i in year if int(i) < 2024 and int(i) >= 1991]
            if year == []:
                year = None
            else:
                year = min(year)
    else:
        # No reference information available for this paper
        year = None

    try:
        if year:            
            categories.append(category_map[paper.get('categories').split(" ")[0]])
            years.append(year)
            titles.append(paper.get('title'))
            abstracts.append(paper.get('abstract'))
            refs.append(ref)
    except (KeyError, AttributeError):
        # Skip papers whose primary category is not in category_map
        continue

    if index % 100_000 == 0:
        print(len(titles))
len(titles), len(abstracts), len(years), len(categories), len(refs)
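If it helps, you can afterwards gather the collected lists into a single frame before feeding the abstracts to BERTopic (a sketch; the column names are just illustrative and pandas is assumed to be installed):

import pandas as pd

# Combine the filtered metadata into one table
papers = pd.DataFrame({
    "Titles": titles,
    "Abstracts": abstracts,
    "Years": years,
    "Categories": categories,
    "Refs": refs,
})
print(papers.shape)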

I'll let you know when I have updated the dataset for the book!

andysingal commented 1 year ago

Thanks for sharing the code to produce the dataset. Yes, I am using:

# Save original representations
from copy import deepcopy
original_topics = deepcopy(topic_model.topic_representations_)

(Note: topic_model = BERTopic(umap_model=umap_model))

Please help here: I would like to know what new_topic_model is in your code. You did not define new_topic_model anywhere; can you share what it refers to?

# KeyBERTInspired
from bertopic.representation import KeyBERTInspired
representation_model = KeyBERTInspired()

# Update our topic representations
new_topic_model.update_topics(abstracts, representation_model=representation_model)  # what is new_topic_model?

# Show topic differences
topic_differences(topic_model, new_topic_model)

MaartenGr commented 1 year ago

Ah right, you can replace new_topic_model with topic_model and then it should work.

The section that you refer to shows different ways of improving the original topic representations. So you first create an initial model, namely topic_model, and then you update that with one of the mentioned representation models.
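Put differently, the intended flow looks something like this (a sketch, reusing umap_model, abstracts, and the topic_differences function from earlier in this thread):

from copy import deepcopy
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired

# 1. Create and fit the initial model, then save its c-TF-IDF representations
topic_model = BERTopic(umap_model=umap_model).fit(abstracts)
original_topics = deepcopy(topic_model.topic_representations_)

# 2. Update that same model with a representation model of your choice
topic_model.update_topics(abstracts, representation_model=KeyBERTInspired())

# 3. Compare the updated topics against the saved originals
topic_differences(topic_model, original_topics)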

andysingal commented 1 year ago

Thank you very much, it works now :)

# KeyBERTInspired
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic
representation_model = KeyBERTInspired()

# Update our topic representations
topic_model.update_topics(sentences, representation_model=representation_model)

# Show topic differences
topic_differences(topic_model, original_topics)
Output:
Topic: 0    groups | group | finite | abstract | prove                                 -->     groups | subgroups | group | subgroup | abstract
Topic: 1    neural | learning | deep | networks | network                              -->     cnns | cnn | rnns | neural | recognition
Topic: 2    type | program | programming | programs | logic                            -->     compiler | programming | interpreter | syntax | programs
Topic: 3    estimator | estimation | distribution | estimators | models                -->     estimating | models | estimation | estimators | empirical
Topic: 4    graph | algorithm | graphs | problem | time                                -->     graphs | algorithms | nodes | graph | algorithm
Topic: 5    abstract | 48th | proc | mit | franckymitedu                               -->     abstract | abstracts | mit | 2016 | acm
Topic: 6    policy | learning | reinforcement | control | robot                        -->     reinforcement | robotics | planning | robot | controllers
Topic: 7    channel | mimo | channels | fading | performance                           -->     mimo | transmit | multiplexing | channels | 5g
Topic: 8    control | consensus | multiagent | agents | systems                        -->     multiagent | controllability | cooperative | synchronization | distributed
Topic: 9    problem | algorithm | crossover | evolutionary | routing                   -->     metaheuristic | optimisation | algorithm | algorithms | heuristic

Have a good day Sir!! God Bless you!! Thanks Again!!

MaartenGr commented 1 year ago

That's very kind of you! If you ever have any questions or comments, feel free to reach out 😄