Closed andysingal closed 9 months ago
--- Please advise on how to fix it. Additionally, what are best practices to pay attention to when working with topics?
The `topic_differences` function is not being used correctly. You should supply it with the topics themselves, not the topic model. Where exactly did you find that code?
--- Do you prefer cleaning and removing stopwords? I hope you can add a page with best practices.

Generally, I would not clean or remove stopwords beforehand but would handle them through the CountVectorizer instead.
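As a small sketch of that approach: instead of stripping stopwords from the documents in preprocessing, a `CountVectorizer` with `stop_words="english"` removes them only at topic-representation time. (The stopword filtering shown here is standard scikit-learn; the sample sentence is made up for illustration.)

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stopwords are removed when building topic representations,
# not from the raw documents themselves
vectorizer_model = CountVectorizer(stop_words="english")

# The analyzer shows which tokens survive the stopword filter
analyzer = vectorizer_model.build_analyzer()
print(analyzer("the cat sat on the mat"))  # ['cat', 'sat', 'mat']
```

The resulting `vectorizer_model` can then be passed to BERTopic via its `vectorizer_model` parameter, so the embeddings are still computed on the full, uncleaned text.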
--- Additionally, can you share dataset: maartengr/arxiv_nlp on huggingface?
Which dataset are you exactly referring to? I believe you can already find it here.
Thanks for your reply. I found the code in your book, *Hands-On Large Language Models*; it is missing the code for `new_topic_model`:

```python
def topic_differences(model, original_topics, max_length=75, nr_topics=10):
    """For the first 10 topics, show the differences in topic
    representations between two models"""
    for topic in range(nr_topics):
        # Extract top 5 words per topic per model
        og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
        new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])

        # Print a 'before' and 'after'
        whitespaces = " " * (max_length - len(og_words))
        print(f"Topic: {topic} {og_words}{whitespaces}--> {new_words}")
```
But your code defining `new_topic_model` is missing:

```python
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()
new_topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, new_topic_model)
```
Regarding:

> --- Additionally, can you share dataset: maartengr/arxiv_nlp on huggingface?
> Which dataset are you exactly referring to? I believe you can already find it here.

In the book you mentioned a dataset which is not available on Hugging Face.
Looking forward to hearing from you.
Aaah, that makes sense! Keep in mind that it is still a very early release, and as you might have noticed, there are still things that need to be fixed!
Having said that, you should run it as follows:
```python
topic_differences(new_topic_model, original_topics)
```

That way, it will compare the topic model you created (`new_topic_model`) with the original topics (`original_topics`).
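To see why the argument order matters, here is a minimal, self-contained sanity check of the function. The `FakeModel` class and its word lists are made up stand-ins; a real BERTopic model only needs to expose the same `get_topic(topic_id)` interface returning `(word, score)` pairs.

```python
# Hypothetical stand-in for a fitted BERTopic model:
# all topic_differences needs is a .get_topic() method
class FakeModel:
    def get_topic(self, topic):
        return [("cnn", 0.9), ("rnn", 0.8), ("neural", 0.7),
                ("deep", 0.6), ("layers", 0.5)]

def topic_differences(model, original_topics, max_length=75, nr_topics=10):
    """Show the differences in topic representations between two models."""
    for topic in range(nr_topics):
        og_words = " | ".join(list(zip(*original_topics[topic]))[0][:5])
        new_words = " | ".join(list(zip(*model.get_topic(topic)))[0][:5])
        whitespaces = " " * (max_length - len(og_words))
        print(f"Topic: {topic} {og_words}{whitespaces}--> {new_words}")

# original_topics maps topic id -> list of (word, score) pairs,
# i.e. the shape of a deepcopy of the model's topic representations
original_topics = {0: [("neural", 0.9), ("learning", 0.8), ("deep", 0.7),
                       ("networks", 0.6), ("network", 0.5)]}
topic_differences(FakeModel(), original_topics, nr_topics=1)
```

The first positional argument is the *updated* model (it is queried live via `get_topic`), while the second is the *frozen* copy of the original topics, which is why passing the model twice fails.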
The dataset I used there will be updated quite frequently, so I hadn't uploaded it yet. I definitely should fix that! For now, you can use the ArXiv dataset on Kaggle and, if you want to filter for the NLP papers, you can run the following:
```python
import json
import re

from tqdm import tqdm

# https://arxiv.org/help/api/user-manual
category_map = {
    # 'cs.AI': 'Artificial Intelligence',
    'cs.CL': 'Computation and Language',
    # 'cs.CV': 'Computer Vision and Pattern Recognition',
    # 'cs.LG': 'Machine Learning',
    # 'stat.ML': 'Machine Learning'
}
year_pattern = r'([1-2][0-9]{3})'
data_file = '../input/arxiv/arxiv-metadata-oai-snapshot.json'

def get_metadata():
    with open(data_file, 'r') as f:
        for line in f:
            yield line

titles = []
abstracts = []
years = []
categories = []
refs = []

metadata = get_metadata()
for index, paper in enumerate(tqdm(metadata)):
    paper = json.loads(paper)
    ref = paper.get('journal-ref')
    if not ref:
        ref = paper.get('update_date')

    # Try to extract the year
    if ref:
        year = re.findall(year_pattern, ref)
        if year:
            year = [int(i) for i in year if int(i) < 2024 and int(i) >= 1991]
            if year == []:
                year = None
            else:
                year = min(year)
    else:
        year = None

    try:
        if year:
            categories.append(category_map[paper.get('categories').split(" ")[0]])
            years.append(year)
            titles.append(paper.get('title'))
            abstracts.append(paper.get('abstract'))
            refs.append(ref)
    except KeyError:
        # Skip papers whose primary category is not in category_map
        continue

    if index % 100_000 == 0:
        print(len(titles))

len(titles), len(abstracts), len(years), len(categories), len(refs)
```
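As a quick, self-contained check of the year-extraction step above (the sample journal-ref strings are made up for illustration), the same regex and range filter can be exercised on their own:

```python
import re

year_pattern = r'([1-2][0-9]{3})'

def extract_year(ref):
    """Mirror the snippet's logic: find 4-digit years,
    keep only 1991-2023, and take the earliest one."""
    years = re.findall(year_pattern, ref)
    years = [int(y) for y in years if 1991 <= int(y) < 2024]
    return min(years) if years else None

print(extract_year("Proc. ACL 2019, pages 100-110"))  # 2019
print(extract_year("J. Symbolic Logic 1987"))         # None (outside 1991-2023)
```

Taking `min(years)` means a reference containing several plausible years (e.g. a volume year and a print year) resolves to the earliest one.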
I'll let you know when I have updated the dataset for the book!
Thanks for sharing the code to produce the dataset. Yes, I am using:

```python
from copy import deepcopy

original_topics = deepcopy(topic_model.topic_representations_)
```

(Note: `topic_model = BERTopic(umap_model=umap_model)`)
But please help here: I would like to know what **new_topic_model** is in your code. You never show where it is defined:

```python
from bertopic.representation import KeyBERTInspired

representation_model = KeyBERTInspired()

# Update our topic representations
new_topic_model.update_topics(abstracts, representation_model=representation_model)
topic_differences(topic_model, new_topic_model)
```
Ah right, you can replace `new_topic_model` with `topic_model` and then it should work.
The section that you refer to shows different ways of improving the original topic representations. So you first create an initial model, namely `topic_model`, and then you update it with one of the mentioned representation models.
Thank you very much it worked now :)
```python
# KeyBERTInspired
from bertopic.representation import KeyBERTInspired
from bertopic import BERTopic

representation_model = KeyBERTInspired()

# Update our topic representations
topic_model.update_topics(sentences, representation_model=representation_model)

# Show topic differences
topic_differences(topic_model, original_topics)
```
Output:

```
Topic: 0 groups | group | finite | abstract | prove --> groups | subgroups | group | subgroup | abstract
Topic: 1 neural | learning | deep | networks | network --> cnns | cnn | rnns | neural | recognition
Topic: 2 type | program | programming | programs | logic --> compiler | programming | interpreter | syntax | programs
Topic: 3 estimator | estimation | distribution | estimators | models --> estimating | models | estimation | estimators | empirical
Topic: 4 graph | algorithm | graphs | problem | time --> graphs | algorithms | nodes | graph | algorithm
Topic: 5 abstract | 48th | proc | mit | franckymitedu --> abstract | abstracts | mit | 2016 | acm
Topic: 6 policy | learning | reinforcement | control | robot --> reinforcement | robotics | planning | robot | controllers
Topic: 7 channel | mimo | channels | fading | performance --> mimo | transmit | multiplexing | channels | 5g
Topic: 8 control | consensus | multiagent | agents | systems --> multiagent | controllability | cooperative | synchronization | distributed
Topic: 9 problem | algorithm | crossover | evolutionary | routing --> metaheuristic | optimisation | algorithm | algorithms | heuristic
```
Have a good day Sir!! God Bless you!! Thanks Again!!
That's very kind of you! If you ever have any questions or comments, feel free to reach out 😄
Hi Maarten, I was working on:

but getting error:

Here are my questions:
--- Please advise on how to fix it. Additionally, what are best practices to pay attention to when working with topics?
--- Do you prefer cleaning and removing stopwords? I hope you can add a page with best practices.
--- Additionally, can you share dataset: maartengr/arxiv_nlp on huggingface?
Thanks Again!!