Hi,
just to clarify: do you want to use embeddings of text coming from OpenAI or Cohere? If so, you can load custom embeddings into CTMs; we also support custom numpy embeddings.
We currently don't have an automatic embedding method for those providers.
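For concreteness, a minimal sketch of that path (it mirrors the `CTMDataset` recipe given later in this thread; the `.npy` filename is a placeholder for whatever your own pipeline produces):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.datasets.dataset import CTMDataset

documents = ["first document", "second document"]

# Precomputed contextual embeddings from any provider (OpenAI, Cohere, ...),
# saved as an (n_documents, embedding_dim) numpy array.
custom_embeddings = np.load("my_precomputed_embeddings.npy")  # placeholder file

# Bag-of-words side of the model
vectorizer = CountVectorizer()
train_bow_embeddings = vectorizer.fit_transform(documents)
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}

training_dataset = CTMDataset(custom_embeddings, train_bow_embeddings, id2token, labels=None)
```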
I see that, thanks. I was looking for an option to generate a topic name given the words in the topic. Is there an option for that? Also, is there an option to get n-gram output in the topics?
> I was looking for an option to generate a topic name given the words in the topic. Is there an option for that?

No, unfortunately we currently do not support that.

> Also, is there an option to get n-gram output in the topics?

You should be able to manually preprocess your input text so that it contains bigrams; the topic model will then treat these as actual unique tokens.
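One way to do that outside the library (a sketch, assuming gensim is acceptable for the merging step) is to rewrite frequent word pairs as single underscore-joined tokens before preprocessing:

```python
from gensim.models.phrases import Phrases, Phraser

docs = ["i love new york", "new york is great"]
tokenized = [d.split() for d in docs]

# min_count/threshold control how aggressively pairs are merged
bigram = Phraser(Phrases(tokenized, min_count=1, threshold=1))
docs_with_bigrams = [" ".join(bigram[toks]) for toks in tokenized]
# e.g. ["i love new_york", "new_york is great"]
```

The merged tokens (e.g. `new_york`) then survive tokenization as single vocabulary entries.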
Thanks for your input. I will try that.
I have a few other questions. I hope it's fine to ask here. Is it possible to train the model with two GPUs? Is there an option to partial fit so that I can batch process the data? I am currently getting a memory allocation error when I input the full dataset.
> Is it possible to train the model with two GPUs?

Unfortunately no, the model is currently single-GPU based.

> Is there an option to partial fit so that I can batch process the data?

Could you elaborate on the partial fit? The model is already trained in batches; you can reduce the batch size in the configuration if it is too large. An additional question: how large is the vocabulary of your data?
Partial fit: I was looking for an option like this https://maartengr.github.io/BERTopic/getting_started/online/online.html#example
`len(docs)` is 1627591 and `len(tp.vocab)` is 2000. How do I adjust the batch size?
Regarding bigrams: when I convert the input text as below, I am getting an error.

```
[('Horrible', 'app'), ('app', '!'), ('!', 'I'), ('I', 'created'), ('created', 'my'), ('my', 'account'), ('account', 'but'), ('but', 'it'), ('it', 'wont'), ('wont', 'take'), ('take', 'me'), ('me', 'past'), ('past', 'the'), ('the', 'email/phone'), ('email/phone', 'verification'), ('verification', 'screen'), ('screen', '.'), ('.', 'Ive'), ('Ive', 'tried'), ('tried', 'everything'), ('everything', ','), (',', 'including'), ('including', 'un-installing'), ('un-installing', 'and'), ('and', 're-installing'), ('re-installing', ','), (',', 'but'), ('but', 'same'), ('same', 'thing'), ('thing', '.'), ('.', 'So'), ('So', 'frustrating'), ('frustrating', '!'), ('!', 'Get'), ('Get', 'it'), ('it', 'figured'), ('figured', 'out'), ('out', 'please'), ('please', '!'), ('!', '!'), ('!', '!'), ('!', '!')]
```

```
----> 6 preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()

AttributeError: 'list' object has no attribute 'lower'
```
If you want to have a custom preprocessing pipeline, you cannot use the standard preprocessing.
It should be pretty easy to do: you first need to generate the contextualized representations and the BoW representations manually. This is just an example; you need to adapt it to your pipeline:
```python
from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list

# Bag-of-words representations from your custom-preprocessed text
vectorizer = CountVectorizer()  # from sklearn
train_bow_embeddings = vectorizer.fit_transform(text_for_bow)

# Contextualized representations from the unpreprocessed text
train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, "chosen_contextualized_model")

vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}
```
Then you can instantiate the CTMDataset class and use it for training.
Let me know if you need help!
Forgot to reply about the batch size: you can define it in the main class, see https://github.com/MilaNLProc/contextualized-topic-models/blob/master/contextualized_topic_models/models/ctm.py#L21
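For example (a sketch: `batch_size` is a keyword argument of the class linked above, defaulting to 64):

```python
from contextualized_topic_models.models.ctm import CombinedTM

# Smaller batches reduce peak GPU memory at the cost of slower epochs
ctm = CombinedTM(bow_size=2000, contextual_size=768,
                 n_components=50, batch_size=32)
```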
Can you tell me how to instantiate the CTMDataset and use that for training? I am getting an error when I try to pass id2token to TopicModelDataPreparation(), and I did not understand how to prepare the training dataset with id2token. Please help!
Hi!
If you use your own preprocessing you can skip the TopicModelDataPreparation part! Let's assume you have
```python
text_for_bow = ["your_first document", "a second document in_the_collection"]
text_for_contextual = ["your first document", "a second document in the collection"]
```
You can use this to generate the components to instantiate CTMDataset (you will find the definition of `bert_embeddings_from_list` in `contextualized_topic_models/utils/data_preparation.py`):
```python
from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list

vectorizer = CountVectorizer()  # from sklearn
train_bow_embeddings = vectorizer.fit_transform(text_for_bow)
train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, "chosen_contextualized_model")
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}
```
Once you have these elements, you should be able to create a dataset object and use it for training:
```python
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.models.ctm import CombinedTM

training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)

ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=50)  # 50 topics
ctm.fit(training_dataset)  # run the model
```
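Once it has finished training, you can inspect the learned topics, e.g.:

```python
topics = ctm.get_topic_lists(10)  # top-10 words for each of the 50 topics
```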
Code:
```python
from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list

vectorizer = CountVectorizer()  # from sklearn
train_bow_embeddings = vectorizer.fit_transform(preprocessed_documents)
train_contextualized_embeddings = bert_embeddings_from_list(unpreprocessed_corpus, "bert-base-nli-mean-tokens")
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}
```

This raises:

```
TypeError                                 Traceback (most recent call last)
```
With the toy example:

```python
text_for_bow = ["your_first document", "a second document in_the_collection"]
text_for_contextual = ["your first document", "a second document in the collection"]

vectorizer = CountVectorizer()  # from sklearn
train_bow_embeddings = vectorizer.fit_transform(text_for_bow)
train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, "bert-base-nli-mean-tokens")
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}
```

I get the same error:

```
TypeError                                 Traceback (most recent call last)
```
Yes, you need to choose an embedding model (`chosen_contextualized_model` is a placeholder).
Also, please consider using code formatting; otherwise it's hard to read.
Sorry, I forgot to put in the embedding model. I just updated the previous two messages with the error I am getting.
```python
bert_embeddings_from_list(text_for_contextual, "bert-base-nli-mean-tokens", max_seq_length=200)
```

should fix the issue. Note that the max sequence length depends on the model.
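As a side note (not CTM-specific), with sentence-transformers you can check a model's limit directly:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")
print(model.max_seq_length)  # tokens beyond this are truncated
```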
Yeah, it worked, thanks. I assume I should give `bow_size=len(vocab)`. I ran the following code and it worked. Just to make sure, is this the correct code for training bigrams?
```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list
from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.models.ctm import CombinedTM

sp = WhiteSpacePreprocessingStopwords(docs, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()

vectorizer = CountVectorizer(ngram_range=(2, 2))  # from sklearn: bigram-only BoW
train_bow_embeddings = vectorizer.fit_transform(preprocessed_documents)
train_contextualized_embeddings = bert_embeddings_from_list(unpreprocessed_corpus, "bert-base-nli-mean-tokens", max_seq_length=200)
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}

np.random.seed(42)
training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)

ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12, num_epochs=25)  # 12 topics
ctm.fit(training_dataset)  # run the model
ctm.get_topic_lists(15)
```
That should do the job, I think!
Is it possible to get a representation of the topics using advanced models from OpenAI, Cohere, etc.?