MilaNLProc / contextualized-topic-models

A Python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.).
MIT License

representation embedding #128

Closed: josepius-clemson closed this issue 1 year ago

josepius-clemson commented 1 year ago

Is it possible to get a representation of the topics using advanced models from OpenAI, Cohere, etc.?

vinid commented 1 year ago

Hi,

just to clarify: do you want to use embeddings of text coming from OpenAI or Cohere? If so, you can load custom embeddings in CTMs. We also support custom numpy embeddings.

We currently don't have an automatic embedding method for those providers, though.
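For reference, here is a minimal sketch of plugging in precomputed vectors (this assumes the custom_embeddings argument of TopicModelDataPreparation.fit; how you obtain the vectors from OpenAI or Cohere is up to you, and the file name below is hypothetical):

import numpy as np
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation

# embeddings computed outside the package (e.g., via the OpenAI or Cohere
# APIs), one row per document, loaded as a numpy array
custom_embeddings = np.load("my_document_embeddings.npy")  # hypothetical file

qt = TopicModelDataPreparation()  # no SBERT model name: we bring our own vectors
training_dataset = qt.fit(
    text_for_contextual=unpreprocessed_documents,
    text_for_bow=preprocessed_documents,
    custom_embeddings=custom_embeddings,
)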

josepius-clemson commented 1 year ago

I see that, thanks. I was looking for an option to generate a topic name given the words in the topic. Is there an option for that? Also, is there an option to get n-gram output in the topics?

vinid commented 1 year ago

I was looking for an option to generate a topic name given the words in the topic. Is there an option for that?

No, unfortunately we currently do not support that.

Also, is there an option to get n-gram output in the topics?

You should be able to manually preprocess your input text so that it contains bigrams; the topic model will then treat these as unique tokens. See the sketch below.
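A minimal sketch of that kind of preprocessing (plain Python; to_bigram_text is just a hypothetical helper, and a phrase detector such as gensim's Phrases would be a more selective alternative):

def to_bigram_text(doc):
    # join each adjacent word pair with an underscore so the vectorizer
    # treats every bigram as a single unique token
    words = doc.split()
    return " ".join("_".join(pair) for pair in zip(words, words[1:]))

docs_bigrams = [to_bigram_text(d) for d in docs]  # docs: your raw corpus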

josepius-clemson commented 1 year ago

Thanks for your input. I will try that.

I have a few other questions; I hope it's fine to ask here. Is it possible to train the model with two GPUs? Is there an option to partial fit so that I can batch process the data? I am currently getting a memory allocation error when I input the full dataset.


vinid commented 1 year ago

Is it possible to train the model with two GPUs?

Unfortunately no, the model currently runs on a single GPU.

Is there an option to partial fit so that I can batch process the data?

Could you elaborate on the partial fit? The model is already trained in batches; you can reduce the batch size in the configuration if it is too large. An additional question: how large is the vocabulary of your data?

josepius-clemson commented 1 year ago

Partial fit: I was looking for an option like this: https://maartengr.github.io/BERTopic/getting_started/online/online.html#example

len(docs) is 1627591 and len(tp.vocab) is 2000. How do I adjust the batch size?

josepius-clemson commented 1 year ago

Regarding bigrams: when I convert the input text as below, I am getting an error. A converted document looks like this:

[('Horrible', 'app'), ('app', '!'), ('!', 'I'), ('I', 'created'), ('created', 'my'), ('my', 'account'), ('account', 'but'), ('but', 'it'), ('it', 'wont'), ('wont', 'take'), ('take', 'me'), ('me', 'past'), ('past', 'the'), ('the', 'email/phone'), ('email/phone', 'verification'), ('verification', 'screen'), ('screen', '.'), ('.', 'Ive'), ('Ive', 'tried'), ('tried', 'everything'), ('everything', ','), (',', 'including'), ('including', 'un-installing'), ('un-installing', 'and'), ('and', 're-installing'), ('re-installing', ','), (',', 'but'), ('but', 'same'), ('same', 'thing'), ('thing', '.'), ('.', 'So'), ('So', 'frustrating'), ('frustrating', '!'), ('!', 'Get'), ('Get', 'it'), ('it', 'figured'), ('figured', 'out'), ('out', 'please'), ('please', '!'), ('!', '!'), ('!', '!'), ('!', '!')]

And the error:

----> 6 preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()
AttributeError: 'list' object has no attribute 'lower'

vinid commented 1 year ago

If you want a custom preprocessing pipeline, you cannot use the standard preprocessing (it expects plain strings, not lists of tuples).

It should be pretty easy to do: you first need to generate the contextualized representations and the bow representations manually. This is just an example; you will need to adapt it to your pipeline:

from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list

vectorizer = CountVectorizer()  # from sklearn: builds the bag-of-words matrix

train_bow_embeddings = vectorizer.fit_transform(text_for_bow)
train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, "chosen_contextualized_model")
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}  # integer id -> token

Then you can instantiate the CTMDataset class and use it for training.

Let me know if you need help!

vinid commented 1 year ago

Forgot to reply about the batch size: you can set it in the main class: https://github.com/MilaNLProc/contextualized-topic-models/blob/master/contextualized_topic_models/models/ctm.py#L21
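For example (a sketch assuming the batch_size keyword exposed by the constructor linked above):

from contextualized_topic_models.models.ctm import CombinedTM

# smaller batches lower per-step GPU memory use at the cost of slower epochs
ctm = CombinedTM(bow_size=2000, contextual_size=768, n_components=50, batch_size=32)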

josepius-clemson commented 1 year ago

Can you tell me how to instantiate the CTMDataset and use it for training? I am getting an error when I try to pass id2token to TopicModelDataPreparation(), and I did not understand how to prepare the training dataset with id2token. Please help.

vinid commented 1 year ago

Hi!

If you use your own preprocessing, you can skip the TopicModelDataPreparation part! Let's assume you have:

text_for_bow = ["your_first document", "a second document in_the_collection"]
text_for_contextual = ["your first document", "a second document in the collection"]

You can use this to generate the components needed to instantiate CTMDataset (you will find the definition of bert_embeddings_from_list here).

from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list

vectorizer = CountVectorizer()  # from sklearn

train_bow_embeddings = vectorizer.fit_transform(text_for_bow)
train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, "chosen_contextualized_model")
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}  # integer id -> token

Once you have these elements, you should be able to create a dataset object and use it for training:

from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.models.ctm import CombinedTM

training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)

ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=50)  # 50 topics

ctm.fit(training_dataset)  # run the model

josepius-clemson commented 1 year ago

Code:

from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.utils.data_preparation import  bert_embeddings_from_list
vectorizer = CountVectorizer() #from sklearn
train_bow_embeddings = vectorizer.fit_transform(preprocessed_documents)
train_contextualized_embeddings = bert_embeddings_from_list(unpreprocessed_corpus, "bert-base-nli-mean-tokens")
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}

I am getting the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      2 vectorizer = CountVectorizer() #from sklearn
      3 train_bow_embeddings = vectorizer.fit_transform(preprocessed_documents)
----> 4 train_contextualized_embeddings = bert_embeddings_from_list(unpreprocessed_corpus, "bert-base-nli-mean-tokens")
      5 vocab = vectorizer.get_feature_names_out()
      6 id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}

~/.local/lib/python3.9/site-packages/contextualized_topic_models/utils/data_preparation.py in bert_embeddings_from_list(texts, sbert_model_to_load, batch_size, max_seq_length)
     46     model.max_seq_length = max_seq_length
     47
---> 48     check_max_local_length(max_seq_length, texts)
     49
     50     return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))

~/.local/lib/python3.9/site-packages/contextualized_topic_models/utils/data_preparation.py in check_max_local_length(max_seq_length, texts)
     53 def check_max_local_length(max_seq_length, texts):
     54     max_local_length = np.max([len(t.split()) for t in texts])
---> 55     if max_local_length > max_seq_length:
     56         warnings.simplefilter('always', DeprecationWarning)
     57         warnings.warn(f"the longest document in your collection has {max_local_length} words, the model instead "

TypeError: '>' not supported between instances of 'int' and 'NoneType'

josepius-clemson commented 1 year ago

Code:

text_for_bow = ["your_first document", "a second document in_the_collection"]
text_for_contextual = ["your first document", "a second document in the collection"]
vectorizer = CountVectorizer() #from sklearn
train_bow_embeddings = vectorizer.fit_transform(text_for_bow)
train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, "bert-base-nli-mean-tokens")
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}

Error:

TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
      4
      5 train_bow_embeddings = vectorizer.fit_transform(text_for_bow)
----> 6 train_contextualized_embeddings = bert_embeddings_from_list(text_for_contextual, "bert-base-nli-mean-tokens")
      7 vocab = vectorizer.get_feature_names_out()
      8 id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}

~/.local/lib/python3.9/site-packages/contextualized_topic_models/utils/data_preparation.py in bert_embeddings_from_list(texts, sbert_model_to_load, batch_size, max_seq_length)
     46     model.max_seq_length = max_seq_length
     47
---> 48     check_max_local_length(max_seq_length, texts)
     49
     50     return np.array(model.encode(texts, show_progress_bar=True, batch_size=batch_size))

~/.local/lib/python3.9/site-packages/contextualized_topic_models/utils/data_preparation.py in check_max_local_length(max_seq_length, texts)
     53 def check_max_local_length(max_seq_length, texts):
     54     max_local_length = np.max([len(t.split()) for t in texts])
---> 55     if max_local_length > max_seq_length:
     56         warnings.simplefilter('always', DeprecationWarning)
     57         warnings.warn(f"the longest document in your collection has {max_local_length} words, the model instead "

TypeError: '>' not supported between instances of 'int' and 'NoneType'

vinid commented 1 year ago

Yes, you need to choose an embedding model (chosen_contextualized_model is a placeholder).

Also, please consider using code formatting; otherwise it's hard to read.

josepius-clemson commented 1 year ago

Sorry, I forgot to include the embedding model. I have updated the previous two messages with the error I am now getting.

vinid commented 1 year ago

bert_embeddings_from_list(text_for_contextual, "bert-base-nli-mean-tokens", max_seq_length=200)

should fix the issue. Note that the max seq length depends on the model.
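To see what a given SBERT model supports, a quick check (a sketch assuming sentence-transformers is installed; max_seq_length is its standard attribute):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")
print(model.max_seq_length)  # inputs longer than this are truncated at encoding time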

josepius-clemson commented 1 year ago

Yes, it worked, thanks. I assume I should pass bow_size=len(vocab). I ran the following code and it worked. Just to make sure, is this the correct code for training with bigrams?

from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
from sklearn.feature_extraction.text import CountVectorizer
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list

sp = WhiteSpacePreprocessingStopwords(docs, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()
vectorizer = CountVectorizer(ngram_range=(2, 2))  # from sklearn: bigram-only vocabulary
train_bow_embeddings = vectorizer.fit_transform(preprocessed_documents)
train_contextualized_embeddings = bert_embeddings_from_list(unpreprocessed_corpus, "bert-base-nli-mean-tokens", max_seq_length=200)
vocab = vectorizer.get_feature_names_out()
id2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}

from contextualized_topic_models.datasets.dataset import CTMDataset
from contextualized_topic_models.models.ctm import CombinedTM
import numpy as np

np.random.seed(42)
training_dataset = CTMDataset(train_contextualized_embeddings, train_bow_embeddings, id2token, labels=None)
ctm = CombinedTM(bow_size=len(vocab), contextual_size=768, n_components=12, num_epochs=25)  # 12 topics
ctm.fit(training_dataset)  # run the model
ctm.get_topic_lists(15)
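As a possible follow-up for inspecting document-level output (a sketch; get_doc_topic_distribution is assumed here to be the relevant method on the fitted model):

import numpy as np

topic_distributions = ctm.get_doc_topic_distribution(training_dataset)  # one row per document
top_topic_per_doc = np.argmax(topic_distributions, axis=1)  # most probable topic index per document
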
vinid commented 1 year ago

That should do the job, I think!