ArikReuter / TopicGPT

TopicGPT integrates the benefits of LLMs into topic modelling
https://lmu-seminar-llms.github.io/TopicGPT/
MIT License
66 stars, 13 forks

IndexError #9

Open franck-nkolongo opened 2 months ago

franck-nkolongo commented 2 months ago

Hello, I have a problem with this code:

`reviews = list(review_data[2])`
`reviews = reviews[:5000] # only consider the first 5k reviews`

It raises:

IndexError: boolean index did not match indexed array along dimension 0; dimension is 5000 but corresponding boolean dimension is 1000.

It works with `reviews = reviews[:1000]`.

deepbot86 commented 2 months ago

Same here:

```
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/topicgpt/TopicRepresentation.py:310, in extract_topics_no_new_vocab_computation(corpus, vocab, document_embeddings, clusterer, vocab_embeddings, n_topwords, topword_extraction_methods, consider_outliers)
    306 dim_red_centroids = umap_mapper.transform(np.array(list(centroid_dict.values()))) # map the centroids to low dimensional space
    308 dim_red_centroid_dict = {label: centroid for label, centroid in zip(centroid_dict.keys(), dim_red_centroids)}
--> 310 word_topic_mat = extractor.compute_word_topic_mat(corpus, vocab, labels, consider_outliers = consider_outliers) # compute the word-topic matrix of the corpus
    311 if "tfidf" in topword_extraction_methods:
    312     tfidf_topwords, tfidf_dict = extractor.extract_topwords_tfidf(word_topic_mat = word_topic_mat, vocab = vocab, labels = labels, top_n_words = n_topwords) # extract the top-words according to tfidf

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/topicgpt/ExtractTopWords.py:308, in ExtractTopWords.compute_word_topic_mat(self, corpus, vocab, labels, consider_outliers)
    305 word_topic_mat = np.zeros((len(vocab), len((np.unique(labels)))))
    307 for i, label in tqdm(enumerate(np.unique(labels)), desc="Computing word-topic matrix", total=len(np.unique(labels))):
--> 308     topic_docs = corpus_arr[labels == label]
    309     topic_doc_string = " ".join(topic_docs)
    310     topic_doc_words = word_tokenize(topic_doc_string)

IndexError: boolean index did not match indexed array along dimension 0; dimension is 6969 but corresponding boolean dimension is 4999
```
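For anyone trying to understand the traceback: the failing line indexes the corpus array with a boolean mask built from `labels`, so the error appears whenever the two have different lengths. The snippet below is only a minimal sketch of that mechanism with made-up sizes (6 documents, 4 labels). It does not call TopicGPT, and the idea that the labels come from an older, smaller run is an inference from this thread rather than something confirmed in the library code.

```python
import numpy as np

# Hypothetical stand-ins: the current corpus has 6 documents ...
corpus_arr = np.array([f"doc {i}" for i in range(6)])
# ... but the cluster labels were computed for only 4 documents.
labels = np.array([0, 1, 0, 1])

error_message = None
try:
    # Same pattern as the failing line: boolean mask of length 4 on an array of length 6.
    topic_docs = corpus_arr[labels == 0]
except IndexError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message differs between NumPy versions, but it always reports the two mismatched lengths, which is how the 5000 vs. 1000 and 6969 vs. 4999 pairs above can be read.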

franck-nkolongo commented 2 months ago

> 4999

I've found the solution: delete the `SavedEmbeddings` directory, which contains the `embeddings.pkl` file. In my case that file was created during an earlier run with 1000 documents, so the cached embeddings no longer matched the 5000-document corpus. In your case, you must have first run it with a dataset of 4999 documents.
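If you prefer to do the cleanup from Python, here is a small sketch using only the standard library. The helper name `clear_cached_embeddings` is hypothetical, and the default directory name `SavedEmbeddings` is an assumption: pass whatever directory your own run actually created.

```python
import shutil
from pathlib import Path


def clear_cached_embeddings(cache_dir="SavedEmbeddings"):
    """Delete the cached-embeddings directory if it exists.

    Returns True if a directory was removed, False if there was nothing to remove.
    The default directory name is an assumption, not taken from the library.
    """
    path = Path(cache_dir)
    if path.is_dir():
        shutil.rmtree(path)  # removes the directory and the embeddings.pkl inside it
        return True
    return False
```

Call it once before refitting on a corpus of a different size, for example `clear_cached_embeddings()`, so the embeddings are recomputed for the new corpus instead of being loaded from the old file.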