ArikReuter / TopicGPT

TopicGPT integrates the benefits of LLMs into topic modelling
https://lmu-seminar-llms.github.io/TopicGPT/
MIT License
37 stars, 9 forks

IndexError #9

Open franck-nkolongo opened 2 days ago

franck-nkolongo commented 2 days ago

Hello, I have a problem. With:

    reviews = list(review_data[2])
    reviews = reviews[:5000]  # only consider the first 5k reviews

I get:

IndexError: boolean index did not match indexed array along dimension 0; dimension is 5000 but corresponding boolean dimension is 1000.

It works with `reviews = reviews[:1000]`.

deepbot86 commented 21 hours ago

Same here:

`
File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/topicgpt/TopicRepresentation.py:310, in extract_topics_no_new_vocab_computation(corpus, vocab, document_embeddings, clusterer, vocab_embeddings, n_topwords, topword_extraction_methods, consider_outliers)
    306 dim_red_centroids = umap_mapper.transform(np.array(list(centroid_dict.values()))) # map the centroids to low dimensional space
    308 dim_red_centroid_dict = {label: centroid for label, centroid in zip(centroid_dict.keys(), dim_red_centroids)}
--> 310 word_topic_mat = extractor.compute_word_topic_mat(corpus, vocab, labels, consider_outliers = consider_outliers) # compute the word-topic matrix of the corpus
    311 if "tfidf" in topword_extraction_methods:
    312     tfidf_topwords, tfidf_dict = extractor.extract_topwords_tfidf(word_topic_mat = word_topic_mat, vocab = vocab, labels = labels, top_n_words = n_topwords) # extract the top-words according to tfidf

File ~/anaconda3/envs/pytorch_p310/lib/python3.10/site-packages/topicgpt/ExtractTopWords.py:308, in ExtractTopWords.compute_word_topic_mat(self, corpus, vocab, labels, consider_outliers)
    305 word_topic_mat = np.zeros((len(vocab), len((np.unique(labels)))))
    307 for i, label in tqdm(enumerate(np.unique(labels)), desc="Computing word-topic matrix", total=len(np.unique(labels))):
--> 308     topic_docs = corpus_arr[labels == label]
    309     topic_doc_string = " ".join(topic_docs)
    310     topic_doc_words = word_tokenize(topic_doc_string)

IndexError: boolean index did not match indexed array along dimension 0; dimension is 6969 but corresponding boolean dimension is 4999
`
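For anyone trying to narrow this down: the failing line in the traceback is `corpus_arr[labels == label]`, and NumPy raises this `IndexError` whenever the boolean mask and the indexed array have different lengths along the first axis. The sketch below reproduces that with toy data and adds a length check that fails with a clearer message. It assumes the mismatch is between the number of documents and the number of cluster labels, which the traceback suggests but does not prove. `select_topic_docs` is a hypothetical helper written for illustration, not part of topicgpt.

```python
import numpy as np


def select_topic_docs(corpus, labels, label):
    """Return the documents assigned to `label`, checking lengths first.

    Hypothetical helper for illustration only; not part of topicgpt.
    """
    corpus_arr = np.array(corpus)
    labels = np.asarray(labels)
    if corpus_arr.shape[0] != labels.shape[0]:
        raise ValueError(
            f"corpus has {corpus_arr.shape[0]} documents but labels has "
            f"{labels.shape[0]} entries"
        )
    return corpus_arr[labels == label]


# Toy data: 5 documents but only 3 labels, mimicking the reported mismatch.
corpus = [f"doc {i}" for i in range(5)]
labels = np.array([0, 1, 0])

# Indexing directly with a mask of the wrong length raises IndexError.
try:
    np.array(corpus)[labels == 0]
    mismatch_message = None
except IndexError as err:
    mismatch_message = str(err)

# The guarded helper reports the two lengths instead.
try:
    select_topic_docs(corpus, labels, 0)
    guard_message = None
except ValueError as err:
    guard_message = str(err)
```

If the two lengths differ in your own run, comparing `len(corpus)` with the length of whatever produced the labels (for example the embeddings passed to the clusterer) may show where the counts diverge.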