Open · ericchagnon15 opened 1 year ago
I'm trying to use this model with BERTopic in Google Colab for topic modeling, but I'm unable to run it. The data is a subset of the arXiv dataset, with each paper's title and abstract concatenated into a single document.

When the fit_transform() method is called, the following error occurs:

```
RuntimeError                              Traceback (most recent call last)
<ipython-input-...> in <module>
      5
      6 topic_model = BERTopic(embedding_model=ASPIRE, language="english", nr_topics="auto", verbose=True)
----> 7 topics, probs = topic_model.fit_transform(less_docs)

12 frames
/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    235         if self.position_embedding_type == "absolute":
    236             position_embeddings = self.position_embeddings(position_ids)
--> 237             embeddings += position_embeddings
    238         embeddings = self.LayerNorm(embeddings)
    239         embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (541) must match the size of tensor b (512) at non-singleton dimension 1
```

I have not used the model in this manner before, so I can't say definitively what is wrong. From the looks of it, though, the problem may be that the tokenizer is not truncating the input documents to the model's 512-token limit: the error shows a 541-token sequence being added to position embeddings of length 512. If BERTopic has an option to truncate the input documents, you can try that. Otherwise, you can manually truncate the individual documents of arxiv_docs to roughly 450 (whitespace-tokenized) tokens; sketches of both approaches follow below.
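For the manual route, here is a minimal sketch of whitespace truncation. `truncate_doc` and the 450-token budget are illustrative, not part of any library API; `arxiv_docs` and `topic_model` are the names from the issue. The margin below 512 is there because the model's subword tokenizer usually produces more tokens than whitespace splitting does:

```python
# Minimal sketch: cap each document at ~450 whitespace tokens so that
# the subword tokenizer stays under the model's 512-position limit.
# `truncate_doc` is a hypothetical helper, not part of BERTopic.

def truncate_doc(doc: str, max_tokens: int = 450) -> str:
    return " ".join(doc.split()[:max_tokens])

truncated_docs = [truncate_doc(d) for d in arxiv_docs]
topics, probs = topic_model.fit_transform(truncated_docs)
```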
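Alternatively, if the ASPIRE checkpoint is wrapped in a sentence-transformers model (which BERTopic accepts as `embedding_model`), truncation can be enforced at the tokenizer level via `max_seq_length`. This is a sketch under that assumption: the checkpoint name is a placeholder for whichever ASPIRE variant you loaded as `ASPIRE`, and mean pooling is an assumed choice, not necessarily how ASPIRE is meant to be pooled:

```python
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, models

# Placeholder checkpoint name -- substitute the ASPIRE variant you use.
word_embeddings = models.Transformer(
    "allenai/aspire-sentence-embedder",
    max_seq_length=512,  # tokenizer truncates inputs to 512 subword tokens
)
# Mean pooling over token embeddings (an assumption for this sketch).
pooling = models.Pooling(word_embeddings.get_word_embedding_dimension())
aspire = SentenceTransformer(modules=[word_embeddings, pooling])

topic_model = BERTopic(embedding_model=aspire, language="english",
                       nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(arxiv_docs)
```

With truncation handled inside the embedding model, the documents can be passed to fit_transform() at full length.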