allenai / aspire

Repo for Aspire - A scientific document similarity model based on matching fine-grained aspects of scientific papers.

Tensor sizes not matching #3

Open ericchagnon15 opened 1 year ago

ericchagnon15 commented 1 year ago

I'm trying to use this model in Google Colab with BERTopic for topic modeling, but I'm unable to run the model. I'm using a subset of the arXiv dataset, with each document being the concatenated title and abstract of a paper.

from transformers import pipeline
from bertopic import BERTopic

# Load the Aspire sentence embedder as a feature-extraction pipeline
ASPIRE = pipeline("feature-extraction", model="allenai/aspire-sentence-embedder")

# Fit BERTopic on a small subset of the corpus
less_docs = arxiv_docs[:200]
topic_model = BERTopic(embedding_model=ASPIRE, language="english", nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(less_docs)

When fit_transform() is called, the following error occurs:

RuntimeError                              Traceback (most recent call last)
in
      5
      6 topic_model = BERTopic(embedding_model=ASPIRE, language="english", nr_topics="auto", verbose=True )
----> 7 topics, probs = topic_model.fit_transform(less_docs)

12 frames
/usr/local/lib/python3.8/dist-packages/transformers/models/bert/modeling_bert.py in forward(self, input_ids, token_type_ids, position_ids, inputs_embeds, past_key_values_length)
    235         if self.position_embedding_type == "absolute":
    236             position_embeddings = self.position_embeddings(position_ids)
--> 237             embeddings += position_embeddings
    238         embeddings = self.LayerNorm(embeddings)
    239         embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (541) must match the size of tensor b (512) at non-singleton dimension 1
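For reference, this is a quick check of how long the documents are after tokenization (a rough sketch, assuming less_docs is the list of strings from above):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("allenai/aspire-sentence-embedder")

# Count tokens per document without truncation to see which ones exceed
# BERT's 512-token position-embedding limit
lengths = [len(tok(doc, truncation=False)["input_ids"]) for doc in less_docs]
print("max length:", max(lengths), "| docs over 512 tokens:", sum(l > 512 for l in lengths))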

MSheshera commented 1 year ago

I have not used the model in this manner before, so I'm not sure I can say definitively what is wrong. From the looks of it, though, the problem is likely that the tokenizer is not truncating the input documents to 512 tokens. If BERTopic has an option to truncate the input documents, you can try that. Otherwise, you can manually truncate the individual documents in arxiv_docs to roughly 450 (whitespace-tokenized) tokens; see the sketch below.
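Something along these lines might work (a rough sketch, not tested with BERTopic; the tokenize_kwargs argument to pipeline() is only available in recent transformers versions, so treat that part as an assumption to verify):

from transformers import pipeline
from bertopic import BERTopic

# Option 1: ask the feature-extraction pipeline to truncate inputs at the
# model's 512-token limit (via tokenize_kwargs, if your transformers version supports it)
ASPIRE = pipeline(
    "feature-extraction",
    model="allenai/aspire-sentence-embedder",
    tokenize_kwargs={"truncation": True, "max_length": 512},
)

# Option 2: manually shorten each document to roughly 450 whitespace-tokenized
# tokens before handing it to BERTopic
short_docs = [" ".join(doc.split()[:450]) for doc in less_docs]

topic_model = BERTopic(embedding_model=ASPIRE, language="english", nr_topics="auto", verbose=True)
topics, probs = topic_model.fit_transform(short_docs)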