Bitbol-Lab / ProtMamba-ssm

ProtMamba: a homology-aware but alignment-free protein state space model
https://www.biorxiv.org/content/10.1101/2024.05.24.595730v1
Apache License 2.0

max_position_embeddings and vocab_size #7

Closed: david-arredondo closed this issue 1 month ago

david-arredondo commented 1 month ago

I am trying to evaluate the embeddings generated by ProtMamba, and I have a few questions:

  1. The pre-trained model has max_position_embeddings = 2048, which prevents me from passing any amino acid sequence longer than that. Since I only want the embedding of a single sequence, which may be longer than 2048 residues but is certainly within the context length of the model, is there any way to bypass this restriction? (A toy illustration of what I mean is below.)

  2. What is the meaning of vocab_size = 50277?
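
To make point 1 concrete, this is roughly what I run into (a self-contained toy check, not ProtMamba code):

```python
# Toy illustration of the restriction; not ProtMamba code.
MAX_POSITION_EMBEDDINGS = 2048

seq = "M" + "A" * 2499  # one 2500-residue sequence

if len(seq) > MAX_POSITION_EMBEDDINGS:
    print(f"sequence of {len(seq)} residues needs position ids up to {len(seq) - 1}, "
          f"but the model only has {MAX_POSITION_EMBEDDINGS} positional embeddings")
```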

Thanks!

damiano-sg commented 1 month ago

Hi, sorry for the delay:

  1. For now we decided to fix the maximum positional embedding at 2048, so each individual sequence shouldn't be longer than that. This obviously doesn't limit the context length of the model: in principle you can feed hundreds of sequences as input, as long as each of them is shorter than 2048 residues (rough sketch below). We plan to increase the maximum allowed sequence length in the next versions of the model.
  2. vocab_size = 50277 is the default value used by Mamba; we overwrite it to 38 in our config.
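
Here is the rough sketch for point 1. The tokenization and the `build_context` helper are placeholders rather than our actual code; the key point is that the length check applies per sequence, so the concatenated context can be much longer than 2048 tokens:

```python
# Rough sketch with placeholder tokenization; not the actual ProtMamba code.
import torch

MAX_POSITION_EMBEDDINGS = 2048  # per-sequence cap in the current pre-trained model

def build_context(seqs):
    """Concatenate several sequences into one context, checking each against the cap."""
    tokens, pos_ids = [], []
    for seq in seqs:
        if len(seq) > MAX_POSITION_EMBEDDINGS:
            raise ValueError(
                f"sequence of {len(seq)} residues exceeds the {MAX_POSITION_EMBEDDINGS}-residue cap"
            )
        # placeholder tokenization: one integer per residue (the real vocabulary has 38 tokens)
        tokens.extend(ord(c) % 38 for c in seq)
        # positions are counted within each sequence here (placeholder indexing)
        pos_ids.extend(range(len(seq)))
    return torch.tensor([tokens]), torch.tensor([pos_ids])

# three ~1500-residue sequences: the full context is ~4500 tokens, well past 2048,
# but no single sequence breaks the per-sequence limit
seqs = ["M" + "A" * 1499, "M" + "G" * 1499, "M" + "L" * 1499]
input_ids, position_ids = build_context(seqs)
print(input_ids.shape, position_ids.max().item())  # torch.Size([1, 4500]) 1499
```
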
david-arredondo commented 1 month ago

Thank you for your reply. I will keep an eye out for the next version!