facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Strict length requirement for mutant sequence scoring via score_log_likelihoods #401

Open adalisan opened 1 year ago

adalisan commented 1 year ago

For a given PDB file, the native sequence extracted from the structure may differ in length from the variant sequences of interest. I want to score those variant sequences with the ESM-IF model, but the score_log_likelihoods script cannot evaluate them because of a tensor size mismatch. For example: operands could not be broadcast together with shapes (443,) (358,)
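To make the failure condition concrete, here is a minimal sketch of a pre-check that lists which variants differ in length from the native sequence before they reach the scoring script. The function name `find_length_mismatches` and the toy sequences are hypothetical and not part of the esm package.

```python
def find_length_mismatches(native_seq, variants):
    """Return (name, length) for every variant whose length differs from the native.

    `variants` maps a variant name to its sequence string.
    This is an illustrative helper, not an esm API.
    """
    return [
        (name, len(seq))
        for name, seq in variants.items()
        if len(seq) != len(native_seq)
    ]


# Toy data: one substitution, one deletion, one insertion relative to the native.
native = "MKTAYIAK"
variants = {
    "same_len": "MKTAYIAR",
    "deletion": "MKTYIAK",
    "insertion": "MKTAAYIAK",
}

mismatches = find_length_mismatches(native, variants)
print(mismatches)
```

Only the substitution variant has the native length, so it is the only one I would expect the script to accept as-is.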

I tried using "-" in the variants input file to pad the sequences to equal length, but that does not work because "-" is not in the alphabet object. (I think this is a constraint of the biotite.Alphabet and ProteinSequence objects, not necessarily of the model vocabulary.)
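A rough illustration of why the padded input is rejected, assuming the sequence is validated against a fixed symbol set. The set below is a stand-in containing only the 20 canonical amino acids; it is an assumption for illustration and is not the actual alphabet used by biotite or by the model. `unsupported_symbols` is a hypothetical helper.

```python
# Stand-in alphabet: the 20 canonical amino acids (an assumption, not biotite's alphabet).
CANONICAL_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def unsupported_symbols(seq, alphabet=CANONICAL_AMINO_ACIDS):
    """Return the sorted list of symbols in `seq` that are not in `alphabet`."""
    return sorted(set(seq) - alphabet)


padded_variant = "MKT--YIAK"    # deletion padded with "-" to match the native length
print(unsupported_symbols(padded_variant))
```

Any symbol reported here would have to be either added to the alphabet or stripped before encoding, which is why plain "-" padding fails at the tokenization step rather than in the model.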

Is there a way to use padding, or to represent insertions and deletions, in the input variant tokenization? Can this be implemented simply? I don't know enough about the internals of the esm package to pull this off with a simple workaround.

Reproduction steps: score_log_likelihoods.py ...

Expected behavior: I want to be able to evaluate sequences that may contain insertions or deletions with respect to the "native" sequence from the PDB file.
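To show what I mean by insertions and deletions relative to the native sequence, here is a small sketch that classifies the differences using Python's standard-library difflib. It only locates the edits; it does not score anything and does not touch the esm package. `describe_edits` is a hypothetical name, and difflib's matching is a simple heuristic, not a proper biological sequence alignment.

```python
from difflib import SequenceMatcher


def describe_edits(native_seq, variant_seq):
    """List the edits that turn `native_seq` into `variant_seq`.

    Each entry is (operation, native_position, native_residues, variant_residues),
    where operation is "replace", "delete", or "insert" and the position is a
    0-based index into the native sequence.
    """
    # autojunk=False disables difflib's "popular element" heuristic for long inputs.
    matcher = SequenceMatcher(None, native_seq, variant_seq, autojunk=False)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, i1, native_seq[i1:i2], variant_seq[j1:j2]))
    return edits


native = "ACDEFGHIK"
deletion_edits = describe_edits(native, "ACDFGHIK")       # E removed
insertion_edits = describe_edits("ACDFGHIK", native)      # E added
print(deletion_edits)
print(insertion_edits)
```

With positions like these, one could imagine mapping each variant residue back to a native structure position, but whether the model can score residues that have no corresponding backbone coordinates is exactly the open question.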