For a given PDB file, the native sequence extracted from the file may have a different length than the variant sequences of interest. I want to score those variant sequences using the ESM-IF model, but the score_log_likelihoods script can't evaluate them because of a tensor size mismatch.
For example:
operands could not be broadcast together with shapes (443,) (358,)
I tried padding the sequences to equal length with - in the variants input file, but that doesn't work because "-" is not in the alphabet object. (I think this is a constraint of the biotite.Alphabet and ProteinSequence objects, not necessarily of the model vocabulary.)
Is there a way to use padding, or to represent insertions or deletions, in the input variant tokenization? Is it possible to implement this simply? I don't know enough about the internals of the esm package to pull this off with a simple workaround.
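To make the constraint concrete, here is a minimal sketch of a stopgap pre-check, assuming the only hard requirement is that each variant has the same length as the native sequence. The helper name split_by_length and the toy sequences are hypothetical and not part of esm. It does not score anything, it only separates variants the script can presumably handle from those containing indels.

```python
from typing import Dict, Tuple


def split_by_length(
    native_seq: str, variants: Dict[str, str]
) -> Tuple[Dict[str, str], Dict[str, int]]:
    """Split variants into same-length ones and length-mismatched ones.

    Hypothetical helper, not part of the esm package.
    Returns (scorable, mismatched), where mismatched maps each variant
    name to its length difference relative to the native sequence.
    """
    scorable: Dict[str, str] = {}
    mismatched: Dict[str, int] = {}
    for name, seq in variants.items():
        if len(seq) == len(native_seq):
            scorable[name] = seq
        else:
            # Positive means net insertion, negative means net deletion.
            mismatched[name] = len(seq) - len(native_seq)
    return scorable, mismatched


# Toy data for illustration only.
native = "MKTAYIAK"
variants = {
    "sub": "MKTAYIAR",   # substitution, same length
    "del": "MKTYIAK",    # one residue deleted
    "ins": "MKTAAYIAK",  # one residue inserted
}
scorable, mismatched = split_by_length(native, variants)
print(scorable)
print(mismatched)
```

This only works around the crash. The variants with indels are exactly the ones I still need a way to score.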
Reproduction steps
score_log_likelihoods.py ...
Expected behavior
I want to be able to evaluate sequences that may contain insertions or deletions with respect to the "native" sequence from the PDB.