facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Inverse folding model crashes with 3BTA #197

Closed konstin closed 2 years ago

konstin commented 2 years ago

Bug description

The inverse folding model crashes when the basic example is run on the 3BTA PDB file.

Reproduction steps

Download 3BTA.pdb (saved locally as 3bta.pdb)

Run

import esm

structure = esm.inverse_folding.util.load_structure("3bta.pdb", "A")
coords, seq = esm.inverse_folding.util.extract_coords_from_structure(structure)
model, alphabet = esm.pretrained.esm_if1_gvp4_t16_142M_UR50()
rep = esm.inverse_folding.util.get_encoder_output(model, alphabet, coords)

Expected behavior

The call should succeed even with a ZN (zinc) residue in the structure, so that rep contains the encoder embeddings.

Logs

Found 1 chains: ['A']

Loaded chain A

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [11], in <cell line: 4>()
      1 import esm
      3 structure = esm.inverse_folding.util.load_structure("3bta.pdb", "A")
----> 4 coords, seq = esm.inverse_folding.util.extract_coords_from_structure(structure)
      5 rep = esm.inverse_folding.util.get_encoder_output(model, alphabet, coords)

File /mnt/project/seqvec-search/pp1cb_ss22_structural_embeddings/.venv/lib/python3.8/site-packages/esm/inverse_folding/util.py:69, in extract_coords_from_structure(structure)
     67 coords = get_atom_coords_residuewise(["N", "CA", "C"], structure)
     68 residue_identities = get_residues(structure)[1]
---> 69 seq = ''.join([ProteinSequence.convert_letter_3to1(r) for r in residue_identities])
     70 return coords, seq

File /mnt/project/seqvec-search/pp1cb_ss22_structural_embeddings/.venv/lib/python3.8/site-packages/esm/inverse_folding/util.py:69, in <listcomp>(.0)
     67 coords = get_atom_coords_residuewise(["N", "CA", "C"], structure)
     68 residue_identities = get_residues(structure)[1]
---> 69 seq = ''.join([ProteinSequence.convert_letter_3to1(r) for r in residue_identities])
     70 return coords, seq

File /mnt/project/seqvec-search/pp1cb_ss22_structural_embeddings/.venv/lib/python3.8/site-packages/biotite/sequence/seqtypes.py:512, in ProteinSequence.convert_letter_3to1(symbol)
    497 @staticmethod
    498 def convert_letter_3to1(symbol):
    499     """
    500     Convert a 3-letter to a 1-letter amino acid representation.
    501     
   (...)
    510         1-letter amino acid representation.
    511     """
--> 512     return ProteinSequence._dict_3to1[symbol.upper()]

KeyError: 'ZN'

Additional context

I think the code is choking on the additional zinc ion that's in the structure.

biotite==0.32.0 (latest version) is installed
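
To illustrate the suspected failure mode, here is a minimal, self-contained sketch of a 3-letter to 1-letter conversion that skips unknown residue names such as ZN instead of raising KeyError. It does not use esm or biotite; the THREE_TO_ONE table and the residues_to_sequence helper are hypothetical names made up for this example, and this is not necessarily how the library resolves the problem.

```python
# Hypothetical illustration only: not code from esm or biotite.
# Maps the 20 standard amino acid 3-letter codes to 1-letter codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_to_sequence(residue_names):
    """Convert residue names to a 1-letter sequence, skipping unknown names.

    Returns the sequence and the list of skipped names (e.g. ions such as ZN).
    """
    kept = []
    skipped = []
    for name in residue_names:
        key = name.upper()
        if key in THREE_TO_ONE:
            kept.append(THREE_TO_ONE[key])
        else:
            # A plain dict lookup would raise KeyError here, as in the log above.
            skipped.append(name)
    return "".join(kept), skipped


if __name__ == "__main__":
    seq, skipped = residues_to_sequence(["MET", "PRO", "ZN", "GLY"])
    print(seq, skipped)
```

In a real pipeline the skipped residues would also have to be dropped from the coordinate array so that coords and seq stay the same length. Filtering non-protein residues out of the structure before extracting coordinates is probably the cleaner route.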

tomsercu commented 2 years ago

This should be resolved now with #205 - please reopen if you still see this issue!