agemagician / ProtTrans

ProtTrans provides state-of-the-art pretrained language models for proteins. The models were trained on thousands of GPUs from Summit and hundreds of Google TPUs using Transformer models.
Academic Free License v3.0

Dealing with PDB files #82

Closed: sky-2002 closed this issue 2 years ago

sky-2002 commented 2 years ago

Hello @agemagician and @mheinzinger! I am doing some work that involves PDB files of proteins. Is there any support for mapping PDB files to embeddings that capture the structure of the protein? Or, if some other model deals with PDB files, please let me know. 😇

mheinzinger commented 2 years ago

Hey :) We currently only allow amino acid sequences as input. You could use, for example, the PDB module of Biopython to extract the sequence (FASTA) from a PDB file: https://biopython.org/docs/1.75/api/Bio.PDB.html From there on, you could use our models. If you want embeddings for structures, I would probably use ESM-IF1: https://github.com/facebookresearch/esm It would be interesting to see whether concatenating structure and sequence embeddings improves over either embedding type alone. :)
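For readers who want to see the sequence-extraction step concretely, here is a minimal sketch. It does not use Biopython's `Bio.PDB` as suggested above; it is a dependency-free stand-in that reads the fixed-column `ATOM` records directly, so it only handles simple cases. The function names `sequences_from_pdb_lines` and `to_prottrans_input` are hypothetical, and the formatting step (space-separated residues, rare residues `U`, `Z`, `O`, `B` mapped to `X`) assumes the preprocessing shown in the ProtTrans examples.

```python
import re

# Standard three-letter to one-letter residue codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def sequences_from_pdb_lines(lines):
    """Return {chain_id: sequence} built from the ATOM records of a PDB file.

    Only residues with resolved coordinates appear in ATOM records, so
    unresolved regions are silently absent from the returned sequences.
    HETATM records (e.g. modified residues) are ignored in this sketch.
    """
    chains = {}
    seen = set()
    for line in lines:
        if line.startswith("ENDMDL"):
            break  # keep only the first model of a multi-model file
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20].strip()      # columns 18-20: residue name
        chain_id = line[21]                 # column 22: chain identifier
        res_key = (chain_id, line[22:27])   # columns 23-27: number + insertion code
        if res_key in seen:
            continue  # another atom of a residue that was already recorded
        seen.add(res_key)
        # Residues outside the 20 standard ones become "X" (assumption).
        chains.setdefault(chain_id, []).append(THREE_TO_ONE.get(res_name, "X"))
    return {cid: "".join(residues) for cid, residues in chains.items()}


def to_prottrans_input(sequence):
    """Map rare residues to X and separate residues with spaces."""
    return " ".join(re.sub(r"[UZOB]", "X", sequence))
```

A typical use would be `with open("structure.pdb") as fh: chains = sequences_from_pdb_lines(fh)`, where the path is a placeholder, followed by `to_prottrans_input(chains["A"])` for the chain of interest. For real-world files with alternate locations, modified residues, or mmCIF format, a full parser such as Biopython's is the safer choice.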