parse_PDB returns shorter sequence than pdb file

dauparas / ProteinMPNN

Code for the ProteinMPNN paper

MIT License


parse_PDB returns shorter sequence than pdb file #28

Closed by adrienchaton 1 year ago

adrienchaton commented 1 year ago

Hello,

I am just starting to use your model, and it looks like very good work! When I follow your quickdemo.ipynb example but use my own PDB file, the parsed sequence doesn't match the sequence in the file.

The sequence in the PDB file has 297 residues (all from the 20 natural amino acids), but pdb_dict_list[0][f"seq_chain_{chain}"] is only 296 residues long. The first residue, an M, has been cropped.

Unfortunately I don't think I can share the PDB file, but I wonder if there's anything I can check that could cause the first residue to go missing after parsing the PDB with your utils.

Thanks for any hints!

adrienchaton commented 1 year ago

I tried with a PDB generated with ColabFold from the full sequence of 297 residues. When parsing this PDB, your code returns a sequence of the correct length, so I guess the PDB I had has some unusual formatting.

dauparas commented 1 year ago

Hello!

Are any of the atoms (N, CA, C, or O) missing in the PDB file for the first residue? You could try running with --ca_only True and see if that returns the correct length of the sequence.
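One way to check this without running the model is to scan the file directly. The snippet below is only a rough sketch and not part of ProteinMPNN: the function name `find_incomplete_residues` is made up for illustration, and it assumes a standard fixed-column PDB file in which the residues of interest are stored as ATOM records. It lists every residue of a chain that lacks at least one of N, CA, C, or O.

```python
# Backbone atoms asked about above.
BACKBONE = ("N", "CA", "C", "O")


def find_incomplete_residues(pdb_lines, chain_id):
    """Hypothetical helper: return (residue id, residue name, missing atoms)
    for each residue in `chain_id` that lacks a backbone atom.

    Assumes fixed-column PDB ATOM records; HETATM records are ignored.
    """
    atoms_by_residue = {}  # insertion-ordered: (res_id, res_name) -> atom names
    for line in pdb_lines:
        if not line.startswith("ATOM") or len(line) < 27:
            continue
        if line[21] != chain_id:  # column 22: chain identifier
            continue
        atom_name = line[12:16].strip()  # columns 13-16: atom name
        res_name = line[17:20].strip()   # columns 18-20: residue name
        res_id = line[22:27].strip()     # columns 23-27: number + insertion code
        atoms_by_residue.setdefault((res_id, res_name), set()).add(atom_name)

    report = []
    for (res_id, res_name), atoms in atoms_by_residue.items():
        missing = [name for name in BACKBONE if name not in atoms]
        if missing:
            report.append((res_id, res_name, missing))
    return report


# Usage (the file name is a placeholder):
# with open("my_structure.pdb") as handle:
#     for res_id, res_name, missing in find_incomplete_residues(handle, "A"):
#         print(res_id, res_name, "is missing", missing)
```

If the first residue shows up in this report, that would be consistent with it being dropped from the parsed sequence.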

adrienchaton commented 1 year ago

Thanks! That's indeed an issue with the PDB file I had, sorry. I opened both the experimental file and the ColabFold prediction: in the experimental one, some atoms are missing in the first residue. Using the ColabFold PDB, your code runs without issue :)

dauparas commented 1 year ago

Sounds good! Thanks for reaching out!