facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

atom name all left justified #431

Open kad-ecoli opened 1 year ago

kad-ecoli commented 1 year ago

Bug description In all PDB files provided by the ESM Atlas, the atom name in columns 13-16 of each ATOM record is always left-justified. This does not match the standard formatting specified by the PDB format at https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#ATOM

Reproduction steps For example, the first residue of MGYP000911143359 is

ATOM      1 N    MET A   1     -26.091  68.903   7.841  1.00  0.90          N   
ATOM      2 CA   MET A   1     -26.275  67.677   7.069  1.00  0.91          C   
ATOM      3 C    MET A   1     -24.933  67.025   6.755  1.00  0.90          C   
ATOM      4 CB   MET A   1     -27.033  67.967   5.773  1.00  0.89          C   
ATOM      5 O    MET A   1     -24.314  67.331   5.734  1.00  0.90          O   
ATOM      6 CG   MET A   1     -28.544  67.973   5.934  1.00  0.86          C   
ATOM      7 SD   MET A   1     -29.390  68.904   4.598  1.00  0.86          S   
ATOM      8 CE   MET A   1     -29.202  67.734   3.224  1.00  0.83          C   

Expected behavior In standard PDB format, the above residue should instead read

ATOM      1  N   MET A   1     -26.091  68.903   7.841  1.00  0.90          N   
ATOM      2  CA  MET A   1     -26.275  67.677   7.069  1.00  0.91          C   
ATOM      3  C   MET A   1     -24.933  67.025   6.755  1.00  0.90          C   
ATOM      4  CB  MET A   1     -27.033  67.967   5.773  1.00  0.89          C   
ATOM      5  O   MET A   1     -24.314  67.331   5.734  1.00  0.90          O   
ATOM      6  CG  MET A   1     -28.544  67.973   5.934  1.00  0.86          C   
ATOM      7  SD  MET A   1     -29.390  68.904   4.598  1.00  0.86          S   
ATOM      8  CE  MET A   1     -29.202  67.734   3.224  1.00  0.83          C   

Additional context Although the non-standard atom name justification does not affect visualization of the PDB file, it does affect some structure analysis tools, such as US-align and REDUCE.
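The difference between the two listings above can be sketched in a few lines of Python. This is only an illustration, not ESM's code. The names `justify_atom_name` and `fix_pdb_text` are hypothetical. It assumes the common convention that an atom name shorter than four characters with a one-letter element symbol starts in column 14, while four-character names and names of two-letter elements start in column 13. The element is read from columns 77-78, and if that field is blank the sketch assumes a one-letter element.

```python
def justify_atom_name(line: str) -> str:
    """Re-justify the atom name (columns 13-16) of one ATOM/HETATM line.

    Hypothetical helper: other record types are returned unchanged.
    """
    if not line.startswith(("ATOM  ", "HETATM")) or len(line) < 16:
        return line

    name = line[12:16].strip()       # columns 13-16, 0-based slice 12:16
    element = line[76:78].strip()    # columns 77-78; may be empty on short lines

    if len(name) >= 4 or len(element) == 2:
        # Four-character names and two-letter elements start in column 13.
        field = name.ljust(4)
    else:
        # Assumption: one-letter element, so the name starts in column 14.
        field = (" " + name).ljust(4)

    # Only the four name columns change; everything else is kept as-is.
    return line[:12] + field + line[16:]


def fix_pdb_text(text: str) -> str:
    """Apply justify_atom_name to every line of a PDB file's contents."""
    return "".join(justify_atom_name(l) for l in text.splitlines(keepends=True))
```

Because only columns 13-16 are rewritten, a line that is already in the standard layout is left unchanged, so running the sketch over a mixed set of files should not alter the correctly formatted ones.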

tomsercu commented 1 year ago

Thanks for flagging! @nikitos9000 Let's fix this in our internal and released esmfold infer_pdb functions simultaneously so new predictions are formatted correctly. Unfortunately we won't be able to fix the existing predictions in the Atlas.

kad-ecoli commented 1 year ago

I have prepared a C++ program at https://github.com/pylelab/USalign/blob/master/pdbAtomName.cpp that can very quickly fix the atom names in old ESM Atlas PDB files. With this program, it should be pretty easy to fix all existing predictions in a few days. The program can be compiled with:

git clone https://github.com/pylelab/USalign.git
cd USalign
make pdbAtomName