ParmEd / ParmEd

Parameter/topology editor and molecular simulator
https://parmed.github.io/ParmEd
393 stars 148 forks source link

Mol2 parser error with multiple residues #1029

Open austinjpaul opened 5 years ago

austinjpaul commented 5 years ago

When trying to load a mol2 file as a structure, I'm getting an error from the head/tail residue bond logic:

[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import parmed
>>> print parmed.__version__
3.1.0
>>> mol2 = parmed.load_file('./glutathione.sdf', structure=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ajp1/miniconda3/envs/abra/lib/python2.7/site-packages/parmed/formats/registry.py", line 197, in load_file
    return cls.parse(filename, *args, **kwargs)
  File "/home/ajp1/miniconda3/envs/abra/lib/python2.7/site-packages/parmed/formats/mol2.py", line 241, in parse
    res2.head = res2[idx2]
  File "/home/ajp1/miniconda3/envs/abra/lib/python2.7/site-packages/parmed/modeller/residue.py", line 364, in __getitem__
    return self.atoms[idx]
IndexError: list index out of range

Admittedly, there could be something funky with my mol2 file (pasted below), but would it be possible to bypass the ResidueTemplate building logic when using Structure?

cat glutathione.sdf
@<TRIPOS>MOLECULE
*****
 36 35 0 0 0
SMALL
GASTEIGER

@<TRIPOS>ATOM
      1 CB          1.4015   -0.7702    1.8372 C.3     1  CYS1        0.0040
      2 CG          0.5335    0.4165    1.3988 C.3     1  CYS1        0.0285
      3 CA         -4.1241   -0.3047   -1.5936 C.3     2  GLY2        0.0823
      4 CB         -1.3724   -1.9645   -1.5084 C.3     1  CYS1        0.0218
      5 CA          2.9011   -0.4666    1.9596 C.3     1  CYS1        0.0164
      6 CA         -1.1360   -0.5307   -1.0009 C.3     1  CYS1        0.1375
      7 CD         -0.9262    0.0388    1.3480 C.2     1  CYS1        0.1834
      8 C          -5.4772    0.3930   -1.6446 C.2     2  GLY2        0.0631
      9 C          -1.8715    0.4705   -1.9097 C.2     1  CYS1        0.2090
     10 C           3.4592    0.0068    0.6085 C.2     1  CYS1        0.0894
     11 N           3.1635    0.5309    3.0120 N.4     1  CYS1        0.2249
     12 N          -3.1196    0.5947   -2.1924 N.2     2  GLY2       -0.2470
     13 N          -1.6361   -0.3564    0.3609 N.2     1  CYS1       -0.2411
     14 OE1        -1.5422    0.1215    2.5420 O.2     1  CYS1       -0.4956
     15 O          -6.5177   -0.0748   -1.2026 O.co2   2  GLY2       -0.5479
     16 OXT        -5.4091    1.5933   -2.2730 O.co2   2  GLY2       -0.5479
     17 O          -1.0601    1.3686   -2.5138 O.2     1  CYS1       -0.4942
     18 O           3.5146    1.1826    0.2711 O.co2   1  CYS1       -0.5442
     19 OXT         3.8214   -0.9994   -0.2175 O.co2   1  CYS1       -0.5442
     20 SG         -0.6200   -3.2685   -0.4759 S.3     1  CYS1       -0.1757
     21 HB1         1.2791   -1.5542    1.1193 H       1  CYS1        0.0328
     22 HB2         1.0694   -1.0374    2.8186 H       1  CYS1        0.0328
     23 HG1         0.8450    0.7356    0.4262 H       1  CYS1        0.0361
     24 HG2         0.6538    1.2046    2.1124 H       1  CYS1        0.0361
     25 HA1        -4.1671   -1.2195   -2.1470 H       2  GLY2        0.0571
     26 HA2        -3.8621   -0.5299   -0.5810 H       2  GLY2        0.0571
     27 HB1        -0.9633   -2.0415   -2.4941 H       1  CYS1        0.0399
     28 HB2        -2.4296   -2.1264   -1.4794 H       1  CYS1        0.0399
     29 HA          3.4012   -1.3699    2.2406 H       1  CYS1        0.0915
     30 HA         -0.0805   -0.3553   -1.0138 H       1  CYS1        0.0651
     31 H1          2.6988    1.4071    2.7735 H       1  CYS1        0.2001
     32 H2          4.1684    0.6887    3.0880 H       1  CYS1        0.2001
     33 H3          2.8049    0.1875    3.9030 H       1  CYS1        0.2001
     34 HE1        -1.0651    0.4060    3.2966 H       1  CYS1        0.2934
     35 H          -0.1363    1.3439   -2.3588 H       1  CYS1        0.2935
     36 HG          0.7317   -3.1189   -0.4704 H       1  CYS1        0.1017
@<TRIPOS>BOND
     1     1     2    1
     2     1     5    1
     3     2     7    1
     4     3     8    1
     5     3    12    1
     6     4     6    1
     7     4    20    1
     8     5    10    1
     9     5    11    1
    10     6     9    1
    11     6    13    1
    12     7    13    2
    13     7    14    1
    14     8    15   ar
    15     8    16   ar
    16     9    12    2
    17     9    17    1
    18    10    18   ar
    19    10    19   ar
    20     1    21    1
    21     1    22    1
    22     2    23    1
    23     2    24    1
    24     3    25    1
    25     3    26    1
    26     4    27    1
    27     4    28    1
    28     5    29    1
    29     6    30    1
    30    11    31    1
    31    11    32    1
    32    11    33    1
    33    14    34    1
    34    17    35    1
    35    20    36    1
swails commented 5 years ago

Woah, this mol2 file is really messed up. The atoms from residue 1 (CYS) and residue 2 (GLY) are interleaved. ParmEd doesn't support this (and I'm almost certain the mol2 file doesn't permit this behavior in its specification).

Where did you find this file??

If you want a workaround, my suggestion is to change all of the residue names to XXXX or something. Otherwise your structure will have 11 residues regardless of what program reads it. That should also fix the error you're seeing.

That method could certainly use refactoring, but supporting parsing mol2 files as a Structure instead of residue templates was almost an afterthought -- they're mainly used to store topology and atom charge information for building residue template libraries.

austinjpaul commented 5 years ago

Hmmm... thanks for the help.

This mol2 file was generated from openbabel from an InChI string (InChI=1S/C10H17N3O6S/c11-5(10(18)19)1-2-7(14)13-6(4-20)9(17)12-3-8(15)16/h5-6,20H,1-4,11H2,(H,12,17)(H,13,14)(H,15,16)(H,18,19)/t5-,6-/m0/s1)

I was stuck using structure=True because of the repeated atom names. Telling openbabel not to output residue information not only changes all residue names to <1>, but gives the atoms distinct names.

Do you know where I might find the mol2 specification to see if this is indeed disallowed? I should perhaps go file an issue or two against obabel.

swails commented 5 years ago

You can find the mol2 file format specification from SYBYL here. It seems that this is in fact a legal mol2 file according to that spec (which doesn't really impose many constraints at all), but it imposes some pretty significant limitations on the format. For instance, every residue must have a different identifier (name), so you can't have two amino acids of the same type have the same name (e.g., 2 alanines) unless you're willing to have compliant parsers group them together into a single super-residue.

They really intended the mol2 file to be a small molecule format.

Can you convert that InChl string to another format ParmEd understands, like a PDB file, maybe?

The mol2 format is worse than the PDB format as far as compliance of the variations out there, and I place priority on supporting the files used by the MD engines supported by ParmEd.

Ruibin-Liu commented 1 year ago

I noted that the file extension is .sdf but the file content is of .mol2 format.

Ruibin-Liu commented 1 year ago

We can parse the file into a DataFrame, group the residues, and then write back to a mol2 file. For example, my python package pdbx2df can do that with the read_mol2 and write_mol2 functions.