Goodman-lab / DP5

Python workflow for DP5 and DP4 analysis of organic molecules

sdf2tinkerxyz on large molecules #44

Closed zjp closed 1 year ago

zjp commented 3 years ago

I've been encountering issues getting sdf2tinkerxyz (both the older Linux version and the newer Darwin version) to properly assign some hydrogens to their environment. I'm not sure what's causing it to be confused, but the result is that the entire pipeline stalls after Tinker crashes.

I've written some code to get around this by not adding those damaged files to the array of Tinker inputs, but I wonder whether you think this is a good solution. Are all inputs necessary?

[Screenshot: Screen Shot 2020-07-24 at 11 23 08 AM]

zjp commented 3 years ago
# tinker_will_fail is assigned to False at the top of the loop that generates the Tinker file list
outp = subprocess.check_output(convinp + inpf + '.sdf', stderr=subprocess.STDOUT, shell=True)
# sdf2tinkerxyz prints a warning when it meets an atom it cannot type
if 'Warning' in outp.decode('utf-8'):
    tinker_will_fail = True
    print("Could not prepare valid Tinker input for " + inpf + " as an unknown atom type was found. The generated structure may be impossible or this may be a bug in one or more upstream projects.")
    continue

This is what I put in place
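
A self-contained variant of the same idea might look like the sketch below. It assumes, as in the snippet above, that the converter prints the word "Warning" when it meets an atom it cannot type. The function name `split_convertible`, the `build_command` callback and the injectable `run` parameter are hypothetical and not part of DP5; unlike `check_output`, this version also treats a non-zero exit status as a skip instead of raising.

```python
import subprocess


def split_convertible(names, build_command, run=subprocess.run):
    """Return (ok, skipped) lists of input names.

    build_command(name) must return the argument list used to convert one
    structure; how sdf2tinkerxyz is actually invoked is left to the caller.
    """
    ok, skipped = [], []
    for name in names:
        # Capture stdout and stderr together, as the original snippet does.
        result = run(
            build_command(name),
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        # Assumption: 'Warning' in the output marks an unusable Tinker input.
        if result.returncode != 0 or "Warning" in result.stdout:
            skipped.append(name)
        else:
            ok.append(name)
    return ok, skipped
```

Returning the skipped names, rather than silently dropping them, keeps it visible which candidate structures never reached Tinker.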

zjp commented 3 years ago

Now I'm confused why

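from rdkit import Chem
from rdkit.Chem import AllChem

for index, inchi in enumerate(my_inchis):
    test = Chem.MolFromInchi(inchi, sanitize=True, removeHs=False)
    test2 = Chem.AddHs(test, addCoords=True)
    test3 = AllChem.EmbedMolecule(test2)  # returns a conformer ID, or -1 if embedding fails
    test4 = Chem.SDWriter('test' + str(index) + '.sdf')
    test4.write(test2)
    test4.close()  # flush the SDF to disk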

produces different results over the array

my_inchis = [
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39+,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39+,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39+,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39+,40-,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39-,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39-,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39-,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38-,39-,40-,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39+,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39+,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39+,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39+,40-,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39-,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39-,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39-,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39-,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37-,38+,39-,40-,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39+,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39+,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39+,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39+,40-,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39-,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39-,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39-,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38-,39-,40-,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39+,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39+,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39+,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39+,40-,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39-,40+,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39-,40+,46+/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39-,40-,46-/m0/s1/f/h47H,48H",
    "InChI=1S/C46H43N3O10/c1-4-24-47-41(51)36-38-44(54)59-39(30-17-9-6-10-18-30)37(29-15-7-5-8-16-29)49(38)40(31-19-11-12-21-35(31)58-26-25-50)46(36)33-27-28(22-23-34(33)48-45(46)55)14-13-20-32(42(52)56-2)43(53)57-3/h4-12,15-19,21-23,27,32,36-40,50H,1,20,24-26H2,2-3H3,(H,47,51)(H,48,55)/t36-,37+,38+,39-,40-,46+/m0/s1/f/h47H,48H"
]

than InchiGen.py
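
One quick sanity check before comparing against InchiGen.py is to confirm that the array itself lists each stereoisomer exactly once. The sketch below uses only string handling, no RDKit; `stereo_layer` and `find_duplicates` are hypothetical helper names introduced for illustration.

```python
from collections import defaultdict


def stereo_layer(inchi):
    """Return the first /t (tetrahedral stereo) layer of an InChI, or ''."""
    # Skip the version prefix and the formula layer, then look for 't...'.
    for layer in inchi.split("/")[2:]:
        if layer.startswith("t"):
            return layer[1:]
    return ""


def find_duplicates(inchis):
    """Map each InChI that occurs more than once to the indices where it occurs."""
    positions = defaultdict(list)
    for index, inchi in enumerate(inchis):
        positions[inchi].append(index)
    return {inchi: idx for inchi, idx in positions.items() if len(idx) > 1}
```

Applied to `my_inchis`, `find_duplicates` would show whether any entry appears twice, which is worth ruling out before attributing differences to RDKit or InchiGen.py.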

KristapsE commented 3 years ago

> I've been encountering issues getting sdf2tinkerxyz (both the older Linux version and the newer Darwin version) to properly assign some hydrogens to their environment. I'm not sure what's causing it to be confused, but the result is that the entire pipeline stalls after Tinker crashes.
>
> I've written some code to get around this by not adding those damaged files to the array of Tinker inputs, but I wonder whether you think this is a good solution. Are all inputs necessary?

What exactly do you mean that they are improperly assigned to their environment? That the hydrogen connectivity gets scrambled or that the numbering is different? In principle this shouldn't be a problem, as everything should work fine with either unassigned text NMR data, or just raw NMR data. However, some years ago I stopped using TINKER in everyday workflows because of its erratic behaviour with regard to the input coordinates.

It shouldn't be too difficult to see whether TINKER is still picky about the numbering order: using either OpenBabel or RDKit, permute the atom order and see if TINKER fails on some of the permutations. If that is still the issue, then TreeRenum.py is quite old code for renumbering molecules in a way that maximizes connectivity between consecutive atoms and minimizes "jumps", where consecutive atoms are not connected.
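
The permutation test and the idea of minimising "jumps" can be sketched without any chemistry toolkit by treating the molecule as a list of bonds between atom indices. This illustrates the concept only and is not the algorithm in TreeRenum.py; `count_jumps`, `renumber` and `dfs_order` are hypothetical names.

```python
import random


def count_jumps(bonds, n_atoms):
    """Number of consecutive atom pairs (i, i + 1) that are not bonded."""
    bonded = {frozenset(b) for b in bonds}
    return sum(frozenset((i, i + 1)) not in bonded for i in range(n_atoms - 1))


def renumber(bonds, order):
    """Relabel bonds so that old atom order[k] becomes new atom k."""
    new_index = {old: new for new, old in enumerate(order)}
    return [(new_index[a], new_index[b]) for a, b in bonds]


def dfs_order(bonds, n_atoms, start=0):
    """Depth-first atom order, which tends to keep consecutive atoms bonded."""
    neighbours = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    order, seen = [], set()
    # Try every atom as a root so disconnected fragments are also visited.
    for root in [start] + [i for i in range(n_atoms) if i != start]:
        stack = [root]
        while stack:
            atom = stack.pop()
            if atom in seen:
                continue
            seen.add(atom)
            order.append(atom)
            # Push neighbours in reverse so the lowest index is visited first.
            stack.extend(sorted(neighbours[atom], reverse=True))
    return order


# A five-atom chain 0-1-2-3-4 with its atom labels shuffled, then renumbered.
chain = [(0, 1), (1, 2), (2, 3), (3, 4)]
shuffled_order = list(range(5))
random.Random(0).shuffle(shuffled_order)
shuffled = renumber(chain, shuffled_order)
restored = renumber(shuffled, dfs_order(shuffled, 5))
```

To probe Tinker itself, each permuted numbering would be written back to an SDF (with RDKit or OpenBabel, as suggested above) and passed through sdf2tinkerxyz; a failure that appears only for some orderings would point at numbering sensitivity.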

> Are all inputs necessary?

What do you mean by this? Whether some files could be omitted probably depends on the structural question that you're trying to answer... If you are going for a large-scale workflow, then I'd say that having to randomly cherry-pick your inputs to appease possibly buggy MM software does not seem the way to go.

KristapsE commented 3 years ago

To be honest, having a binary utility in an otherwise pure Python workflow has never seemed ideal, but with the minimal amount of TINKER work we do, fixing it has never been high on the priority list.

Ideally, the utility could be ported to Python, which would much improve both portability and maintainability.

The structures that you are dealing with are quite large and complex. I wonder whether you have encountered the issue with something a bit more minimal?

KristapsE commented 3 years ago

> produces different results over the array

Could you perhaps share a more minimal example of what the output is from RDKit and what it is from InchiGen? Admittedly, InchiGen was written a long time ago and may be a bit hacky, but on the other hand some work in our group has also highlighted some deficiencies in the RDKit InChI code.

zjp commented 3 years ago

> What exactly do you mean that they are improperly assigned to their environment? That the hydrogen connectivity gets scrambled or that the numbering is different? In principle this shouldn't be a problem, as everything should work fine with either unassigned text NMR data, or just raw NMR data. However, some years ago I stopped using TINKER in everyday workflows because of its erratic behaviour with regard to the input coordinates.

At the time I meant that sdf2tinkerxyz sees a hydrogen it can't place, gives it an environment of 0 or unknown and no bonds, and then every hydrogen that comes after it is shifted back one place. However, I spent all day yesterday in a debugger looking at sdf2tinkerxyz's source code, and it looks like it does exactly what it says it will do for the inputs it gets.

I then noticed that some of the generated SDF files were damaged in that they didn't have the same number of bonds as the other, good generated files. That put me back to square one. I think the problem is either in InchiGen, RDKit, or OpenBabel depending on how early errors start.
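
A check of this kind can be automated by reading the counts line of each generated file. The sketch assumes V2000 molfiles, where the fourth line holds the atom count in columns 1-3 and the bond count in columns 4-6; `read_counts` and `flag_outliers` are hypothetical names. Since all diastereomers of one compound share the same connectivity, any file whose counts differ from the majority is suspect.

```python
from collections import Counter


def read_counts(molblock):
    """Return (n_atoms, n_bonds) from the counts line of a V2000 molblock."""
    counts_line = molblock.splitlines()[3]
    return int(counts_line[0:3]), int(counts_line[3:6])


def flag_outliers(molblocks):
    """Return names whose (atoms, bonds) counts differ from the most common pair.

    molblocks maps a name (for example a file name) to the molblock text.
    """
    counts = {name: read_counts(text) for name, text in molblocks.items()}
    if not counts:
        return []
    expected, _ = Counter(counts.values()).most_common(1)[0]
    return sorted(name for name, c in counts.items() if c != expected)
```

The mapping could be built by reading each generated `.sdf` file as text; only the first record in each file is inspected, which is sufficient when every file holds a single structure.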

If I can make the output from a little test script and the output from InchiGen consistent with one another then it's a strong indicator that the issue doesn't exist in this project but is actually in RDKit or OpenBabel.

> > Are all inputs necessary?
>
> What do you mean by this?
>
> Whether some files could be omitted probably depends on the structural question that you're trying to answer... If you are going for a large-scale workflow, then I'd say that having to randomly cherry-pick your inputs to appease possibly buggy MM software does not seem the way to go.

This is a sufficient answer, thank you. I agree.

> The structures that you are dealing with are quite large and complex, I wonder if you have encountered the issue with something a bit more minimal?

I haven't seen something like this with smaller molecules yet but I'd need to test more of them.

> Could you perhaps share a more minimal example of what the output is from RDKit and what it is from the InchiGen?

I'll come up with something.

> [O]n the other hand some work in our group has also highlighted some deficiencies in RDKit InChI code.

I guess in some sense if you're not breaking their code then you're not pushing it hard enough. :p

Jonathan-Goodman commented 1 year ago

I think this issue has been concluded. Please open a new issue if there is more information.