cheminfo / sdf-parser

Parse an SDF file and convert it to an array of objects
http://cheminfo.github.io/sdf-parser/
MIT License

This sdf cannot be parsed #1

Closed: targos closed this issue 7 years ago

targos commented 7 years ago

Example:

parse("warburganal Data in this file is licensed under the nmrshiftdb2 Data License, a copy of which can be found at http://nmrshiftdb.nmr.uni-koeln.de/nmrshiftdbhtml/nmrshiftdb2datalicense.txt\n  CDK\nnmrshiftdb2 234\n 18 19  0  0  0  0  0  0  0  0999 V2000\n   -6.4010    0.5490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -6.4010   -0.9510    0.0000C   0  0  0  0  0  0  0  0  0  0  0  0\n   -5.1019   -1.7010    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8029   -0.9510    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8029    0.5490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -5.1019    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.5039   -1.7010    0.0000 C0  0  0  0  0  0  0  0  0  0  0  0\n   -1.2048   -0.9510    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.2048    0.5490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.5039    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8029    2.0490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.5397    2.4481    0.0000 C0  0  0  0  0  0  0  0  0  0  0  0\n   -6.0661   -2.8500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.4680    2.4481    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.1378   -2.8500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0942    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.0527    3.8576    0.0000 O0  0  0  0  0  0  0  0  0  0  0  0\n    1.3932    0.5490    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  1  0  0  0  0 \n  1  6  1  0  0  0  0 \n  2  3  1  0  0  0  0 \n  3  4  1  0  0  0  0 \n  3 13  1  1  0  0  0 \n  3 15  1  6  0  0  0 \n  4  5  1  0  0  0  0 \n  4  7  1  0  0  0  0 \n  5  6  1  0  0  0  0 \n  5 10  1  0  0  0  0 \n  5 111  1  0  0  0 \n  7  8  1  0  0  0  0 \n  8  9  2  0  0  0  0 \n  9 10  1  0  0  0  0 \n  9 16  1  0  0  0  0 \n 10 12  1  1  0  0  0 \n 10 14  1  6  0  0  0 \n 12 17  2  0  0  0  0 \n 16 18  2  0  0  0  0 \nM  END\n> 
<Temperature [K]>\r\n0:298 \r\n\r\n> <nmrshiftdb2 ID>\r\n234\r\n\r\n> <Field Strength [MHz]>\r\n0:50 \r\n\r\n> <Spectrum 13C 0>\r\n17.6;0.0Q;10|18.3;0.0T;0|22.6;0.0Q;12|26.5;0.0T;6|31.7;0.0T;5|33.5;0.0S;2|33.5;0.0S;14|41.8;0.0T;1|42.0;0.0S;4|42.2;0.0D;3|78.34;0.0S;9|140.99;0.0S;8|158.3;0.0D;7|193.4;0.0D;15|203.0;0.0D;11|\r\n\r\n> <Solvent>\r\n0:Chloroform-D1 (CDCl3) \r\n\r\n$$$$\r\nSubergorgiol\n  CDK\nnmrshiftdb2 2151\n 16 18  0  0  0  0  0  0  0  0999 V2000\n   -2.4064    1.8581    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.0738    1.3732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.8189    0.5885    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.4064   -0.6810    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.7389   -0.1961    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.9938    0.5885    0.0000C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.9138    1.3732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.6589    0.5885    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0421    0.1536    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.1389    0.2599    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.7301   -0.4568    0.0000C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.0738   -0.1961    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8585   -0.4509    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.7389    1.3732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4840    2.1578    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0160   -0.6710    0.0000 O0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  1  0  0  0  0 \n  2  3  1  0  0  0  0 \n  3  6  1  0  0  0  0 \n  6 14  1  0  0  0  0 \n  1 14  1  0  0  0  0 \n  6 10  1  0  0  0  0 \n 10  8  1  0  0  0  0 \n  8  7  2  0  0  0  0 \n 14  7  1  0  0  0  0 \n  3 12  1  1  0  0  0 \n 12  4  1  0  0  0  0 \n  4  5  1  0  0  0  0 \n  6  5  1  1  0  0  0 \n  89  1  0  0  0  0 \n  9 16  1  0  0  0  0 \n 10 11  1  1  0  0  0 \n 12 13  1  6  0  0  0 
\n 14 15  1  1  0  0  0 \nM  END\n> <Temperature [K]>\r\n0:298 \r\n\r\n> <nmrshiftdb2 ID>\r\n2151\r\n\r\n> <Field Strength [MHz]>\r\n0:125 \r\n\r\n> <Spectrum 13C 0>\r\n17.7;0.0Q;10|20.0;0.0Q;12|22.9;0.0Q;14|28.9;0.0T;4|29.9;0.0T;1|35.8;0.0T;3|37.6;0.0T;0|39.7;0.0D;11|50.9;0.0D;9|57.3;0.0S;13|61.3;0.0T;8|64.1;0.0S;5|64.9;0.0D;2|134.0;0.0D;6|146.7;0.0S;7|\r\n\r\n> <Solvent>\r\n0:Chloroform-D1 (CDCl3) \r\n\r\n$$$$\r\n1,2,2,5,5-Pentamethyl-3-imidazoline 3-oxide\n  CDK\nnmrshiftdb2 2189\n 11 11  0  0  0  0  0  0  0  0999 V2000\n   -2.4127   -1.4040    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.2376   -1.39980.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.4965   -2.1831    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.8315   -2.6713    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.1618   -2.1898    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.3666   -1.9706    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4499   -2.59980.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.8331   -0.8125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.2164   -1.7665    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.0872   -2.7622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.8372   -3.4955    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  2  0  0  0  0 \n5  6  1  0  0  0  0 \n  5  7  1  0  0  0  0 \n  2  3  1  0  0  0  0 \n  1  8  1  0  0  0  0 \n  3  4  1  0  0  0  0 \n  3  9  1  0  0  0  0 \n  4  5  1  0  0  0  0 \n  3 10  1  0  0  0  0 \n  5  1  1  0  0  0  0 \n  4 11  1  0  0  0  0 \nM  CHG  1   1   1\nM  CHG  1   8  -1\nM  END\n> <Temperature [K]>\r\n0:298 \r\n\r\n> <nmrshiftdb2 ID>\r\n2189\r\n\r\n> <Field Strength [MHz]>\r\n0:50.328 \r\n\r\n> <Spectrum 13C 0>\r\n23.3;0.0Q;8|23.3;0.0Q;9|23.5;0.0Q;5|23.5;0.0Q;6|26.1;0.0Q;10|60.5;0.0S;2|90.0;0.0S;4|132.1;0.0D;1|\r\n\r\n> <Solvent>\r\n0:Acetone-D6 ((CD3)2CO) \r\n\r\n$$$$");

Returns:

{
  "time": 0,
  "molecules": [
    {
      "molfile": "warburganal Data in this file is licensed under the nmrshiftdb2 Data License, a copy of which can be found at http://nmrshiftdb.nmr.uni-koeln.de/nmrshiftdbhtml/nmrshiftdb2datalicense.txt\n  CDK\nnmrshiftdb2 234\n 18 19  0  0  0  0  0  0  0  0999 V2000\n   -6.4010    0.5490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -6.4010   -
0.9510    0.0000C   0  0  0  0  0  0  0  0  0  0  0  0\n   -5.1019   -1.7010    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8029   -0.9510    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8029    0.5490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -5.1019    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.5039   -1.
7010    0.0000 C0  0  0  0  0  0  0  0  0  0  0  0\n   -1.2048   -0.9510    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.2048    0.5490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.5039    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8029    2.0490    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.5397    2.4481
    0.0000 C0  0  0  0  0  0  0  0  0  0  0  0\n   -6.0661   -2.8500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.4680    2.4481    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.1378   -2.8500    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0942    1.2990    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.0527    3.8576
0.0000 O0  0  0  0  0  0  0  0  0  0  0  0\n    1.3932    0.5490    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  1  0  0  0  0 \n  1  6  1  0  0  0  0 \n  2  3  1  0  0  0  0 \n  3  4  1  0  0  0  0 \n  3 13  1  1  0  0  0 \n  3 15  1  6  0  0  0 \n  4  5  1  0  0  0  0 \n  4  7  1  0  0  0  0 \n  5  6  1  0  0  0  0 \n  5 10  1  0  0  0  0 \n
 5 111  1  0  0  0 \n  7  8  1  0  0  0  0 \n  8  9  2  0  0  0  0 \n  9 10  1  0  0  0  0 \n  9 16  1  0  0  0  0 \n 10 12  1  1  0  0  0 \n 10 14  1  6  0  0  0 \n 12 17  2  0  0  0  0 \n 16 18  2  0  0  0  0 \nM  END\n",
      "Temperature [K]": "0:298 \r\n0:298 \r\n0:298 \r",
      "nmrshiftdb2 ID": "234\r\n2151\r\n2189\r",
      "Field Strength [MHz]": "0:50 \r\n0:125 \r\n0:50.328 \r",
      "Spectrum 13C 0": "17.6;0.0Q;10|18.3;0.0T;0|22.6;0.0Q;12|26.5;0.0T;6|31.7;0.0T;5|33.5;0.0S;2|33.5;0.0S;14|41.8;0.0T;1|42.0;0.0S;4|42.2;0.0D;3|78.34;0.0S;9|140.99;0.0S;8|158.3;0.0D;7|193.4;0.0D;15|203.0;0.0D;11|\r\n17.7;0.0Q;10|20.0;0.0Q;12|22.9;0.0Q;14|28.9;0.0T;4|29.9;0.0T;1|35.8;0.0T;3|37.6;0.0T;0|39.7;0.0D;11|50.9;0.0D;9|57.3;0.0S;13|61.3;0.0T;8
|64.1;0.0S;5|64.9;0.0D;2|134.0;0.0D;6|146.7;0.0S;7|\r\n23.3;0.0Q;8|23.3;0.0Q;9|23.5;0.0Q;5|23.5;0.0Q;6|26.1;0.0Q;10|60.5;0.0S;2|90.0;0.0S;4|132.1;0.0D;1|\r",
      "Solvent": "0:Chloroform-D1 (CDCl3) \r\n\r\n$$$$\r\nSubergorgiol\n  CDK\nnmrshiftdb2 2151\n 16 18  0  0  0  0  0  0  0  0999 V2000\n   -2.4064    1.8581    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.0738    1.3732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.8189    0.5885    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2
.4064   -0.6810    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.7389   -0.1961    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.9938    0.5885    0.0000C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.9138    1.3732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.6589    0.5885    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0
421    0.1536    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.1389    0.2599    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -0.7301   -0.4568    0.0000C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.0738   -0.1961    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.8585   -0.4509    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.738
9    1.3732    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4840    2.1578    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n    0.0160   -0.6710    0.0000 O0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  1  0  0  0  0 \n  2  3  1  0  0  0  0 \n  3  6  1  0  0  0  0 \n  6 14  1  0  0  0  0 \n  1 14  1  0  0  0  0 \n  6 10  1  0  0  0  0 \n 10  8  1
0  0  0  0 \n  8  7  2  0  0  0  0 \n 14  7  1  0  0  0  0 \n  3 12  1  1  0  0  0 \n 12  4  1  0  0  0  0 \n  4  5  1  0  0  0  0 \n  6  5  1  1  0  0  0 \n  89  1  0  0  0  0 \n  9 16  1  0  0  0  0 \n 10 11  1  1  0  0  0 \n 12 13  1  6  0  0  0 \n 14 15  1  1  0  0  0 \n0:Chloroform-D1 (CDCl3) \r\n\r\n$$$$\r\n1,2,2,5,5-Pentamethyl-3-imidazoline 3-oxi
de\n  CDK\nnmrshiftdb2 2189\n 11 11  0  0  0  0  0  0  0  0999 V2000\n   -2.4127   -1.4040    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.2376   -1.39980.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -3.4965   -2.1831    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.8315   -2.6713    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.
1618   -2.1898    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.3666   -1.9706    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.4499   -2.59980.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -1.8331   -0.8125    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.2164   -1.7665    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -4.0872
  -2.7622    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n   -2.8372   -3.4955    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n  1  2  2  0  0  0  0 \n5  6  1  0  0  0  0 \n  5  7  1  0  0  0  0 \n  2  3  1  0  0  0  0 \n  1  8  1  0  0  0  0 \n  3  4  1  0  0  0  0 \n  3  9  1  0  0  0  0 \n  4  5  1  0  0  0  0 \n  3 10  1  0  0  0  0 \n  5  1  1
0  0  0  0 \n  4 11  1  0  0  0  0 \nM  CHG  1   1   1\nM  CHG  1   8  -1\n0:Acetone-D6 ((CD3)2CO) \r\n\r"
    }
  ],
  "labels": [
    "Temperature [K]",
    "nmrshiftdb2 ID",
    "Field Strength [MHz]",
    "Spectrum 13C 0",
    "Solvent"
  ],
  "statistics": [
    //   ...
  ]
}

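Only one molfile is detected, although the input contains three records. The second and third records end up inside the "Solvent" field of the first molecule, and the other fields concatenate the values from all three records. There seems to be an issue with line terminator handling: the input mixes \n and \r\n.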
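One way to check the line terminator suspicion is to count each terminator style in the input. The snippet below is a minimal sketch: `countLineTerminators` is a hypothetical helper (it is not part of sdf-parser), and `sample` is a shortened stand-in for the string in the example.

```javascript
// Hypothetical diagnostic helper, not part of sdf-parser:
// count how many line terminators of each style a string contains.
function countLineTerminators(text) {
  const crlf = (text.match(/\r\n/g) || []).length;
  const totalLf = (text.match(/\n/g) || []).length;
  const totalCr = (text.match(/\r/g) || []).length;
  return {
    crlf, // Windows-style "\r\n"
    lf: totalLf - crlf, // bare "\n"
    cr: totalCr - crlf, // bare "\r"
  };
}

// Shortened stand-in for the SDF above: the molfile block ends with "\n",
// while the data fields and the "$$$$" delimiter end with "\r\n".
const sample =
  'M  END\n> <Solvent>\r\n0:Chloroform-D1 (CDCl3) \r\n\r\n$$$$\r\n';

const counts = countLineTerminators(sample);
// The input is "mixed" when more than one style is present.
const isMixed =
  [counts.crlf, counts.lf, counts.cr].filter((n) => n > 0).length > 1;

console.log(counts, isMixed);
```

If more than one count is non-zero, the file uses mixed line terminators, which would be consistent with the output shown above.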

@lpatiny

lpatiny commented 7 years ago

As far as I know, this file is not formatted correctly: sometimes it uses \n at the end of the line and sometimes \r\n. The library expects the EOL (end-of-line) delimiter to be the same throughout the file. In this specific case I would replace the \r\n with \n before parsing: sdf = sdf.replace(/\r/g, '');
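The workaround can be sketched as a small pre-processing step. `normalizeEOL` is a hypothetical name, and the regular expression is a slightly more conservative variant of the one-liner in the comment: it turns both "\r\n" and a lone "\r" into "\n" rather than deleting every "\r". Splitting on "$$$$" is only an illustration of record counting and is not claimed to be how the library works internally.

```javascript
// Hypothetical helper: normalize every line terminator to "\n"
// before handing the text to the parser.
function normalizeEOL(sdf) {
  return sdf.replace(/\r\n?/g, '\n');
}

// Two shortened records with mixed terminators, mimicking the file above.
const mixed =
  'mol A\nM  END\n> <Solvent>\r\n0:CDCl3\r\n\r\n$$$$\r\n' +
  'mol B\nM  END\n> <Solvent>\r\n0:Acetone\r\n\r\n$$$$\r\n';

const normalized = normalizeEOL(mixed);

// Illustration only: with a single EOL style, splitting on the
// "$$$$" delimiter line separates the records cleanly.
const records = normalized
  .split('$$$$\n')
  .filter((record) => record.trim() !== '');

console.log(normalized.includes('\r')); // false
console.log(records.length); // 2
```

The normalized string would then be passed to the parser in place of the original one.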

lpatiny commented 7 years ago

Added a 'mixedEOL' option to the parser to deal with this specific case: 01451f54ee2ffd5a6c7f744aefa182e4be7fc9ae