Closed flange-ipb closed 1 month ago
Sorry, I'm just a newbie here: is the error triggerable with one of the sample binaries, or are you using the API?
@giallu I can replicate the finding with the 64-bit executable for Linux on an instance of Debian 13/trixie. Could implicit hydrogen atoms be an issue here? This is because I observe
For easier replication, I attach the log file. The test did not use InChI 1.07.1 as packaged in Debian, but used the executable fetched from the project's page.
This applies to both the binaries and the API.
Binaries:
inchi-1 issue52_input.mol.txt
with the Molfile issue52_input.mol.txt
API:
You can reproduce this in the InChI Web Demo by pasting the Molfile into the Convert Molfile to InChI tab. This conversion calls MakeINCHIFromMolfileText here.
@flange-ipb A workaround is to use the older v2000 standard instead (.mol and .sdf differ only in that the latter may contain multiple structure models separated by a line of $$$$, while .mol cannot). The syntax of v2000 and v3000 differs a little, because the v3000 dialect extended the support of stereochemistry (e.g., logical OR and logical AND to describe, with one model of mandelic acid, either both enantiomers or a racemate), see here.
If i) your workflow can tolerate dropping the extended stereochemistry of .mol/.sdf (v3000) and ii) your mol files contain no more than 999 atoms or bonds, you can convert the file into the older v2000 syntax, e.g., with openbabel
obabel input_v3000.mol -O output.mol
The reverse operation would require an explicit modifier to provide the v3000 dialect (documentation), i.e.
obabel input_v2000.mol -O output_v3000.sdf -x3
If you are not allowed to install additional software on the computer used, cheminfo.org has set up an online instance of openbabel, too.
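Before converting, the size condition above can be checked directly from the V3000 counts line. The following is only a minimal sketch: the function name `fits_v2000_limits` is hypothetical, and the default limit of 999 assumes the three-character atom and bond count fields of the v2000 counts line.

```python
def fits_v2000_limits(molfile_text: str, limit: int = 999) -> bool:
    """Return True if the atom and bond counts fit the assumed v2000 limit.

    Reads the first 'M  V30 COUNTS <atoms> <bonds> ...' line it finds.
    Whitespace is split loosely, so collapsed spaces are tolerated.
    """
    for line in molfile_text.splitlines():
        parts = line.split()
        if parts[:3] == ["M", "V30", "COUNTS"] and len(parts) >= 5:
            n_atoms, n_bonds = int(parts[3]), int(parts[4])
            return n_atoms <= limit and n_bonds <= limit
    raise ValueError("no V3000 COUNTS line found")
```

Note that this check says nothing about extended stereochemistry, which would still be lost in the conversion.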
Hi @nbehrnd, I think those arguments go in the wrong direction.
The reason for this error is simply the presence of an empty bond block, which InChI's v3000 Molfile parser complains about. If the empty bond block
M V30 BEGIN BOND
M V30 END BOND
is removed, then the conversion works.
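Until a fixed InChI release is available, the removal described above could be done as a preprocessing step. This is a rough sketch, not part of InChI or any toolkit; the helper name `strip_empty_bond_blocks` is made up for illustration, and it only handles a `BEGIN BOND` line immediately followed by `END BOND`.

```python
def strip_empty_bond_blocks(molfile_text: str) -> str:
    """Drop every 'BEGIN BOND' line that is directly followed by 'END BOND'.

    Lines are compared token-wise, so 'M  V30' and 'M V30' both match.
    Non-empty bond blocks are left untouched.
    """
    begin = ["M", "V30", "BEGIN", "BOND"]
    end = ["M", "V30", "END", "BOND"]
    lines = molfile_text.splitlines()
    kept = []
    i = 0
    while i < len(lines):
        is_empty_block = (
            lines[i].split() == begin
            and i + 1 < len(lines)
            and lines[i + 1].split() == end
        )
        if is_empty_block:
            i += 2  # skip both marker lines of the empty block
            continue
        kept.append(lines[i])
        i += 1
    return "\n".join(kept) + "\n"
```

The cleaned text can then be written to a file and passed to `inchi-1` as before.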
Implicit hydrogens: A counterexample would be to replace the carbon atom in my example Molfile with helium, which isn't supposed to receive implicit hydrogens (this results in the same error). If we have a Molfile of methane with explicit hydrogens, then the bond block is no longer empty.
Parsing v3000 Molfiles: I think we all agree that InChI should support v3000 Molfiles, and we have to ask ourselves how strictly to interpret the specification you already mentioned. I have the impression that most chemistry software on the market interprets it in a lax way, i.e. a slightly wrong input or output like my v3000 Molfile is still an acceptable deviation (Gerd will test it with BIOVIA's software suite ... excitement guaranteed :smile:). We can play the same game for any other Molfile reader and writer.
Why is my Molfile invalid according to the specification?
M V30 index type atom1 atom2 -
not being enclosed by square brackets), thus there cannot be a bond block at all.
@flange-ipb Could writing the empty bond block be a bug in Ketcher's processing? I ask because sketching a helium atom on InChI's demo page processes this input successfully and yields the InChI string, InChIKey, etc.
Marvin, as another implementation (demo page), can write a single carbon atom as
Mrv2311 09192411202D
0 0 0 0 0 999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 1 0 0 0 0
M V30 BEGIN ATOM
M V30 1 C -6.5 3.1667 0 0
M V30 END ATOM
M V30 END CTAB
M END
Or, to exclude implicit hydrogen atoms, a helium atom:
Mrv2311 09192409232D
0 0 0 0 0 999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 1 0 0 0 0
M V30 BEGIN ATOM
M V30 1 He -0.875 1.6667 0 0
M V30 END ATOM
M V30 END CTAB
M END
Both are processed smoothly by the Linux 64-bit InChI executable. These observations support your argument that the sketchers around are not equally well prepared for single-atom structures expressed in mol/sdf (v3000 dialect), or that there is no consensus (yet).
A somewhat surprising additional finding around the "single atom of He" was openbabel (version 3.1.1 in Debian 13/trixie), which writes the empty bond block into the sdf but is nevertheless able to assign an InChI (I don't know if InChI 1.03 or a more recent version is used):
$ obabel -:"[He]" -O obabel_Helium.sdf -x3
==============================
*** Open Babel Warning in WriteMolecule
No 2D or 3D coordinates exist. Stereochemical information will be stored using an Open Babel extension. To generate 2D or 3D coordinates instead use --gen2D or --gen3D.
1 molecule converted
$ cat obabel_Helium.sdf
OpenBabel09192411482D
0 0 0 0 0 999 V3000
M V30 BEGIN CTAB
M V30 COUNTS 1 0 0 0 0
M V30 BEGIN ATOM
M V30 1 He 0 0 0 0
M V30 END ATOM
M V30 BEGIN BOND
M V30 END BOND
M V30 END CTAB
M END
$$$$
$
$ ./inchi-1 ./obabel_Helium.sdf
InChI version 1, Software v. 1.07 (inchi-1 executable)
Linux 64-bit Build (gcc 11.4.0) of Aug 10 2024 18:58:34
Opened log file './obabel_Helium.sdf.log'
Opened input file './obabel_Helium.sdf'
Opened output file './obabel_Helium.sdf.txt'
Opened problem file './obabel_Helium.sdf.prb'
The command line used:
"./inchi-1 ./obabel_Helium.sdf"
Generating standard InChI
Input format: MOLfile
Output format: Plain text
Full Aux. info
Timeout per structure: 60000 msec
Up to 1024 atoms per structure
Error 71 (no InChI; Error: No V3000 CTAB end marker) inp structure #1.
End of file detected after structure #1.
Finished processing 1 structure: 1 error, processing time 0:00:00.00
$ cat obabel_Helium.sdf.txt
* Input_File: "./obabel_Helium.sdf"
$
$ obabel obabel_Helium.sdf -oinchi
InChI=1S/He
1 molecule converted
The BIOVIA suite you mention may have been released more recently than the Biovia Draw 2024 (version 24.1.0.1870) at my disposal; in the latter, I have not yet found an optional export to .mol (v3000) in addition to its default export to .mol (v2000). ACD's ChemSketch defaults to the older syntax, too.
Nice observation! So OpenBabel's Molfile writer is also "lax" on this.
I guess the reason why InChI conversion still works there is that OpenBabel doesn't go through InChI's Molfile parser at all. It constructs an inchi_Input struct from its internal chemical representation, an OBMol object. This can be seen in InChIFormat::WriteMolecule. The reason for using an intermediate representation in OpenBabel is scalability - you don't want to write a converter between each of the >110 chemical file formats it supports.
You can observe the same strategy in other cheminformatics frameworks such as rdkit (ROMol to inchi_Input) or CDK (IAtomContainer to InchiInput, which ends up in a data structure from InChI's Extensible API (IXA)).
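The scalability argument for an intermediate representation can be made concrete with a small count. The sketch below is plain arithmetic, not OpenBabel code, and it takes 110 formats as a lower bound from the ">110" figure above: direct conversion needs one converter per ordered pair of formats, while a hub needs one reader and one writer per format.

```python
def pairwise_converters(n_formats: int) -> int:
    """Converters needed if every format converts directly to every other."""
    return n_formats * (n_formats - 1)


def hub_converters(n_formats: int) -> int:
    """Readers plus writers needed with one shared in-memory representation."""
    return 2 * n_formats


n = 110  # lower bound for the number of supported formats
print(pairwise_converters(n))  # 11990
print(hub_converters(n))       # 220
```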
Very interesting points! From the implementation point of view, I just wanted to mention that the best approach to maximize interoperability between tools is usually to be strict on output and tolerant on input.
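As an illustration of "strict on output and tolerant on input", here is a toy reader and writer for the CTAB part of a V3000 Molfile. It is only a sketch under assumptions: the function names are hypothetical, continuation lines ending in `-` are not handled, and "strict" is taken to mean, as argued earlier in this thread, that a writer should omit the bond block when there are no bonds.

```python
def read_v3000_ctab(text):
    """Tolerant reader: accepts a missing bond block and an empty one alike."""
    atoms, bonds, section = [], [], None
    for line in text.splitlines():
        parts = line.split()
        if parts[:2] != ["M", "V30"]:
            continue  # header lines, 'M  END', '$$$$' etc. are ignored here
        body = parts[2:]
        if len(body) == 2 and body[0] == "BEGIN" and body[1] in ("ATOM", "BOND"):
            section = body[1]
        elif body[:1] == ["END"]:
            section = None
        elif section == "ATOM":
            atoms.append(body)
        elif section == "BOND":
            bonds.append(body)
    return atoms, bonds


def write_v3000_ctab(atoms, bonds):
    """Strict writer: emits a bond block only if there is at least one bond."""
    lines = [
        "M  V30 BEGIN CTAB",
        f"M  V30 COUNTS {len(atoms)} {len(bonds)} 0 0 0",
        "M  V30 BEGIN ATOM",
    ]
    lines += ["M  V30 " + " ".join(atom) for atom in atoms]
    lines.append("M  V30 END ATOM")
    if bonds:
        lines.append("M  V30 BEGIN BOND")
        lines += ["M  V30 " + " ".join(bond) for bond in bonds]
        lines.append("M  V30 END BOND")
    lines.append("M  V30 END CTAB")
    return "\n".join(lines) + "\n"
```

Reading the lax single-atom input and writing it back yields a CTAB without any bond block, so a stricter downstream parser never sees the empty block.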
@giallu The sum formula is the first layer in the InChI string. With
[let's strive to implement the assignment of InChI to be] strict on output and tolerant on input
would the following interpretation
if the InChI algorithm identifies only one atom, a bond block (if present) will be skipped
be correct for future reference implementations of InChI?
Yes, the issue was due to the empty bond block: a specific pointer was not getting updated. The issue has now been fixed in InChI and you will be able to test it in the next release. I will close the issue for now.
@nnuk, could you link the closing commit (once it's pushed)?
Also, once the fix is pushed we can remove the "expected to fail" decorator on the test:
When there are no bonds in the structure, some structure editors (Ketcher via the Indigo framework in the backend) serialize V3000 Molfiles with an empty bond block:
This causes an error
Error 71 (no InChI; Error: No V3000 CTAB end marker) inp
in InChI. Other structure editors (Marvin JS) simply omit the bond block, but they happily accept this Molfile as input.