IUPAC-InChI / InChI

Main InChI repository
https://iupac-inchi.github.io/InChI-Web-Demo/
MIT License

V3000 Molfile with empty bond block gives error #52

Closed flange-ipb closed 1 month ago

flange-ipb commented 2 months ago

When there are no bonds in the structure, some structure editors (e.g. Ketcher, via the Indigo framework in the backend) serialize V3000 Molfiles with an empty bond block:


  -INDIGO-08292417452D

  0  0  0  0  0  0  0  0  0  0  0 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1 0 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 9.35 -4.8 0.0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 END BOND
M  V30 END CTAB
M  END

In InChI, this input fails with Error 71 (no InChI; Error: No V3000 CTAB end marker). Other structure editors (Marvin JS) simply omit the bond block when writing, but they happily accept this Molfile as input.

giallu commented 2 months ago

Sorry, I'm just a newbie here: can the error be triggered with one of the sample binaries, or are you using the API?

nbehrnd commented 2 months ago

@giallu I can replicate the finding with the 64 bit executable for Linux in an instance of Linux Debian 13/trixie. Could implicit hydrogen atoms be an issue here? This is because I observe

For easier replication, I attach the log file. The test did not use InChI 1.07.1 as packaged in Debian, but used the executable fetched from the project's page.

test.log

flange-ipb commented 2 months ago

This applies to both the binaries and the API.

Binaries:

API: You can reproduce this in the InChI Web Demo by pasting the Molfile into the Convert Molfile to InChI tab. This conversion calls MakeINCHIFromMolfileText here.

nbehrnd commented 2 months ago

@flange-ipb A workaround is to use the older V2000 standard instead (.mol and .sdf differ only in that the latter may contain multiple structure models separated by a line of $$$$, while .mol cannot). The syntax of V2000 and V3000 differs a little, because the V3000 dialect extended the support of stereochemistry (e.g., logical OR and logical AND to describe, with one model of mandelic acid, either both enantiomers or a racemate), see here.

If i) your workflow can tolerate dropping the extended stereochemistry of .mol/.sdf (V3000) and ii) your mol files contain no more than 999 atoms or bonds, you can convert the file into the older V2000 syntax, e.g., with openbabel

obabel input_v3000.mol -O output.mol

The reverse operation would require an explicit modifier to provide the v3000 dialect (documentation), i.e.

obabel input_v2000.mol -O output_v3000.sdf -x3

If you are not allowed to install additional software on the computer used, cheminfo.org has set up an online instance of openbabel, too.
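The size condition for the V2000 fallback can be checked before converting. The following is a minimal sketch, not part of Open Babel or InChI: the helper names `v3000_counts` and `fits_v2000` are hypothetical, and the limit of 999 is assumed to follow from the three-character count fields of the V2000 counts line. It reads only the `M  V30 COUNTS` line and does not detect extended stereochemistry.

```python
V2000_MAX = 999  # assumption: V2000 stores atom and bond counts in three-character fields


def v3000_counts(molfile_text: str) -> tuple:
    """Return (atom_count, bond_count) from the first 'M  V30 COUNTS' line."""
    for line in molfile_text.splitlines():
        if line.startswith("M  V30 COUNTS"):
            fields = line.split()
            # fields: ['M', 'V30', 'COUNTS', <atoms>, <bonds>, ...]
            return int(fields[3]), int(fields[4])
    raise ValueError("no V3000 COUNTS line found")


def fits_v2000(molfile_text: str) -> bool:
    """True if both counts are small enough for a V2000 counts line."""
    n_atoms, n_bonds = v3000_counts(molfile_text)
    return n_atoms <= V2000_MAX and n_bonds <= V2000_MAX


if __name__ == "__main__":
    print(fits_v2000("M  V30 COUNTS 1 0 0 0 0\n"))
```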

flange-ipb commented 2 months ago

Hi @nbehrnd, I think those arguments go in the wrong direction.

The error is caused simply by the empty bond block, which InChI's V3000 Molfile parser rejects. If the empty bond block

M  V30 BEGIN BOND
M  V30 END BOND

is removed, then the conversion works.
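Until a fixed release is available, the same removal can be done as a preprocessing step before the Molfile is handed to InChI. This is a minimal sketch under the assumption that the two marker lines are adjacent, as in the Ketcher output above; `strip_empty_bond_block` is a hypothetical helper, not part of any InChI or Indigo API.

```python
import re

# An empty V3000 bond block: BEGIN BOND immediately followed by END BOND.
_EMPTY_BOND_BLOCK = re.compile(
    r"^M  V30 BEGIN BOND[ \t]*\r?\nM  V30 END BOND[ \t]*\r?\n",
    re.MULTILINE,
)

# The Molfile from the issue report (header line 1 is the empty molecule name).
KETCHER_MOLFILE = """
  -INDIGO-08292417452D

  0  0  0  0  0  0  0  0  0  0  0 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1 0 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 9.35 -4.8 0.0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 END BOND
M  V30 END CTAB
M  END
"""


def strip_empty_bond_block(molfile_text: str) -> str:
    """Drop BEGIN BOND / END BOND pairs that enclose no bond lines."""
    return _EMPTY_BOND_BLOCK.sub("", molfile_text)


if __name__ == "__main__":
    print(strip_empty_bond_block(KETCHER_MOLFILE))
```

Bond blocks that contain at least one bond line are left untouched, so the function can be applied to every file in a batch.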

Implicit hydrogens: A counterexample would be to replace the carbon atom in my example Molfile with helium, which isn't supposed to receive implicit hydrogens (this results in the same error). If we have a Molfile of methane with explicit hydrogens, then the bond block is no longer empty.

Parsing V3000 Molfiles: I think we all agree that InChI should support V3000 Molfiles, and we have to ask ourselves how strictly to interpret the specification you already mentioned. I have the impression that most chemistry software on the market interprets it in a lax way, i.e. a slightly wrong input or output like my V3000 Molfile is still an acceptable deviation (Gerd will test it with BIOVIA's software suite ... excitement guaranteed :smile:). We can play the same game for any other Molfile reader and writer.

Why is my Molfile invalid according to the specification?

nbehrnd commented 2 months ago

@flange-ipb Could writing the empty bond block be a bug in Ketcher's processing? I ask because sketching a helium atom on InChI's demo page successfully yields the InChI string, InChIKey, etc.

2024-09-19_InChI_test_page

Marvin, as another implementation (demo page), can write a single carbon atom as


  Mrv2311 09192411202D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1 0 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C -6.5 3.1667 0 0
M  V30 END ATOM
M  V30 END CTAB
M  END

Or, to exclude implicit hydrogen atoms, a helium atom:


  Mrv2311 09192409232D          

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1 0 0 0 0
M  V30 BEGIN ATOM
M  V30 1 He -0.875 1.6667 0 0
M  V30 END ATOM
M  V30 END CTAB
M  END

Both are processed smoothly by the Linux 64-bit InChI executable. These observations support your argument that the available sketchers are not equally well prepared for single-atom structures expressed in mol/sdf (V3000 dialect), or that there is no consensus (yet).

A somewhat surprising additional finding about a single He atom was Open Babel (version 3.1.1 in Debian 13/trixie), which writes the empty bond block into the sdf but is nevertheless able to assign an InChI (I don't know whether InChI 1.03 or a more recent version is used):

$ obabel -:"[He]" -O obabel_Helium.sdf -x3
==============================
*** Open Babel Warning  in WriteMolecule
  No 2D or 3D coordinates exist. Stereochemical information will be stored using an Open Babel extension. To generate 2D or 3D coordinates instead use --gen2D or --gen3D.
1 molecule converted
$ cat obabel_Helium.sdf

 OpenBabel09192411482D

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 1 0 0 0 0
M  V30 BEGIN ATOM
M  V30 1 He 0 0 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$
$ 
$ ./inchi-1 ./obabel_Helium.sdf
InChI version 1, Software v. 1.07 (inchi-1 executable) 
Linux 64-bit Build (gcc 11.4.0) of Aug 10 2024 18:58:34

Opened log file './obabel_Helium.sdf.log'
Opened input file './obabel_Helium.sdf'
Opened output file './obabel_Helium.sdf.txt'
Opened problem file './obabel_Helium.sdf.prb'
The command line used:
"./inchi-1 ./obabel_Helium.sdf"
Generating standard InChI
Input format: MOLfile
Output format: Plain text
Full Aux. info
Timeout per structure: 60000 msec
Up to 1024 atoms per structure

Error 71 (no InChI; Error: No V3000 CTAB end marker) inp structure #1.
End of file detected after structure #1.   
Finished processing 1 structure: 1 error, processing time 0:00:00.00
$ cat obabel_Helium.sdf.txt 
* Input_File: "./obabel_Helium.sdf"
$ 
$ obabel obabel_Helium.sdf -oinchi
InChI=1S/He
1 molecule converted

The Biovia suite you mention may be a more recent release than the Biovia Draw 2024 (version 24.1.0.1870) at my disposal; in the latter, I have not yet found an optional export to .mol (V3000) in addition to its default export to .mol (V2000). ACD's ChemSketch defaults to the older syntax, too.

flange-ipb commented 2 months ago

Nice observation! So OpenBabel's Molfile writer is also "lax" on this.

I guess the reason why InChI conversion still works there is that OpenBabel doesn't go through InChI's Molfile parser at all. It constructs an inchi_Input struct from its internal chemical representation, an OBMol object. This can be seen in InChIFormat::WriteMolecule. The reason for using an intermediate representation in OpenBabel is scalability: you don't want to write a dedicated converter between each pair of the >110 chemical file formats it supports.

You can observe the same strategy in other cheminformatics frameworks such as rdkit (ROMol to inchi_Input) or CDK (IAtomContainer to InchiInput, which ends up in a data structure from InChI's Extensible API (IXA)).
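The scalability argument can be made concrete with a little arithmetic. The sketch below is only an illustration: it assumes every format is both readable and writable (which is an upper bound, not a claim about any particular toolkit), and the function names are hypothetical.

```python
def direct_converters(n_formats: int) -> int:
    """One dedicated converter for every ordered (source, target) pair."""
    return n_formats * (n_formats - 1)


def hub_components(n_formats: int) -> int:
    """One reader plus one writer per format, all sharing one internal model."""
    return 2 * n_formats


if __name__ == "__main__":
    n = 110
    print(direct_converters(n))  # 11990
    print(hub_components(n))     # 220
```

With an intermediate representation, adding one more format costs one reader and one writer instead of a converter to and from every existing format.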

giallu commented 2 months ago

Very interesting points! From the implementation point of view, I just wanted to mention that the best approach to maximize interoperability between tools is usually to be strict on output and tolerant on input.

nbehrnd commented 2 months ago

@giallu The sum formula is the first layer in the InChI string. With

[let's strive to implement the assignment of InChI to be] strict on output and tolerant on input

would the following interpretation

if the InChI algorithm identifies only one atom, a bond block (if present) will be skipped

be correct for future reference implementations of InChI?
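One way to picture "tolerant on input" is a reader that treats a missing bond block and an empty bond block identically, whatever the number of atoms. The following is a minimal sketch of that idea only. It is not InChI's parser, the function names are hypothetical, and continuation lines, nested blocks and validation against the COUNTS line are deliberately ignored.

```python
PREFIX = "M  V30 "


def read_v3000_blocks(molfile_text: str) -> dict:
    """Collect the lines of each 'BEGIN X' ... 'END X' block inside the CTAB."""
    blocks = {}
    current = None  # name of the block currently being read, if any
    for line in molfile_text.splitlines():
        if not line.startswith(PREFIX):
            continue  # header lines and 'M  END'
        body = line[len(PREFIX):].strip()
        if body.startswith("BEGIN "):
            name = body[len("BEGIN "):]
            if name != "CTAB":
                current = name
                blocks.setdefault(current, [])  # an empty block stays an empty list
        elif body.startswith("END "):
            current = None
        elif current is not None:
            blocks[current].append(body)
    return blocks


def atoms_and_bonds(molfile_text: str) -> tuple:
    """Return (atom_lines, bond_lines); a missing bond block yields no bonds."""
    blocks = read_v3000_blocks(molfile_text)
    return blocks.get("ATOM", []), blocks.get("BOND", [])


if __name__ == "__main__":
    sample = (
        "M  V30 BEGIN CTAB\nM  V30 COUNTS 1 0 0 0 0\n"
        "M  V30 BEGIN ATOM\nM  V30 1 He 0 0 0 0\nM  V30 END ATOM\n"
        "M  V30 BEGIN BOND\nM  V30 END BOND\nM  V30 END CTAB\nM  END\n"
    )
    print(atoms_and_bonds(sample))
```

Under this reading, the Marvin output (no bond block) and the Ketcher or Open Babel output (empty bond block) describe the same structure.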

JanCBrammer commented 2 months ago

Replicated with https://github.com/IUPAC-InChI/InChI/blob/290f5478c0867403dd0d79402892773efee66ce6/INCHI-1-TEST/tests/test_executable/test_github_52.py.

nnuk commented 1 month ago

Yes, the issue was caused by the empty bond block: a specific pointer was not being updated. This has now been fixed in InChI, and you will be able to test it in the next release. I will close the issue for now.

JanCBrammer commented 1 month ago

@nnuk, could you link the closing commit (once it's pushed)?

JanCBrammer commented 1 month ago

Also, once the fix is pushed we can remove the "expected to fail" decorator on the test:

https://github.com/IUPAC-InChI/InChI/blob/290f5478c0867403dd0d79402892773efee66ce6/INCHI-1-TEST/tests/test_executable/test_github_52.py#L41