epam / Indigo

Universal cheminformatics toolkit, utilities and database search tools
http://lifescience.opensource.epam.com
Apache License 2.0

[CDX] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 29: invalid start byte when converting parsed molecules to SMILES or molfile #1160

Closed eloyfelix closed 1 year ago

eloyfelix commented 1 year ago

Steps to Reproduce

  1. Environment: Indigo library built from the master branch (commit https://github.com/epam/Indigo/commit/1fd76de5be68b180e97f9a313771294e6c8d1498), Python 3.11.3 on Ubuntu Linux.

  2. Script to reproduce the issue:

from indigo import Indigo

indigo = Indigo()

for mol in indigo.iterateCDXFile("US06168253-20010102-C00003.CDX"):
    print("heavy atoms:", mol.countHeavyAtoms())
    print("smiles:", mol.smiles())
heavy atoms: 13
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[17], line 3
      1 for mol in indigo.iterateCDXFile("US06168253-20010102-C00003.CDX"):
      2     print("heavy atoms:", mol.countHeavyAtoms())
----> 3     print("smiles:", mol.smiles())

File ~/.pyenv/versions/3.11.3/envs/uspto-patents/lib/python3.11/site-packages/indigo/indigo/indigo_object.py:3375, in IndigoObject.smiles(self)
   3368 def smiles(self):
   3369     """Molecule or reaction method calculates SMILES for the structure
   3370 
   3371     Returns:
   3372         str: smiles string
   3373     """
-> 3375     return IndigoLib.checkResultString(self._lib().indigoSmiles(self.id))

File ~/.pyenv/versions/3.11.3/envs/uspto-patents/lib/python3.11/site-packages/indigo/indigo/indigo_lib.py:1026, in IndigoLib.checkResultString(result, exception_class)
   1022 @staticmethod
   1023 def checkResultString(
   1024     result: bytes, exception_class: Type[Exception] = IndigoException
   1025 ):
-> 1026     return IndigoLib.checkResultPtr(result, exception_class).decode()

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 29: invalid start byte

Expected behavior: molecules are converted to SMILES or Molfiles.

Actual behavior: The parser seems to read the molecule, since some information (such as the heavy atom count) can be retrieved from it, but writing SMILES, Molfile, or other text representations fails.
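The traceback points at the `.decode()` call in `checkResultString`, which decodes the returned bytes as strict UTF-8. The following is a minimal sketch of that failure mode in plain Python. The payload is hypothetical (the actual bytes Indigo returned for this file are not shown in this report); it only places the byte 0xa2 at position 29 to mirror the error message.

```python
# Hypothetical payload: 29 ASCII bytes followed by 0xa2. The real bytes
# returned by Indigo for this CDX file are not known here.
raw = b"C" * 29 + b"\xa2"

# bytes.decode() defaults to UTF-8 with strict error handling, so a lone
# 0xa2 (a continuation byte with no lead byte) is rejected.
try:
    raw.decode()
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)

# Lenient alternatives that do not raise:
replaced = raw.decode("utf-8", errors="replace")  # 0xa2 becomes U+FFFD
latin1 = raw.decode("latin-1")                    # 0xa2 becomes U+00A2
```

This suggests, without confirming, that the text being written contains a label stored in a single-byte encoding rather than UTF-8.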

Attachments: US06168253-20010102-C00003.zip

Indigo/Bingo version
Indigo built from master branch, commit: https://github.com/epam/Indigo/commit/1fd76de5be68b180e97f9a313771294e6c8d1498

eloyfelix commented 1 year ago

I also get several errors like:

SMILES saver: character 0xd is not allowed inside pseudo-atom
SMILES saver: ';' not allowed inside pseudo-atom

I guess these might be related. I can provide examples.

from indigo import Indigo

indigo = Indigo()

for mol in indigo.iterateCDXFile("US11267801-20220308-C00027.CDX"):
    print("heavy atoms:", mol.countHeavyAtoms())
    print("smiles:", mol.smiles())
heavy atoms: 50
---------------------------------------------------------------------------
IndigoException                           Traceback (most recent call last)
Cell In[4], line 7
      5 for mol in indigo.iterateCDXFile("US11267801-20220308-C00027.CDX"):
      6     print("heavy atoms:", mol.countHeavyAtoms())
----> 7     print("smiles:", mol.smiles())

File ~/.pyenv/versions/3.11.3/envs/uspto-patents/lib/python3.11/site-packages/indigo/indigo/indigo_object.py:3375, in IndigoObject.smiles(self)
   3368 def smiles(self):
   3369     """Molecule or reaction method calculates SMILES for the structure
   3370 
   3371     Returns:
   3372         str: smiles string
   3373     """
-> 3375     return IndigoLib.checkResultString(self._lib().indigoSmiles(self.id))

File ~/.pyenv/versions/3.11.3/envs/uspto-patents/lib/python3.11/site-packages/indigo/indigo/indigo_lib.py:1026, in IndigoLib.checkResultString(result, exception_class)
   1022 @staticmethod
   1023 def checkResultString(
   1024     result: bytes, exception_class: Type[Exception] = IndigoException
   1025 ):
-> 1026     return IndigoLib.checkResultPtr(result, exception_class).decode()

File ~/.pyenv/versions/3.11.3/envs/uspto-patents/lib/python3.11/site-packages/indigo/indigo/indigo_lib.py:1019, in IndigoLib.checkResultPtr(result, exception_class)
   1017 if result is None:
   1018     assert IndigoLib.lib
-> 1019     raise exception_class(IndigoLib.lib.indigoGetLastError())
   1020 return result

IndigoException: SMILES saver: character 0xd is not allowed inside pseudo-atom

US11267801-20220308-C00027.zip

eloyfelix commented 1 year ago

In case it helps, generating the SMILES in "daylight" format instead of the default "chemaxon" makes things work for both examples.

from indigo import Indigo

indigo = Indigo()

indigo.setOption("smiles-saving-format", "daylight")

for mol in indigo.iterateCDXFile("US06168253-20010102-C00003.CDX"):
    print("heavy atoms:", mol.countHeavyAtoms())
    print("smiles:", mol.smiles())

for mol in indigo.iterateCDXFile("US11267801-20220308-C00027.CDX"):
    print("heavy atoms:", mol.countHeavyAtoms())
    print("smiles:", mol.smiles())
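For batch processing, the workaround above could be applied only to the molecules that fail. The sketch below is one possible way to do that. The helper name `smiles_with_fallback` and its parameters are hypothetical and not part of Indigo. It relies only on `setOption("smiles-saving-format", ...)` and `mol.smiles()` as used above, and it assumes "chemaxon" is the default format, as stated in this thread.

```python
def smiles_with_fallback(indigo, mol, fallback_format="daylight",
                         default_format="chemaxon"):
    """Return mol.smiles(), retrying once in fallback_format on failure.

    Hypothetical helper: `indigo` is an Indigo session and `mol` a molecule
    obtained from it. Any exception from the first attempt (for example
    UnicodeDecodeError or IndigoException) triggers the retry.
    """
    try:
        return mol.smiles()
    except Exception:
        # Switch the saving format, retry, then restore the assumed default
        # so later molecules are still written in the original format.
        indigo.setOption("smiles-saving-format", fallback_format)
        try:
            return mol.smiles()
        finally:
            indigo.setOption("smiles-saving-format", default_format)
```

With this helper, the loop body would call `smiles_with_fallback(indigo, mol)` instead of `mol.smiles()`. Note that the fallback output may omit whatever extension data triggered the error, so the two formats are not guaranteed to be equivalent.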