Closed cjh1 closed 3 years ago
That looks like a bug - we shouldn't go from SMILES to XYZ to InChI. SMILES to InChI/InChI key is preferred. Not sure why it would go to XYZ - that should be avoided in general as it forces Open Babel to guess bonding which is present in CJSON, SDF and SMILES.
If I remove the XYZ step from the conversion, Open Babel gives me empty strings for the inchikeys. There is the following comment in the conversion code, so I am guessing there was some issue?
# Hackish for now, convert to xyz first...
I can confirm that in the cases where we get the warning/error from Open Babel, an inchikey is still returned; it must just be a duplicate.
Here is an example of a SMILES string and its associated inchikey for which we get the error/warning:
[CH3]=CCOC(=[CH3])O.COCC/C=[O]\[C]1(=[O][C@@H]2C(=[C]3=C(O[CH2]=C(O)O)CC(=[C](=C3[C@H]([C@@H]2C)O)O)NC(=O)/C(=C\CC[CH](=[C](=[OH])[C@@H]([CH](=[OH])[CH2][CH3])[CH3])[CH3])/[CH3])[C@H]1O)C => IGJIZKHSEUVLBN-UHFFFAOYSA-N
I am also seeing the following exception:
girder_1 | Traceback (most recent call last):
girder_1 | File "/usr/lib/python3.5/concurrent/futures/_base.py", line 297, in _invoke_callbacks
girder_1 | callback(self)
girder_1 | File "/mongochemserver/girder/molecules/molecules/utilities/async_requests.py", line 64, in _finish_svg_gen
girder_1 | raise ValidationException('Invalid inchikey (%s)' % inchikey)
girder_1 | girder.exceptions.ValidationException: Invalid inchikey (AWUHMVUVLZGRBZ-HHHXNRCGSA-N)
@cjh1 That is strange. The SVG generation is done asynchronously. It saves a copy of the inchikey here so that, when it finishes, it can find the molecule and update it. That exception means the query couldn't find a molecule with that inchikey...
With the changes to our conversion code, I am now able to import 97884 structures.
I was checking the data and found an example that might shed some light on the issue.
For the molecule C13H22O, there are 6 different isomers (?) with different inchi keys. In the workflow to create these 3D structures, I basically used the "original smiles" from the csv file in the notebook above as input for this class. Subsequently, the structure was optimized at the PM3 level with ORCA. As you can see, the inchi is not used at this point. However, to create the CJSON file I used the optimized XYZ and converted that into an inchi:
molecule = readstring('xyz', xyz_string)
obconv.SetOutFormat(str("inchi"))
obconv.AddOption(str("a"), openbabel.OBConversion.OUTOPTIONS)
inchi_text = obconv.WriteString(molecule.OBMol).split()[0]
What I should have done instead was to take the smiles and convert it into an inchi, as explained by @cryos. In addition to that, there seem to be some duplicated inchis in the csv file as well. As we discussed over video conference, that was kind of expected, too. Does this make sense now as an explanation?
It seems to me that the smiles I used retained the isomeric information, so it should be safe to use them to generate the inchi keys.
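A minimal sketch of the direct SMILES to inchikey route described above, assuming Open Babel's Python bindings (`pybel`) are installed. The function name `smiles_to_inchikey` and its `readstring` parameter are illustrative and not taken from the project's code; the parameter only exists so the function can be exercised without Open Babel. Because an empty result was reported earlier in this thread, the sketch treats an empty key as a failure rather than storing it.

```python
def smiles_to_inchikey(smiles, readstring=None):
    """Convert a SMILES string straight to an InChIKey, with no XYZ step."""
    if readstring is None:
        # Import lazily so that the module loads even without Open Babel.
        try:
            from openbabel import pybel  # Open Babel 3.x layout
        except ImportError:
            import pybel  # Open Babel 2.x layout
        readstring = pybel.readstring

    # "smi" is the SMILES input format, so the bonding and stereo
    # information written in the string is kept instead of being guessed.
    molecule = readstring("smi", smiles)

    # "inchikey" is an Open Babel output format; strip the trailing newline.
    key = molecule.write("inchikey").strip()
    if not key:
        # Do not silently store an empty key in the database.
        raise ValueError("No InChIKey generated for SMILES: %s" % smiles)
    return key
```

Calling `smiles_to_inchikey("CCO")` with Open Babel installed should return the key for ethanol. Whether this route works for every SMILES in the data set is something to check against the actual data.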
This is working very nicely for the most part!
However, I get an exception after every chunk (default chunk size == 1000) completes, and then I have to restart it if I want to keep going. Here is the exception (which causes other exceptions):
Traceback (most recent call last):
File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/client_reqrep.py", line 552, in write_bytes
await self.body.write(writer)
File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/payload.py", line 231, in write
await writer.write(self._value)
File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/http_writer.py", line 101, in write
self._write(chunk)
File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/http_writer.py", line 67, in _write
raise ConnectionResetError('Cannot write to closing transport')
ConnectionResetError: Cannot write to closing transport
@psavery Did you say you were only seeing this issue on one machine?
@cjh1 No, it worked the first time I ran it on a different machine. But then I started running into the same problem again. I'm not sure what causes it, but it seems to happen more often than not for me.
@psavery Can you test this out again and see if you can recreate the problem you were seeing?
I still ran into the issue I was seeing earlier (on multiple computers as well). However, I was able to fix it by using a new ClientSession for each http request, similar to what was done to fix a problem here.
This is not that desirable, as I think a ClientSession should be present for the duration of the program. However, it fixes the issue I was experiencing. I was able to go through the full ingest, and I got 98596 molecules in my database.
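The workaround can be sketched roughly as follows, assuming `aiohttp` is installed. The helper name `post_json`, its parameters, and the `session_factory` hook are hypothetical and not taken from the ingest script; the hook only makes it possible to try the function without a network connection.

```python
import asyncio


async def post_json(url, payload, session_factory=None):
    """POST a JSON payload using a fresh client session for this one request."""
    if session_factory is None:
        import aiohttp  # imported lazily; assumed to be installed
        session_factory = aiohttp.ClientSession

    # A new session is opened and closed for every call, so no connection
    # from an earlier request is reused once its transport is closing.
    async with session_factory() as session:
        async with session.post(url, json=payload) as response:
            response.raise_for_status()
            return await response.json()
```

The trade-off is the one mentioned above: connection pooling is lost, so each request pays the cost of setting up a new connection.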
@psavery Thanks for testing this and getting to the bottom of the issue!
So this data set has 107486 molecules, but after running the ingest we end up with only 46476 molecules in the database. As part of the ingestion code, we use the inchikey to check if we already have that molecule in the database.
Looking at the raw data, it looks like we have ~14982 molecules that have the same inchi.
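One way to get such a count from the raw CSV is sketched below, using only the standard library. The column name `inchi` and the sample rows are assumptions for illustration; the real file's header may differ.

```python
import csv
import io
from collections import Counter


def count_duplicate_keys(csv_text, column="inchi"):
    """Count rows whose value in `column` was already seen in an earlier row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    unique = len(counts)
    # "redundant" rows are the ones an ingest keyed on this column would skip.
    return {"total": total, "unique": unique, "redundant": total - unique}


# Tiny illustrative input: two spellings of ethanol share one InChI.
sample = (
    "smiles,inchi\n"
    'CCO,"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"\n'
    'OCC,"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"\n'
    "C,InChI=1S/CH4/h1H4\n"
)
summary = count_duplicate_keys(sample)
```

Comparing the `unique` count with the number of molecules that actually land in the database helps separate genuine duplicates from molecules lost to failed conversions.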
Then, looking at the server logs, I am seeing the following error from Open Babel as we try to generate the inchikey:
So it's having trouble creating the inchikey for a big portion of this dataset, hence they don't end up in the database (or have the same inchikey?)
Tracing the code path for the generation of the inchikey, it seems pretty tortuous to me. Here are the formats it seems to go through:
CJSON -- (Avogadro) --> SDF -- OpenBabel --> SMILES -- OpenBabel --> XYZ -- OpenBabel --> inchikey
@cryos @muammar @psavery Apart from anything else, this conversion path seems to drop any bonding information.
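To illustrate why the XYZ step loses bonding, here is a small standard-library sketch of what an XYZ file contains: an atom count, a comment line, and one `element x y z` line per atom. The `molecule` dictionary is a hypothetical, loosely CJSON-like record with approximate water coordinates, not the project's actual data model.

```python
def to_xyz(atoms, comment=""):
    """Serialize atoms as XYZ text: count, comment, then 'element x y z' lines."""
    lines = [str(len(atoms)), comment]
    for element, x, y, z in atoms:
        lines.append(f"{element} {x:.4f} {y:.4f} {z:.4f}")
    return "\n".join(lines) + "\n"


# Hypothetical record that carries explicit bonds alongside the coordinates.
molecule = {
    "atoms": [
        ("O", 0.0, 0.0, 0.1173),
        ("H", 0.0, 0.7572, -0.4692),
        ("H", 0.0, -0.7572, -0.4692),
    ],
    "bonds": [(0, 1, 1), (0, 2, 1)],  # (begin atom, end atom, bond order)
}

# The format has no field for bonds, so they are simply not written.
xyz_text = to_xyz(molecule["atoms"], comment="water")
```

Anything reading `xyz_text` back has to infer bonds and bond orders from interatomic distances, which is the guessing step that a direct conversion from SMILES, SDF or CJSON avoids.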