OpenChemistry / mongochemdeploy

Scripts to install/deploy the MongoChem server/web client
BSD 3-Clause "New" or "Revised" License

Add ingest script for metatlas dataset #83

Closed cjh1 closed 3 years ago

cjh1 commented 5 years ago

This dataset has 107486 molecules, but after running the ingest we end up with only 46476 molecules in the database. As part of the ingestion code we use the inchikey to check whether we already have that molecule in the database.

Looking at the raw data, it looks like we have ~14982 molecules that have the same inchi.
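
As a rough way to reproduce that count, here is a minimal sketch that tallies repeated InChI strings in a CSV export. The column name `inchi` and the helper name `count_duplicate_inchis` are assumptions for illustration, not taken from the actual dataset or ingest script.

```python
import csv
import io
from collections import Counter


def count_duplicate_inchis(csv_file, column="inchi"):
    """Return (total rows, distinct values, rows repeating an earlier value)."""
    # "column" is a hypothetical header name; adjust it to the real CSV.
    counts = Counter(row[column] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return total, len(counts), total - len(counts)


# Tiny in-memory example: two methane rows and one water row.
sample = io.StringIO(
    "name,inchi\n"
    "a,InChI=1S/CH4/h1H4\n"
    "b,InChI=1S/CH4/h1H4\n"
    "c,InChI=1S/H2O/h1H2\n"
)
print(count_duplicate_inchis(sample))
```

The third number is the number of rows an ingest that deduplicates on this column would skip.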

Then, looking at the server logs, I am seeing the following error from Open Babel as we try to generate the inchikey:

*** Open Babel Warning  in CreateCisTrans
  Error in cis/trans stereochemistry specified for the double bond

So it's having trouble creating the inchikey for a big portion of this dataset, hence they don't end up in the database (or have the same inchikey?).

Tracing the code path for generating the inchikey, it seems pretty tortuous to me. Here are the formats it seems to go through:

CJSON -- (Avogadro) --> SDF -- OpenBabel --> SMILES -- OpenBabel --> XYZ -- OpenBabel --> inchikey

@cryos @muammar @psavery Apart from anything else this conversion path seems to drop any bonding information.

cryos commented 5 years ago

That looks like a bug - we shouldn't go from SMILES to XYZ to InChI. SMILES to InChI/InChI key is preferred. Not sure why it would go to XYZ - that should be avoided in general as it forces Open Babel to guess bonding which is present in CJSON, SDF and SMILES.
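
For reference, a minimal sketch of the direct SMILES to InChIKey route, assuming the Open Babel 3.x Python bindings (`from openbabel import pybel`) and an Open Babel build that includes InChI support. The helper name `smiles_to_inchikey` is hypothetical and this is not the project's actual conversion code.

```python
def smiles_to_inchikey(smiles):
    """Return the InChIKey for a SMILES string, or None if none is produced."""
    # Imported lazily so this module loads even without Open Babel installed.
    # This is the Open Babel 3.x layout; 2.x used a top-level "import pybel".
    from openbabel import pybel

    # Read the SMILES directly, so the bonding it encodes is kept rather
    # than guessed from coordinates as it would be after an XYZ round trip.
    molecule = pybel.readstring("smi", smiles)
    key = molecule.write("inchikey").strip()
    return key or None
```

Returning `None` for an empty result makes a failed conversion easy to detect instead of storing an empty key.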

cjh1 commented 5 years ago

If I remove the XYZ step from the conversion, Open Babel gives me empty strings for the inchikeys. There is the following comment on the conversion, so I am guessing there was some issue?


# Hackish for now, convert to xyz first...

cjh1 commented 5 years ago

I can confirm that in the cases where we get the warning/error from Open Babel, an inchikey is still returned; it must just be a duplicate.

cjh1 commented 5 years ago

This is a list of all the duplicate inchis in the raw dataset.

cjh1 commented 5 years ago

Here is an example of a SMILES string and its associated inchikey for which we get the error/warning:

[CH3]=CCOC(=[CH3])O.COCC/C=[O]\[C]1(=[O][C@@H]2C(=[C]3=C(O[CH2]=C(O)O)CC(=[C](=C3[C@H]([C@@H]2C)O)O)NC(=O)/C(=C\CC[CH](=[C](=[OH])[C@@H]([CH](=[OH])[CH2][CH3])[CH3])[CH3])/[CH3])[C@H]1O)C => IGJIZKHSEUVLBN-UHFFFAOYSA-N
cjh1 commented 5 years ago

I am also seeing the following exception:

girder_1          | Traceback (most recent call last):
girder_1          |   File "/usr/lib/python3.5/concurrent/futures/_base.py", line 297, in _invoke_callbacks
girder_1          |     callback(self)
girder_1          |   File "/mongochemserver/girder/molecules/molecules/utilities/async_requests.py", line 64, in _finish_svg_gen
girder_1          |     raise ValidationException('Invalid inchikey (%s)' % inchikey)
girder_1          | girder.exceptions.ValidationException: Invalid inchikey (AWUHMVUVLZGRBZ-HHHXNRCGSA-N)
psavery commented 5 years ago

@cjh1 That is strange. The SVG generation is done asynchronously. It saves a copy of the inchikey here so that it can find the molecule and update it when it finishes. That exception means the query couldn't find a molecule with that inchikey...
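
To back up the point that the key itself is not malformed, here is a small format check. It assumes the usual 27-character InChIKey layout (14 uppercase letters, 10 uppercase letters, 1 uppercase letter, separated by hyphens); the function name is hypothetical and this is not how the server validates keys.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter: 27 characters in total.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_well_formed_inchikey(key):
    """Check only the shape of the key, not whether a molecule exists for it."""
    return INCHIKEY_PATTERN.fullmatch(key) is not None


# The key from the traceback above.
print(is_well_formed_inchikey("AWUHMVUVLZGRBZ-HHHXNRCGSA-N"))
```

A key that passes this check but still raises the exception points to the lookup rather than the key format.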

cjh1 commented 5 years ago

With the changes to our conversion code, I am now able to import 97884 structures.

muammar commented 5 years ago

I was checking the data and found an example that might shed some light on the issue.

JupyterLab

For the molecule C13H22O, there are 6 different isomers (?) with different inchi keys. In the workflow to create these 3D structures, I basically used the "original smiles" from the csv file in the notebook above as input for this class. The structure was subsequently optimized at the PM3 level with ORCA. As you can see, the inchi is not used at this point. However, to create the CJSON file I used the optimized XYZ and converted that into an inchi:

molecule = readstring('xyz', xyz_string)
obconv.SetOutFormat(str("inchi"))
obconv.AddOption(str("a"), openbabel.OBConversion.OUTOPTIONS)
inchi_text = obconv.WriteString(molecule.OBMol).split()[0]

What I should have done instead was to take the SMILES and convert it into an inchi, as explained by @cryos. In addition, there seem to be some duplicated inchis in the csv file as well. As we discussed over video conference, that was kind of expected, too. Does this make sense as an explanation?

It seems to me that the SMILES I used retained isomeric information, so it should be safe to use them to generate the inchi key.

psavery commented 5 years ago

This is working very nicely for the most part!

However, I get an exception after every chunk (default chunk size == 1000) completes, and then I have to restart it if I want to keep going. Here is my exception (which causes other exceptions):

Traceback (most recent call last):
  File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/client_reqrep.py", line 552, in write_bytes
    await self.body.write(writer)
  File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/payload.py", line 231, in write
    await writer.write(self._value)
  File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/http_writer.py", line 101, in write
    self._write(chunk)
  File "/home/patrick/virtualenvs/mongochemdeploy/lib/python3.7/site-packages/aiohttp/http_writer.py", line 67, in _write
    raise ConnectionResetError('Cannot write to closing transport')
ConnectionResetError: Cannot write to closing transport
cjh1 commented 5 years ago

@psavery Did you say you were only seeing this issue on one machine?

psavery commented 5 years ago

@cjh1 No, it worked the first time I ran it on a different machine. But then I started running into the same problem again. I'm not sure what causes it, but it seems to happen more often than not for me.

cjh1 commented 4 years ago

@psavery Can you test this out again and see if you can recreate the problem you were seeing?

psavery commented 3 years ago

I still ran into the issue I was seeing earlier (on multiple computers as well). However, I was able to fix it by using a new ClientSession for each HTTP request, similar to what was done to fix a problem here.

This is not that desirable, as I think a ClientSession should be present for the duration of the program. However, it fixes the issue I was experiencing. I was able to go through the full ingest, and I got 98596 molecules in my database.
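
The workaround can be sketched roughly as follows. This is an illustration under stated assumptions, not the ingest script's code: the function names and the `session_factory` parameter are hypothetical, and `session_factory` is expected to be `aiohttp.ClientSession` or something with the same async context manager interface.

```python
import asyncio


async def post_with_fresh_session(session_factory, url, payload):
    """POST payload as JSON using a session that lives for this request only."""
    # A new session (and so a new connection pool) is created on every call,
    # so a transport closed after an earlier request is never reused.
    async with session_factory() as session:
        async with session.post(url, json=payload) as response:
            response.raise_for_status()
            return await response.json()


async def ingest_chunk(session_factory, url, molecules):
    """Upload one chunk sequentially, one short-lived session per molecule."""
    return [
        await post_with_fresh_session(session_factory, url, molecule)
        for molecule in molecules
    ]


# Example call (requires aiohttp and a running server):
#   asyncio.run(ingest_chunk(aiohttp.ClientSession, url, molecules))
```

Passing the session class in as a parameter keeps the sketch independent of aiohttp itself. The trade-off is the one noted above: connections are not reused between requests.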

cjh1 commented 3 years ago

@psavery Thanks for testing this and getting to the bottom of the issue!