ModelSEED / ModelSEEDDatabase

This repository contains the definitive copy of the biochemistry and metadata used to construct models using the ModelSEED/ProbAnno approach

Updating Structures, Formulas, Charges #116

Closed samseaver closed 5 years ago

samseaver commented 5 years ago

Don't merge this yet; this is my second attempt.

I integrated and double-checked the code for listing the new structural InChI and SMILES strings. Note, however, that the purchase of the new MarvinBeans license is still pending, so the "Charged" strings, which are used to determine formula and charge, are not yet updated. This update is therefore really just the addition of a few hundred more strings.

I'm now using RDKit, which is more reliable, with OpenBabel as a fall-back because RDKit fails on some unusual valences. Because this means I'm using a different SMILES formatter, a lot of the SMILES strings look different, though they represent the same molecules. This will likely change again when I get the chance to test the MarvinBeans representation.

My next step is to work on generating formulas. I'll use the same RDKit>OpenBabel combination to generate them reliably from InChI (and then from SMILES if InChI is not available).
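As a rough sketch of the fall-back order described above (this is not the code in the repository), the logic can be written as two nested fall-backs: InChI before SMILES, and a primary toolkit before a secondary one. The converter callables and the names `first_success` and `formula_for` are hypothetical; in practice the converters would wrap RDKit and OpenBabel calls, which are deliberately left out here.

```python
from typing import Callable, Optional, Sequence

# A converter takes a structure string and returns a formula string.
# It may raise, or return None, when the toolkit cannot handle the input.
# In a real pipeline these would wrap RDKit / OpenBabel (assumption).
Converter = Callable[[str], Optional[str]]


def first_success(value: Optional[str],
                  converters: Sequence[Converter]) -> Optional[str]:
    """Try each converter in order and return the first non-empty result."""
    if not value:
        return None
    for convert in converters:
        try:
            result = convert(value)
        except Exception:
            # A toolkit failure (e.g. an unusual valence) moves on to the next one.
            continue
        if result:
            return result
    return None


def formula_for(inchi: Optional[str],
                smiles: Optional[str],
                inchi_converters: Sequence[Converter],
                smiles_converters: Sequence[Converter]) -> Optional[str]:
    """Prefer InChI; use SMILES only when InChI is missing or unusable."""
    formula = first_success(inchi, inchi_converters)
    if formula is None:
        formula = first_success(smiles, smiles_converters)
    return formula
```

Passing the converters in as arguments keeps the ordering logic separate from the cheminformatics toolkits, so the fall-back behaviour can be checked with simple stub functions.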

samseaver commented 5 years ago

The last update allowed reaction statuses to be updated even where the status was degraded from "OK". Such a degradation indicates a problem. There are only 73 of them, and 12 of them are conditional reactions in the microbial templates, so I'll attempt to curate those.
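A minimal sketch of how such degraded reactions could be listed, assuming the statuses before and after the update are available as plain dictionaries keyed by reaction id. The function name, the dictionary layout, and the set of template reaction ids are illustrative assumptions, not the repository's actual data structures.

```python
def degraded_from_ok(old_status, new_status, template_ids=frozenset()):
    """Return (all degraded ids, degraded ids that also appear in templates).

    A reaction counts as degraded when it was "OK" before the update and
    has any other status afterwards. Reactions missing from new_status
    are ignored.
    """
    degraded = sorted(
        rxn
        for rxn, status in old_status.items()
        if status == "OK" and rxn in new_status and new_status[rxn] != "OK"
    )
    in_templates = [rxn for rxn in degraded if rxn in template_ids]
    return degraded, in_templates
```

The second list is the one that would be prioritised for curation, since those reactions affect the templates directly.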

samseaver commented 5 years ago

I found that my code was skipping SMILES strings where InChI wasn't available, which wasn't my original intention. With that fixed, the formulas for a lot more compounds were updated appropriately, again using code from RDKit and OpenBabel in ./Update_Formula_Charge.py.

This led to another 2000+ reactions whose status improved from an imbalance to "OK", and the biochemistry now has over 25K balanced reactions (over 70%). Unfortunately, it also led to other reactions degrading from "OK" to an imbalance, 80 of which are in the templates.
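To illustrate why a changed formula or charge can flip a reaction between "OK" and an imbalance, here is a simplified balance check. It is a sketch under stated assumptions: formulas are flat strings such as `H2O` (no brackets or repeat groups), stoichiometric coefficients are negative for reactants and positive for products, and the function names and the `"imbalanced"` label are made up for the example rather than taken from the repository.

```python
import re
from collections import Counter

# Matches an element symbol followed by an optional count, e.g. "H2" or "O".
ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Turn a flat formula such as 'C6H12O6' into element counts."""
    counts = Counter()
    for element, n in ELEMENT.findall(formula):
        counts[element] += int(n) if n else 1
    return counts


def reaction_status(stoichiometry, compounds):
    """Return "OK" if mass and charge balance, otherwise "imbalanced".

    stoichiometry: {compound_id: coefficient}, negative for reactants.
    compounds:     {compound_id: (formula, charge)}.
    """
    mass = Counter()
    charge = 0
    for cpd_id, coeff in stoichiometry.items():
        formula, cpd_charge = compounds[cpd_id]
        for element, n in parse_formula(formula).items():
            mass[element] += coeff * n
        charge += coeff * cpd_charge
    mass_balanced = all(n == 0 for n in mass.values())
    return "OK" if mass_balanced and charge == 0 else "imbalanced"
```

Because every reaction that uses a compound inherits its formula and charge, correcting one compound can balance some reactions and unbalance others, which matches the mixed outcome described above.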

I've kept a record of these and will be incrementally curating them; the last two commits here are from one such attempt. It seems that some records for Gibberellin A1 (GA1) were incorrectly merged with a glycan (http://rest.kegg.jp/get/C06136; GA1), and both have associated structures. Fortunately, there is an original record for GA1, so it was easy for me to make sure the right compound id and structures were used in the right reactions.

samseaver commented 5 years ago

I left this open for a while, in case I thought of anything. It passed all checks, so I'm merging.