Closed samseaver closed 5 years ago
So, this initial addition of KEGG compounds is incomplete because I've still got to double-check the structures and the formulas, but the key script, "Add_New_Compounds.py", has been added. This script is a little convoluted because the compound structures are the heart of the matching process, but it goes like this:
(1) It'll check that the identifier is not already in the database
(2) It'll check that the InChI structure, if available, is not already in the database
(3) It'll check that the SMILES structure, if available, is not already in the database
(4) It'll check that the name is not already in the database
Depending on the level at which a "match" occurs, the script will add names and aliases (the new KEGG ID may be added to the matched compound when the match is made via structure). Then, if it's truly a new compound, it'll add it to the database. The output looks like this:
Compounds matched via:
ID: 17113
InChI: 140
NAMES: 52
SMILE: 11
Saving additional names for 622 compounds
Saving additional KEGG aliases for 202 compounds
Saving 332 new compounds from KEGG
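For anyone skimming, the matching order described above can be sketched roughly as follows. This is only an illustration, not the code in Add_New_Compounds.py: the function name `match_compound`, the four lookup dictionaries, and the sample records are all hypothetical.

```python
from collections import Counter


def match_compound(new, by_id, by_inchi, by_smiles, by_name):
    """Return (level, matched_id), or (None, None) if the compound looks new.

    All arguments are hypothetical stand-ins for whatever the real script uses.
    """
    # Checks run in the order described in the comment: ID, InChI, SMILES.
    checks = (
        ("ID", new.get("id"), by_id),
        ("InChI", new.get("inchi"), by_inchi),
        ("SMILE", new.get("smiles"), by_smiles),
    )
    for level, key, index in checks:
        if key and key in index:
            return level, index[key]
    # Names are checked last, case-insensitively (an assumption).
    for name in new.get("names", []):
        if name.lower() in by_name:
            return "NAMES", by_name[name.lower()]
    return None, None


# Illustrative lookup tables and candidate records.
by_id = {"C00001": "cpd00001"}
by_inchi = {"InChI=1S/H2O/h1H2": "cpd00001"}
by_smiles = {"O": "cpd00001"}
by_name = {"water": "cpd00001"}

candidates = [
    {"id": "C00001", "names": ["Water"]},
    {"id": "C99991", "inchi": "InChI=1S/H2O/h1H2"},
    {"id": "C99992", "smiles": "O"},
    {"id": "C99993", "names": ["WATER"]},
    {"id": "C99994", "names": ["Unobtainium"]},
]

# Tally how each candidate matched; None counts the truly new compounds.
tally = Counter(
    match_compound(c, by_id, by_inchi, by_smiles, by_name)[0] for c in candidates
)
```

The tally mirrors the "Compounds matched via" summary, with the `None` bucket corresponding to compounds that would be saved as new.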
Several scripts need to be run as a follow-up to make sure that the new data passes muster and is fully integrated; these are listed at the bottom of the Add_New_Compounds.py script.
I'm going to follow up on these commits by running them and double-checking the integration of the new structures.
There's so much back-tracking going on, and I'm acutely aware that I'm overloading this PR, but the latest commits deserve special mention. In attempting to balance the new set of reactions, I discovered that my code was dropping the '*' character from SMILES strings. This latest fix corrects that, and also corrects the formula for some 3K compounds, replacing the '*' with an "R" to indicate the generic group(s). I'm continuing to test reaction balancing for both the current set and the new set of reactions from KEGG.
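A minimal sketch of the formula fix, assuming the dropped character is the SMILES wildcard atom '*' and that each wildcard becomes one "R" in the formula. The helper name `add_generic_groups` is hypothetical, and the real logic in the repository may differ (for example, in where "R" is placed in the formula).

```python
def add_generic_groups(formula: str, smiles: str) -> str:
    """Append an R term to `formula` for every '*' wildcard in `smiles`.

    Assumption: the formula was computed from the concrete atoms only,
    so the generic groups have to be added back explicitly.
    """
    n_generic = smiles.count("*")
    if n_generic == 0:
        return formula
    # Follow the usual convention of omitting a count of 1.
    suffix = "R" if n_generic == 1 else f"R{n_generic}"
    return formula + suffix
```

For example, a generic carboxylic acid written as `*C(=O)O` with concrete formula `CHO2` would come out as `CHO2R`.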
OK, so, I had to back-track one more time as I realized I needed to account for the redundancy of compounds in the equations, including so-called "empty" reactions. I really believe I've got it all this time.
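One way to picture the redundancy handling is to net out each compound's coefficients before comparing or balancing equations. The sketch below is an assumption about the approach, not the code in this PR; the tuple layout, the compartment index, and the compound identifiers are placeholders.

```python
from collections import defaultdict


def net_stoichiometry(terms):
    """Sum coefficients per (compound, compartment); negative means reactant.

    `terms` is an iterable of (coefficient, compound_id, compartment) tuples,
    a hypothetical layout chosen for this example.
    """
    net = defaultdict(float)
    for coeff, compound, compartment in terms:
        net[(compound, compartment)] += coeff
    # Drop compounds that cancel out exactly.
    return {key: value for key, value in net.items() if abs(value) > 1e-9}


def is_empty(terms):
    """A reaction is "empty" if every compound cancels on both sides."""
    return not net_stoichiometry(terms)


# cpdA appears on both sides in the same compartment and cancels out.
redundant = [(-1, "cpdA", 0), (-1, "cpdB", 0), (1, "cpdA", 0), (1, "cpdC", 0)]
netted = net_stoichiometry(redundant)
```

Keying on the compartment as well as the compound keeps transport reactions (same compound, different compartments) from being flagged as empty.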
@JamesJeffryes I want to merge this PR into dev and work in the same branch for adding MetaCyc biochemistry, but can you review this? The two things I care the most about are matching reactions and determining whether they're balanced. The three scripts for you to skim are:
Add_New_Compounds.py
Add_New_Reactions.py
Update_Formula_Charge.py
I'm not expecting an extensive review of any code, just to see if anything jumps out at you.
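To make the "balanced" question concrete for review, here is a rough sketch of a mass and charge check. It is not the logic in Add_New_Reactions.py or Update_Formula_Charge.py; the function names and the (coefficient, formula, charge) layout are assumptions, and the simple regex only handles flat formulas such as "C6H12O6" (a generic "R" is counted like any other symbol).

```python
import re
from collections import Counter

# One capital letter, an optional lowercase letter, then an optional count.
_ATOM = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Count atoms in a flat formula string (no parentheses or hydrates)."""
    counts = Counter()
    for element, number in _ATOM.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def imbalance(terms):
    """Return (leftover_atoms, leftover_charge) for a reaction.

    `terms` is an iterable of (coefficient, formula, charge) tuples with
    negative coefficients for reactants. A balanced reaction gives ({}, 0).
    """
    atoms = Counter()
    charge = 0
    for coeff, formula, q in terms:
        for element, n in parse_formula(formula).items():
            atoms[element] += coeff * n
        charge += coeff * q
    leftover = {element: n for element, n in atoms.items() if n != 0}
    return leftover, charge
```

For instance, `imbalance([(-2, "H2", 0), (-1, "O2", 0), (2, "H2O", 0)])` reports nothing left over, while dropping the O2 term leaves an excess of oxygen on the product side.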
Merging in preparation to load MetaCyc
Not to be merged yet. The files were parsed into a more digestible format, and the next step will be to add them completely to the current biochemistry. It'll involve checking:
1) aliases
2) structures
3) names
4) equations
If there are any new entities, we'll increment the current ModelSEED identifier.
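Incrementing the identifier could look something like the sketch below. It assumes identifiers are a short prefix followed by a zero-padded number (for example "cpd" plus five digits); the helper name `next_identifier` and its defaults are hypothetical.

```python
def next_identifier(existing_ids, prefix="cpd", width=5):
    """Return the next unused identifier after the highest existing one.

    Assumes every identifier with this prefix ends in a plain integer.
    """
    highest = max(
        (int(i[len(prefix):]) for i in existing_ids if i.startswith(prefix)),
        default=0,
    )
    return f"{prefix}{highest + 1:0{width}d}"
```

Taking the maximum rather than the count avoids reusing a number if there are gaps in the existing identifiers.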