Addition of parsed biochemistry from two primary sources

samseaver commented 5 years ago

Not to be merged yet. The files have been parsed into a more digestible format, and the next step will be to integrate them fully into the current biochemistry. That will involve checking:

1) aliases
2) structures
3) names
4) equations

If there are any new entities, we'll increment the current ModelSEED identifier.
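As a rough illustration of what "incrementing the current ModelSEED identifier" could look like, here is a minimal sketch. It assumes identifiers of the form `cpd` followed by five zero-padded digits (e.g. `cpd00001`) and that a new entity simply takes the highest existing number plus one. The function name `next_identifier` is hypothetical and is not taken from the repository scripts.

```python
import re


def next_identifier(existing_ids, prefix="cpd", width=5):
    """Return the next free identifier for the given prefix.

    Assumption: identifiers are the prefix followed by a zero-padded
    integer, and new entities take max(existing) + 1.
    """
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)$")
    # Keep only the identifiers that carry the requested prefix.
    numbers = [int(m.group(1)) for m in map(pattern.match, existing_ids) if m]
    return f"{prefix}{max(numbers, default=0) + 1:0{width}d}"
```

The same helper would work for reactions by passing `prefix="rxn"`.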

samseaver commented 5 years ago

So, this initial addition of KEGG compounds is incomplete because I still have to double-check the structures and the formulas, but the key script, "Add_New_Compounds.py", has been added. The script is a little convoluted because the compound structures are the heart of the matching process, but it goes like this:

(1) It'll check that the identifier is not already in the database
(2) It'll check that the InChI structure, if available, is not already in the database
(3) It'll check that the SMILES structure, if available, is not already in the database
(4) It'll check that the name is not already in the database

Depending on the level at which a "match" occurs, the script will add names and aliases to the matched compound (for example, a new KEGG ID may be added to a compound matched via its structure). If it's truly a new compound, the script will add it to the database. The output looks like this:

Compounds matched via:
ID: 17113
InChI: 140
NAMES: 52
SMILE: 11
Saving additional names for 622 compounds
Saving additional KEGG aliases for 202 compounds
Saving 332 new compounds from KEGG
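The four-level matching cascade described above can be sketched roughly as follows. This is not the code in Add_New_Compounds.py. The `Compound` class, its fields, and `match_compound` are hypothetical names chosen for illustration, and the sketch assumes exact string equality for structures and names, which the real script may well handle differently.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Compound:
    # Hypothetical record layout, for illustration only.
    id: str
    names: set = field(default_factory=set)
    inchi: Optional[str] = None
    smiles: Optional[str] = None
    aliases: set = field(default_factory=set)


def match_compound(new, database):
    """Return (match_level, matched_compound), or (None, None) if new."""
    # (1) Is the source identifier already recorded as an alias?
    for cpd in database:
        if new.id in cpd.aliases:
            return "ID", cpd
    # (2) Is the InChI structure, if available, already present?
    if new.inchi:
        for cpd in database:
            if cpd.inchi == new.inchi:
                return "InChI", cpd
    # (3) Is the SMILES structure, if available, already present?
    if new.smiles:
        for cpd in database:
            if cpd.smiles == new.smiles:
                return "SMILE", cpd
    # (4) Does any name overlap with an existing compound?
    for cpd in database:
        if new.names & cpd.names:
            return "NAMES", cpd
    # No match at any level: treat it as a truly new compound.
    return None, None
```

Checking in this order means a structure match takes precedence over a name match, so a compound whose name is ambiguous is still attached to the entry with the same structure.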

Several scripts need to be run as a follow-up to make sure that the new data passes muster and is fully integrated; these are listed at the bottom of the Add_New_Compounds.py script.

I'm going to follow up on these commits by running those scripts and double-checking the integration of the new structures.

samseaver commented 5 years ago

There's so much back-tracking going on, and I'm acutely aware that I'm over-loading this PR, but the latest commits deserve special mention. In attempting to balance the new set of reactions, I discovered that my code was dropping the '*' character from SMILES strings. The latest fix corrects that, along with the formulas of some 3K compounds, by replacing the '*' with an "R" to indicate the generic group(s). I'm continuing to test reaction balancing for both the current set and the new set of reactions from KEGG.
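To make the fix concrete, here is a minimal sketch, assuming the dropped character is the SMILES wildcard atom `*` and that each wildcard should contribute one "R" to the formula. The function `add_generic_groups` is a hypothetical name, and appending the R term to the end of the formula is an assumption about the formatting, not something confirmed by the scripts in this PR.

```python
def add_generic_groups(formula, smiles):
    """Append an R term for every wildcard atom found in the SMILES string.

    Assumption: every '*' character in the SMILES string stands for one
    generic group, including bracketed forms such as [*].
    """
    n_generic = smiles.count("*")
    if n_generic == 0:
        return formula
    # Follow the usual formula convention of omitting a count of one.
    return formula + ("R" if n_generic == 1 else f"R{n_generic}")
```

For example, `add_generic_groups("C2H3O", "*C(C)=O")` gives `"C2H3OR"`, while a SMILES string without a wildcard leaves the formula unchanged.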

samseaver commented 5 years ago

OK, so, I had to back-track one more time as I realized I needed to account for the redundancy of compounds in the equations, including so-called "empty" reactions. I really believe I've got it all this time.
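One way to read "redundancy of compounds in the equations" is that the same compound can appear on both sides of a reaction, so its coefficients have to be netted out before matching or balancing. The sketch below works under that assumption. The function names and the signed-coefficient representation (negative for reactants, positive for products) are hypothetical and not taken from Add_New_Reactions.py.

```python
from collections import defaultdict


def net_stoichiometry(terms):
    """Sum signed coefficients per compound and drop those that cancel.

    `terms` is a list of (coefficient, compound_key) pairs. For transport
    reactions the key should include the compartment, for example
    ("cpd00001", "e"), so that the two sides do not cancel.
    """
    totals = defaultdict(float)
    for coefficient, compound in terms:
        totals[compound] += coefficient
    return {cpd: c for cpd, c in totals.items() if c != 0}


def is_empty_reaction(terms):
    """A reaction is "empty" when every compound cancels out."""
    return not net_stoichiometry(terms)
```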

samseaver commented 5 years ago

@JamesJeffryes I want to merge this PR into dev and work in the same branch for adding MetaCyc biochemistry, but can you review this? The two things I care most about are matching reactions and determining whether they're balanced. The three scripts for you to skim are:

Add_New_Compounds.py
Add_New_Reactions.py
Update_Formula_Charge.py

I'm not expecting an extensive review of any code, just to see if anything jumps out at you.
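For reviewers unfamiliar with the balancing check, the idea can be sketched as below: a reaction is balanced when every element and the total charge sum to zero across signed stoichiometric coefficients. This is a simplified illustration, not the logic in the scripts above. The function names are hypothetical, and the formula parser assumes flat formulas such as `C6H12O6` with no parentheses, treating "R" like any other element symbol.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C6H12O6' (no parentheses)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(terms, formulas, charges):
    """Check mass and charge balance.

    `terms` is a list of (coefficient, compound) pairs with negative
    coefficients for reactants and positive ones for products.
    """
    elements = Counter()
    charge = 0
    for coefficient, compound in terms:
        for element, count in parse_formula(formulas[compound]).items():
            elements[element] += coefficient * count
        charge += coefficient * charges[compound]
    return all(total == 0 for total in elements.values()) and charge == 0
```

Keeping mass and charge in the same pass makes it easy to later report which of the two failed, which is often the more useful piece of information when curating reactions.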

samseaver commented 5 years ago

Merging in preparation to load MetaCyc

ModelSEED / ModelSEEDDatabase

Addition of parsed biochemistry from two primary sources #118