ModelSEED / ModelSEEDDatabase

This repository contains the definitive copy of the biochemistry and metadata used to construct models using the ModelSEED/ProbAnno approach

Updating and Resolving Structural conflicts #123

Closed samseaver closed 5 years ago

samseaver commented 5 years ago

For the purpose of this PR, the files are in either Scripts/Structures or Biochemistry/Structures.

The branch was developed in conjunction with my work on updating KEGG biochemistry. Repeatedly spot-checking compound structures and formulas to confirm that they merged correctly led me to a better means of comparing structures and formulas independently of the database itself.

The two key scripts are Print_Structure_Formula_Charge.py and List_ModelSEED_Structures.py.

The former is run on the original and charged structures from both KEGG and MetaCyc, and produces two files containing the formula and charge as computed using RDKit and OpenBabel: KEGG/InChI_Charged_Formulas_Charges.txt and MetaCyc/InChI_Charged_Formulas_Charges.txt
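To illustrate what "formula and charge" means for an InChI structure, here is a minimal sketch. It is not what Print_Structure_Formula_Charge.py does (that script uses RDKit and OpenBabel); instead it reads the formula, /q and /p layers directly from the string using only the standard library. The function name formula_and_charge is hypothetical, and the sketch assumes a standard, single-component InChI.

```python
import re


def formula_and_charge(inchi):
    """Return (atom_counts, charge) for a standard, single-component InChI.

    Hypothetical helper for illustration only; it does not use RDKit or
    OpenBabel and does not handle multi-component or non-standard InChIs.
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    layers = inchi.split("/")
    # layers[0] is the version prefix; the formula layer, if present, is next.
    formula = layers[1] if len(layers) > 1 else ""
    if "." in formula or formula[:1].isdigit():
        raise ValueError("multi-component InChI not handled in this sketch")
    if not formula[:1].isupper():
        formula = ""  # e.g. a bare proton has no formula layer

    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)

    charge = 0
    for layer in layers[1:]:
        if layer.startswith("q"):
            # /q: net charge already carried by the atoms in the formula layer
            charge += int(layer[1:])
        elif layer.startswith("p"):
            # /p: protons added (+) or removed (-) relative to the formula layer
            protons = int(layer[1:])
            charge += protons
            counts["H"] = counts.get("H", 0) + protons
    return counts, charge


# Acetate: the formula layer is C2H4O2, with one proton removed (/p-1).
acetate = formula_and_charge("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1")
```

Adjusting the hydrogen count by the /p layer matters here because the formula layer of a charged structure can otherwise look identical to that of its neutral form.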

The latter attempts to merge all the structures into a single overarching list using aliases. It produces a master list: All_ModelSEED_Structures.txt

and attempts to find a correct set of unique structures (i.e. leaving out ModelSEED compounds for which there are multiple structures that have a conflict): Unique_ModelSEED_Structures.txt

It also prints all structural conflicts and, separately, all structural conflicts where the formula differs. These last two files aren't added in this PR but can be generated.
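The merge-and-conflict logic described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in List_ModelSEED_Structures.py: the function names, the shape of the alias and structure dictionaries, and the compound and source identifiers (cpdA, SRC1 and so on) are all hypothetical, and the formula is simply taken from the InChI formula layer.

```python
from collections import defaultdict


def merge_structures(aliases, structures):
    """Group external structures by ModelSEED compound using aliases.

    aliases:    {(source, external_id): modelseed_id}      (hypothetical shape)
    structures: {(source, external_id): structure_string}  (hypothetical shape)
    Returns (unique, conflicts): compounds with exactly one distinct
    structure, and compounds with more than one.
    """
    merged = defaultdict(set)
    for key, structure in structures.items():
        compound = aliases.get(key)
        if compound is not None:
            merged[compound].add(structure)
    unique = {c: next(iter(s)) for c, s in merged.items() if len(s) == 1}
    conflicts = {c: sorted(s) for c, s in merged.items() if len(s) > 1}
    return unique, conflicts


def formula_conflicts(conflicts, formula_of):
    """Keep only the conflicts whose structures disagree on formula."""
    return {
        c: s for c, s in conflicts.items()
        if len({formula_of(x) for x in s}) > 1
    }


# Toy data: identifiers are made up; the InChI strings are real compounds.
aliases = {
    ("SRC1", "a1"): "cpdA", ("SRC2", "a2"): "cpdA",
    ("SRC1", "b1"): "cpdB", ("SRC2", "b2"): "cpdB",
    ("SRC1", "c1"): "cpdC", ("SRC2", "c2"): "cpdC",
}
structures = {
    # Same structure from both sources: no conflict.
    ("SRC1", "a1"): "InChI=1S/H2O/h1H2",
    ("SRC2", "a2"): "InChI=1S/H2O/h1H2",
    # L- versus D-alanine: a structural conflict with the same formula.
    ("SRC1", "b1"): "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1",
    ("SRC2", "b2"): "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1",
    # Water versus hydrogen peroxide: the formula differs as well.
    ("SRC1", "c1"): "InChI=1S/H2O/h1H2",
    ("SRC2", "c2"): "InChI=1S/H2O2/c1-2/h1-2H",
}

unique, conflicts = merge_structures(aliases, structures)
differing = formula_conflicts(conflicts, lambda inchi: inchi.split("/")[1])
```

Separating the formula-differing conflicts from the rest is useful because a stereochemistry-only conflict (cpdB here) may just need a structure chosen, whereas a formula mismatch (cpdC) more likely points to a wrong alias or a compound that should be split.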

Structural conflicts will be resolved by either manually selecting the right structure or disambiguating a compound into two separate structures.

samseaver commented 5 years ago

I extended List_ModelSEED_Structures.py so that it now includes the calculated formulas and charges in All_ModelSEED_Structures.txt, which should make it easier to use. Merging ahead of the MetaCyc update.