For the purpose of this PR, the files are in either Scripts/Structures or Biochemistry/Structures.
The branch was developed alongside my work on updating the KEGG biochemistry. Repeatedly spot-checking compound structures and formulas to confirm that they had merged correctly led me to a better way of comparing structures and formulas independently of the database itself.
The two key scripts are:
Print_Structure_Formula_Charge.py
List_ModelSEED_Structures.py
The former is run on the original and charged structures from both KEGG and MetaCyc, and produces two files containing the formula and charge as computed using RDKit and OpenBabel:
KEGG/InChI_Charged_Formulas_Charges.txt
MetaCyc/InChI_Charged_Formulas_Charges.txt
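As a rough cross-check on the computed values (this is not what Print_Structure_Formula_Charge.py does, since it relies on RDKit and OpenBabel), the formula and net charge can also be read directly off the layers of an InChI string. The sketch below uses only the standard library. The function name is hypothetical, and it assumes a single-component standard InChI, where the net charge is the sum of the /q and /p layers and the /p layer adjusts the hydrogen count.

```python
import re

def formula_and_charge_from_inchi(inchi):
    """Return (formula, net charge) read from a single-component standard InChI.

    Assumption: no multi-component formulas ("." separators or leading
    multipliers) and no fixed-H (/f) layer.
    """
    layers = inchi.split("/")
    formula_layer = layers[1]
    charge = 0   # value of the /q layer, if present
    protons = 0  # value of the /p layer, if present
    for layer in layers[2:]:
        if layer.startswith("q"):
            charge = int(layer[1:])
        elif layer.startswith("p"):
            protons = int(layer[1:])

    # Count atoms in the formula layer, e.g. "C2H4O2" -> {"C": 2, "H": 4, "O": 2}.
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula_layer):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)

    # The formula layer excludes the /p protons, so apply them here.
    if protons:
        counts["H"] = counts.get("H", 0) + protons

    formula = "".join(
        element + (str(n) if n > 1 else "")
        for element, n in counts.items()
        if n > 0
    )
    return formula, charge + protons

# Acetic acid and its deprotonated (charged) form.
print(formula_and_charge_from_inchi("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"))
print(formula_and_charge_from_inchi("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1"))
```

A disagreement between a value read this way and the value reported by RDKit or OpenBabel would be a reasonable prompt to look at that structure more closely.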
List_ModelSEED_Structures.py attempts to merge all the structures into a single overarching list using aliases. It produces a master list:
All_ModelSEED_Structures.txt
and attempts to find a correct set of unique structures (i.e. leaving out ModelSEED compounds for which there are multiple structures that have a conflict):
Unique_ModelSEED_Structures.txt
It also prints all structural conflicts and, separately, the structural conflicts where the formula differs. These last two files aren't added in this PR but can be generated.
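The merge and conflict logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in List_ModelSEED_Structures.py: the function names, the toy identifiers (cpdA, kegg:X1 and so on) and the in-memory dictionaries are all assumptions, and the formula is simply taken from the InChI formula layer.

```python
from collections import defaultdict

def merge_structures(aliases, structures):
    """Group structures by ModelSEED ID via an external-ID alias map."""
    merged = defaultdict(set)
    for external_id, structure in structures.items():
        modelseed_id = aliases.get(external_id)
        if modelseed_id is not None:
            merged[modelseed_id].add(structure)
    return merged

def split_unique_and_conflicts(merged):
    """Separate compounds with one agreed structure from those with several."""
    unique = {cpd: next(iter(s)) for cpd, s in merged.items() if len(s) == 1}
    conflicts = {cpd: sorted(s) for cpd, s in merged.items() if len(s) > 1}
    return unique, conflicts

def formula_conflicts(conflicts, formula_of):
    """Keep only the conflicts whose structures also disagree on formula."""
    return {
        cpd: structs
        for cpd, structs in conflicts.items()
        if len({formula_of(s) for s in structs}) > 1
    }

# Toy data: hypothetical IDs, real InChI strings for small molecules.
ACETIC_ACID = "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
DIMETHYL_ETHER = "InChI=1S/C2H6O/c1-3-2/h1-2H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"

aliases = {
    "kegg:X1": "cpdA", "metacyc:Y1": "cpdA",
    "kegg:X2": "cpdB", "metacyc:Y2": "cpdB",
    "kegg:X3": "cpdC", "metacyc:Y3": "cpdC",
}
structures = {
    "kegg:X1": ACETIC_ACID, "metacyc:Y1": ACETIC_ACID,   # sources agree
    "kegg:X2": ETHANOL, "metacyc:Y2": DIMETHYL_ETHER,    # same formula, different structure
    "kegg:X3": ETHANOL, "metacyc:Y3": METHANOL,          # different formula
}

merged = merge_structures(aliases, structures)
unique, conflicts = split_unique_and_conflicts(merged)
by_formula = formula_conflicts(conflicts, lambda inchi: inchi.split("/")[1])

print("unique:", sorted(unique))
print("conflicts:", sorted(conflicts))
print("formula conflicts:", sorted(by_formula))
```

In this toy case cpdA lands in the unique set, cpdB and cpdC are both structural conflicts, and only cpdC is also a formula conflict.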
Structural conflicts will be resolved by either manually selecting the right structure or disambiguating a compound into two separate structures.
I extended List_ModelSEED_Structures.py so that it now includes the calculated formulas and charges in All_ModelSEED_Structures.txt, which makes the file easier to use. Merging ahead of the MetaCyc update.