Closed samseaver closed 4 years ago
Do the aliases files in the Aliases folder need to be regenerated for the merged biochemistry? Can the aliases files use a two column format with the external ID and the ProbModelSEED ID?
The aliases don't need to be regenerated, but a new file in the proposed format can be generated while running PrintMaster(Compounds|Reactions). We don't want to overwrite the current list of aliases, though, because they reflect the respective states of the two databases we're merging.
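For illustration, a two-column alias file of the kind proposed above could be written along these lines. This is only a sketch: the function name `write_alias_table`, the in-memory dictionary layout, and the example IDs are assumptions made for the example, not anything taken from the PrintMaster scripts.

```python
import csv
import io


def write_alias_table(aliases, out):
    """Write one row per (external ID, ProbModelSEED ID) pair as tab-separated text.

    `aliases` is a hypothetical mapping: ProbModelSEED ID -> iterable of external IDs.
    Returns the number of data rows written (header excluded).
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["External ID", "ProbModelSEED ID"])
    # Flatten the mapping and sort so the output is stable between runs.
    rows = sorted((ext, ms_id) for ms_id, exts in aliases.items() for ext in exts)
    writer.writerows(rows)
    return len(rows)


# Usage with illustrative IDs only.
buffer = io.StringIO()
write_alias_table({"cpd00001": ["C00001", "META:WATER"]}, buffer)
print(buffer.getvalue())
```

Writing to a separate output stream keeps the existing alias files untouched, which matches the point about not overwriting the current lists.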
A couple of commits from today and earlier this summer were a step towards an independent biochemistry, allowing me to test reactions effectively without having to use either codebase; see scripts/Rebalance_Reactions.pl.
If we're substantially refactoring code, we might consider using Python and, more importantly, integrating some of the very nice Python chemistry libraries, which could substantially enhance our capabilities and reduce the amount of custom code we need to write.
There's no substantial refactoring here; I'm almost literally copying code from the ModelSEED codebase. But if we convert this repository to an SDK, then yeah, rewriting it in Python would be a necessary step.
I'm planning on setting up TravisCI to initially run Mike's validation scripts on each commit on my own repo. Once that's working, I suggest we integrate them here. I will write and submit more rigorous checkers based on RDKit soon.
This issue will serve as a placeholder for the list of things I need to check in the Biochemistry database to ensure its quality.
Generally speaking, there are several goals for the biochemistry:
1) All conditional/universal reactions in the ModelTemplates have an "OK" status.
2) All gapfilling reactions in ModelTemplates have an "OK" status.
3) All reactions from various sources are appropriately merged.
4) As many reactions as possible have an "OK" status.
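As a rough sketch of how goals 1 and 2 could be checked, the snippet below lists the template reactions whose status is anything other than "OK". The function name, the input structures, and the non-"OK" status strings are hypothetical; only the "OK" status itself comes from the goals above.

```python
def reactions_not_ok(template_reaction_ids, status_by_reaction):
    """Return the sorted template reaction IDs whose status is not "OK".

    `status_by_reaction` is a hypothetical mapping: reaction ID -> status string.
    Reactions absent from the mapping are reported too, since they cannot be
    confirmed as "OK".
    """
    return sorted(
        rxn_id
        for rxn_id in set(template_reaction_ids)
        if status_by_reaction.get(rxn_id, "MISSING") != "OK"
    )


# Usage with made-up IDs and a placeholder failure status.
statuses = {"rxnA": "OK", "rxnB": "UNBALANCED"}
print(reactions_not_ok(["rxnA", "rxnB", "rxnC"], statuses))
```

Running the same check separately over the conditional/universal set and the gapfilling set would give one report per goal.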
The first two goals are the highest priority for testing and release in ProbModelSEED.
However, a series of sub-tasks arises from trying to achieve these goals:
1) Editing compound charge and formula
2) Merging compounds
3) Splitting compounds (tricky)
4) Checking compound structures
5) Checking compound aliases
For sub-task 3, there are compounds that were merged incorrectly across database sources (usually discovered via aliases). "Splitting" them may generate a new compound object (and identifier) and, consequently, new reactions (and identifiers). Where possible, I'll attempt this last.
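One possible way to surface candidates for splitting is to flag compounds that carry more than one external ID from the same source. This is a heuristic sketch under assumptions: the `(compound_id, source, external_id)` row layout and all IDs are invented for the example, and several aliases from one source do not always mean a bad merge, so flagged compounds would still need manual review.

```python
from collections import defaultdict


def candidate_bad_merges(alias_rows):
    """Flag compounds with multiple distinct external IDs from a single source.

    `alias_rows` is a hypothetical iterable of (compound_id, source, external_id).
    Returns {compound_id: {source: [external IDs, sorted]}} for flagged compounds only.
    """
    seen = defaultdict(set)
    for compound_id, source, external_id in alias_rows:
        seen[(compound_id, source)].add(external_id)

    flagged = defaultdict(dict)
    for (compound_id, source), external_ids in seen.items():
        if len(external_ids) > 1:
            flagged[compound_id][source] = sorted(external_ids)
    return dict(flagged)


# Usage with made-up rows: cpdA has two IDs from the same source.
rows = [
    ("cpdA", "SourceX", "X1"),
    ("cpdA", "SourceX", "X2"),
    ("cpdA", "SourceY", "Y1"),
    ("cpdB", "SourceX", "X3"),
]
print(candidate_bad_merges(rows))
```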