ModelSEED / ModelSEEDDatabase

This repository contains the definitive copy of the biochemistry and metadata used to construct models using the ModelSEED/ProbAnno approach
Other
52 stars 38 forks source link

Merging formulas to normalize them despite sources. Added Notes column #68

Closed samseaver closed 6 years ago

samseaver commented 6 years ago

This is a bit of a double-whammy, all formulas updated so that there's no extra characters, parentheses, multipliers, etc. so to normalize across the board, and make it easier when balancing reactions.

In addition, if formulas had multipliers, either explicit or implicit (i.e. '(formula)[xn*]') we take note in the Notes column, as "PO" for polymers.

samseaver commented 6 years ago

@JamesJeffryes The output of the travis checks is massive, is there a place where we can find a summary of the issues blocking the merge?

JamesJeffryes commented 6 years ago

Generally you have to look at the bottom of the raw log. The build is failing because of this error 19 new unbalanced errors detected Basically there are 19 more non-obsolete reactions that are unbalanced (according to the formulas, not the status column) relative to the master branch. Unfortunately it does not track which ones changed at the moment (it just stores the overall count). I am guessing this is probably caused by compounds that did not have a valid formula now bing parseable but because the "notes" change affects every line I cant really tell.

mmundy42 commented 6 years ago

Take a look at cpd00021, cpd00030, cpd00058, cpd00149, and cpd00244. The formulas include the charge.

samseaver commented 6 years ago

Yes, with hindsight, it was a mistake to commit both a massive change (i.e. the addition of a single column) and spot-changes (i.e. the update of different formulas).

Is Travis monitoring whether reaction balances are changing ? We should probably clear up why Travis had these 19 reactions listed as "balanced" in the first place.

samseaver commented 6 years ago

In any case, yes, I'm very confident in the formula-parsing-re-merging I did, here are some examples:

Updating cpd17447: (SO4. Ca)2. H2O --> H2Ca2O9S2 Updating cpd17517: SO4. Co --> CoO4S Updating cpd20546: Mg(Al,Fe)Si4O10(OH).4H2O --> H9AlFeMgO15Si4

I was also going to follow this up with the new set of reaction balances too, so we can see which reactions would have a different status, so lets merge, and then review.

Finally, this goes back to what I was saying about using pretty-printed JSON objects, it helps see the diffs between individual columns.

JamesJeffryes commented 6 years ago

I suspect they were "balanced" because one or more of their compounds had an unparseable formula. Here's part of the diff of the raw log of your PR (<) vs the master branch (>) that might help

< Number of reactions with unbalanced equations: 7475


Number of reactions with unbalanced equations: 7456

22208,22209c22893,22894

< Reactant atoms:[('H', 4.0), ('O', 7.0), ('P', 2.0)]

< Product atoms:[('H', 4.0), ('O', 12.0), ('P', 4.0)]


Reactant atoms:[('H', 3.0), ('O', 4.0), ('P', 1.0)]

Product atoms:[('H', 2.0), ('O', 6.0), ('P', 2.0)]

22309c22994

< Product atoms:[('C', 10.0), ('H', 14.0), ('N', 2.0), ('O', 6.0)]


Product atoms:[('C', 5.0), ('H', 7.0), ('N', 1.0), ('O', 3.0)]

22673c23358

< Product atoms:[('C', 44.0), ('H', 68.0), ('N', 4.0), ('O', 34.0)]


Product atoms:[('C', 22.0), ('H', 34.0), ('N', 2.0), ('O', 17.0)]

23000,23001c23685,23686

< Reactant atoms:[('C', 34.0), ('H', 49.0), ('N', 7.0), ('O', 47.0), ('P', 2.0), ('S', 5.0)]

< Product atoms:[('C', 62.0), ('H', 91.0), ('N', 9.0), ('O', 82.0), ('P', 2.0), ('S', 10.0)]


Reactant atoms:[('C', 22.0), ('H', 30.0), ('N', 6.0), ('O', 30.0), ('P', 2.0), ('S', 3.0)]

Product atoms:[('C', 36.0), ('H', 51.0), ('N', 7.0), ('O', 46.0), ('P', 2.0), ('S', 5.0)]

23023,23030d23707

< Line 05964: rxn06114

< Reactant atoms:[('C', 38.0), ('H', 53.0), ('N', 7.0), ('O', 35.0), ('P', 2.0), ('S', 1.0)]

< Product atoms:[('C', 38.0), ('H', 53.0), ('N', 7.0), ('O', 38.0), ('P', 2.0), ('S', 2.0)]

<

< Line 05965: rxn06115

< Reactant atoms:[('C', 38.0), ('H', 53.0), ('N', 7.0), ('O', 35.0), ('P', 2.0), ('S', 1.0)]

< Product atoms:[('C', 38.0), ('H', 53.0), ('N', 7.0), ('O', 38.0), ('P', 2.0), ('S', 2.0)]

<

23124c23801

< Reactant atoms:[('C', 52.0), ('H', 74.0), ('O', 49.0)]


Reactant atoms:[('C', 26.0), ('H', 38.0), ('O', 25.0)]

23160c23837

< Reactant atoms:[('C', 28.0), ('H', 42.0), ('N', 2.0), ('O', 24.0)]


Reactant atoms:[('C', 14.0), ('H', 21.0), ('N', 1.0), ('O', 12.0)]

23243,23254d23919

< Line 06068: rxn06225

< Reactant atoms:[('C', 25.0), ('H', 44.0), ('N', 2.0), ('O', 31.0), ('P', 4.0)]

< Product atoms:[('C', 31.0), ('H', 54.0), ('N', 2.0), ('O', 36.0), ('P', 4.0)]

<

< Line 06069: rxn06226

< Reactant atoms:[('C', 27.0), ('H', 47.0), ('N', 3.0), ('O', 31.0), ('P', 4.0)]

< Product atoms:[('C', 35.0), ('H', 60.0), ('N', 4.0), ('O', 36.0), ('P', 4.0)]

<

< Line 06070: rxn06227

< Reactant atoms:[('C', 23.0), ('H', 42.0), ('N', 6.0), ('O', 29.0), ('P', 5.0)]

< Product atoms:[('C', 26.0), ('H', 47.0), ('N', 7.0), ('O', 30.0), ('P', 5.0)]

<

23413c24078

< Product atoms:[('C', 64.0), ('H', 96.0), ('N', 12.0), ('O', 36.0), ('R', 4.0)]


Product atoms:[('C', 32.0), ('H', 48.0), ('N', 6.0), ('O', 18.0), ('R', 2.0)]

23448c24113

< Reactant atoms:[('C', 4.0), ('H', 8.0), ('O', 4.0)]


Reactant atoms:[('C', 2.0), ('H', 4.0), ('O', 3.0)]

23452c24117

< Reactant atoms:[('C', 4.0), ('H', 10.0), ('N', 1.0), ('O', 2.0), ('R', 1.0)]


Reactant atoms:[('C', 2.0), ('H', 6.0), ('N', 1.0), ('O', 1.0), ('R', 1.0)]

23644c24309

< Reactant atoms:[('C', 12.0), ('H', 16.0), ('O', 12.0)]


Reactant atoms:[('C', 6.0), ('H', 8.0), ('O', 6.0)]

23789c24454

< Product atoms:[('C', 10.0), ('H', 17.0), ('O', 7.0), ('P', 2.0)]


Product atoms:[('C', 5.0), ('H', 9.0), ('O', 7.0), ('P', 2.0)]

23860,23861c24525,24526

< Reactant atoms:[('C', 78.0), ('H', 119.0), ('N', 11.0), ('O', 97.0), ('P', 2.0), ('S', 11.0)]

< Product atoms:[('C', 34.0), ('H', 49.0), ('N', 7.0), ('O', 56.0), ('P', 2.0), ('S', 8.0)]


Reactant atoms:[('C', 44.0), ('H', 65.0), ('N', 8.0), ('O', 55.0), ('P', 2.0), ('S', 6.0)]

Product atoms:[('C', 22.0), ('H', 30.0), ('N', 6.0), ('O', 33.0), ('P', 2.0), ('S', 4.0)]

23884,23885c24549,24550

< Reactant atoms:[('C', 28.0), ('H', 42.0), ('N', 2.0), ('O', 22.0)]

< Product atoms:[('C', 56.0), ('H', 84.0), ('N', 4.0), ('O', 62.0), ('S', 6.0)]


Reactant atoms:[('C', 14.0), ('H', 21.0), ('N', 1.0), ('O', 11.0)]

Product atoms:[('C', 28.0), ('H', 42.0), ('N', 2.0), ('O', 31.0), ('S', 3.0)]

24011,24014d24675

< Line 06489: rxn06657

< Reactant atoms:[('C', 57.0), ('H', 87.0), ('N', 6.0), ('O', 41.0), ('S', 1.0)]

< Product atoms:[('C', 58.0), ('H', 89.0), ('N', 6.0), ('O', 41.0), ('S', 1.0)]

<

24039,24042d24699

< Line 06511: rxn06679

< Reactant atoms:[('C', 51.0), ('H', 83.0), ('N', 6.0), ('O', 35.0), ('S', 1.0)]

< Product atoms:[('C', 52.0), ('H', 85.0), ('N', 6.0), ('O', 35.0), ('S', 1.0)]

<

24437c25094

< Product atoms:[('C', 61.0), ('H', 84.0), ('N', 2.0), ('O', 60.0), ('P', 2.0)]


Product atoms:[('C', 35.0), ('H', 48.0), ('N', 2.0), ('O', 36.0), ('P', 2.0)]

24477c25134

< Product atoms:[('C', 37.0), ('H', 54.0), ('N', 4.0), ('O', 36.0), ('P', 2.0)]


Product atoms:[('C', 23.0), ('H', 33.0), ('N', 3.0), ('O', 24.0), ('P', 2.0)]

24641c25298

< Product atoms:[('C', 16.0), ('H', 28.0), ('O', 4.0)]


Product atoms:[('C', 8.0), ('H', 14.0), ('O', 2.0)]

24645c25302

< Product atoms:[('C', 8.0), ('H', 12.0), ('O', 4.0)]


Product atoms:[('C', 4.0), ('H', 6.0), ('O', 2.0)]

25192,25193c25849,25850

< Reactant atoms:[('C', 66.0), ('H', 96.0), ('N', 9.0), ('O', 75.0), ('P', 2.0), ('S', 7.0)]

< Product atoms:[('C', 38.0), ('H', 53.0), ('N', 7.0), ('O', 38.0), ('P', 2.0), ('S', 2.0)]


Reactant atoms:[('C', 38.0), ('H', 54.0), ('N', 7.0), ('O', 44.0), ('P', 2.0), ('S', 4.0)]

Product atoms:[('C', 24.0), ('H', 32.0), ('N', 6.0), ('O', 24.0), ('P', 2.0), ('S', 1.0)]

26195,26198d26851

< Line 09470: rxn09995

< Reactant atoms:[('C', 10.0), ('H', 18.0), ('O', 9.0)]

< Product atoms:[('C', 5.0), ('H', 10.0), ('O', 5.0)]

<

27116c27769

< Reactant atoms:[('C', 10.0), ('H', 18.0), ('O', 9.0)]


Reactant atoms:[('C', 5.0), ('H', 10.0), ('O', 5.0)]

27868c28521

< Reactant atoms:[('C', 12.0), ('H', 16.0), ('O', 12.0)]


Reactant atoms:[('C', 6.0), ('H', 8.0), ('O', 6.0)]

29524c30177

< Reactant atoms:[('C', 52.0), ('H', 74.0), ('O', 49.0)]


Reactant atoms:[('C', 26.0), ('H', 38.0), ('O', 25.0)]

29881c30534,30538

< Product atoms:[('C', 32.0), ('H', 42.0), ('N', 10.0), ('O', 34.0), ('P', 4.0)]


Product atoms:[('C', 26.0), ('H', 34.0), ('N', 10.0), ('O', 28.0), ('P', 4.0)]

Line 12641: rxn13807

Reactant atoms:[('C', 6.0), ('H', 8.0), ('O', 6.0)]

Product atoms:[('C', 12.0), ('H', 16.0), ('O', 12.0)]

30252,30253c30909,30910

< Reactant atoms:[('C', 12.0), ('H', 16.0), ('O', 14.0)]

< Product atoms:[('C', 12.0), ('H', 16.0), ('O', 12.0)]


Reactant atoms:[('C', 6.0), ('H', 8.0), ('O', 7.0)]

Product atoms:[('C', 6.0), ('H', 8.0), ('O', 6.0)]

30341c30998

< Product atoms:[('H', 8.0), ('O', 24.0), ('P', 8.0)]


Product atoms:[('H', 4.0), ('O', 12.0), ('P', 4.0)]

30375,30378d31031

< Line 13339: rxn15088

< Reactant atoms:[('C', 12.0), ('H', 22.0), ('O', 11.0)]

< Product atoms:[('C', 6.0), ('H', 12.0), ('O', 6.0)]

<

30400c31053

< Reactant atoms:[('C', 12.0), ('H', 22.0), ('O', 11.0)]


Reactant atoms:[('C', 6.0), ('H', 12.0), ('O', 6.0)]

30404c31057

< Reactant atoms:[('C', 12.0), ('H', 22.0), ('O', 11.0)]


Reactant atoms:[('C', 6.0), ('H', 12.0), ('O', 6.0)]

30408c31061

< Reactant atoms:[('C', 12.0), ('H', 22.0), ('O', 11.0)]


Reactant atoms:[('C', 6.0), ('H', 12.0), ('O', 6.0)]

30476c31129

< Reactant atoms:[('C', 52.0), ('H', 74.0), ('O', 49.0)]


Reactant atoms:[('C', 26.0), ('H', 38.0), ('O', 25.0)]

30535,30538d31187

< Line 13657: rxn15434

< Reactant atoms:[('C', 5.0), ('H', 10.0), ('O', 7.0), ('P', 2.0)]

< Product atoms:[('C', 10.0), ('H', 18.0), ('O', 7.0), ('P', 2.0)]

<

30601c31250

< Product atoms:[('C', 46.0), ('H', 66.0), ('N', 6.0), ('O', 48.0), ('P', 4.0)]


Product atoms:[('C', 32.0), ('H', 45.0), ('N', 5.0), ('O', 36.0), ('P', 4.0)]

30627,30634d31275

< Line 13777: rxn15556

< Reactant atoms:[('C', 12.0), ('H', 21.0), ('O', 14.0), ('P', 1.0)]

< Product atoms:[('C', 6.0), ('H', 11.0), ('O', 9.0), ('P', 1.0)]

<

< Line 13778: rxn15557

< Reactant atoms:[('C', 15.0), ('H', 22.0), ('N', 2.0), ('O', 17.0), ('P', 2.0)]

< Product atoms:[('C', 21.0), ('H', 32.0), ('N', 2.0), ('O', 22.0), ('P', 2.0)]

<

30639,30646d31279

< Line 13780: rxn15559

< Reactant atoms:[('C', 22.0), ('H', 33.0), ('N', 5.0), ('O', 21.0), ('P', 2.0)]

< Product atoms:[('C', 16.0), ('H', 23.0), ('N', 5.0), ('O', 16.0), ('P', 2.0)]

<

< Line 13782: rxn15561

< Reactant atoms:[('C', 15.0), ('H', 22.0), ('N', 2.0), ('O', 17.0), ('P', 2.0)]

< Product atoms:[('C', 21.0), ('H', 32.0), ('N', 2.0), ('O', 22.0), ('P', 2.0)]

<

30669c31302

< Product atoms:[('C', 30.0), ('H', 52.0), ('O', 26.0)]


Product atoms:[('C', 18.0), ('H', 32.0), ('O', 16.0)]

30683,30686d31315

< Line 13796: rxn15575

< Reactant atoms:[('C', 12.0), ('H', 22.0), ('O', 11.0)]

< Product atoms:[('C', 6.0), ('H', 12.0), ('O', 6.0)]

<

30703,30706d31331

< Line 13802: rxn15581

< Reactant atoms:[('C', 14.0), ('H', 20.0), ('N', 2.0), ('O', 16.0), ('P', 2.0)]

< Product atoms:[('C', 19.0), ('H', 28.0), ('N', 2.0), ('O', 20.0), ('P', 2.0)]

<

30721c31346

< Product atoms:[('C', 24.0), ('H', 40.0), ('O', 20.0)]


Product atoms:[('C', 12.0), ('H', 20.0), ('O', 10.0)]

30731,30734d31355

< Line 13817: rxn15597

< Reactant atoms:[('C', 24.0), ('H', 42.0), ('O', 21.0)]

< Product atoms:[('C', 12.0), ('H', 22.0), ('O', 11.0)]

<

30743,30750d31363

< Line 13820: rxn15600

< Reactant atoms:[('C', 5.0), ('H', 10.0), ('O', 5.0)]

< Product atoms:[('C', 10.0), ('H', 18.0), ('O', 9.0)]

<

< Line 13821: rxn15601

< Reactant atoms:[('C', 5.0), ('H', 10.0), ('O', 5.0)]

< Product atoms:[('C', 10.0), ('H', 18.0), ('O', 9.0)]

<

32656c33269

< Reactant atoms:[('C', 25.0), ('H', 44.0), ('N', 2.0), ('O', 31.0), ('P', 4.0)]


Reactant atoms:[('C', 20.0), ('H', 33.0), ('N', 2.0), ('O', 24.0), ('P', 3.0)]

33489c34102

< Product atoms:[('C', 28.0), ('H', 42.0), ('N', 2.0), ('O', 24.0)]


Product atoms:[('C', 14.0), ('H', 21.0), ('N', 1.0), ('O', 12.0)]

33645c34258

< Product atoms:[('C', 24.0), ('H', 40.0), ('O', 22.0)]


Product atoms:[('C', 12.0), ('H', 20.0), ('O', 11.0)]

44452c45065

< Reactant atoms:[('C', 48.0), ('H', 82.0), ('O', 41.0)]


Reactant atoms:[('C', 24.0), ('H', 42.0), ('O', 21.0)]

46252c46865

< Reactant atoms:[('C', 12.0), ('H', 20.0), ('O', 10.0)]


Reactant atoms:[('C', 6.0), ('H', 10.0), ('O', 5.0)]

46467,46470d47079

< Line 26977: rxn31251

< Reactant atoms:[('C', 24.0), ('H', 28.0), ('N', 8.0), ('O', 9.0)] < Product atoms:[('C', 29.0), ('H', 35.0), ('N', 9.0), ('O', 12.0)]

On Mon, Sep 18, 2017 at 4:11 PM, samseaver notifications@github.com wrote:

Yes, with hindsight, it was a mistake to commit both a massive change (i.e. the addition of a single column) and spot-changes (i.e. the update of different formulas).

Is Travis monitoring whether reaction balances are changing ? We should probably clear up why Travis had these 19 reactions listed as "balanced" in the first place.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/ModelSEED/ModelSEEDDatabase/pull/68#issuecomment-330357508, or mute the thread https://github.com/notifications/unsubscribe-auth/AIGfOTpmLVDIV6Sk3VXtFwzlH4X-FBIlks5sjtxrgaJpZM4PZKFC .

JamesJeffryes commented 6 years ago

basically any rxn id that appears above is a new "unbalanced reaction". I would love switch everything over to pretty print json! I'd have to rewrite the tests but it would be more legible than TSV @mmundy42 do you have a lot code outside this repo that uses the current tsv schema?

samseaver commented 6 years ago

I spot-checked one of these: rxn06114

This happens: Updating cpd11670: C14H21NO112 --> C28H42N2O22 Updating cpd11716: C14H21NO14S2 --> C28H42N2O28S2

The stored formula had a multiplier. However, in my current reaction balancing code (not yet committed), the status of rxn06114 is still "OK".

samseaver commented 6 years ago

Ah-ha, I just found a filter I had in my reaction balancing code, so now I can see that I'm getting the same newly reported imbalances. I'm recording them, and what I will do is I will follow up on them individually to clean up those imbalances and re-commit.

mmundy42 commented 6 years ago

Yes, I have some code that uses the tsv schema but it would be easy to switch to a JSON schema.

One thing to consider is how to handle fields that have no values in tsv files. Would all "records" in the JSON file have a key for every field?

Also, it would be good to make sure that fields are consistently defined. For example, in a biomasses file, all of the numeric values should be floats. I'd suggest defining a json-schema for each file and using the jsonschema Python package to validate the files.

JamesJeffryes commented 6 years ago

Love the idea of schema validation as part of the tests. I would say "field": null for missing info is my preference. It's explicit which is better for manual curation and schema validation.

samseaver commented 6 years ago

Sweet! Updating my code to specifically assume most multipliers should default to '1' results in passed checks on reaction balancing.

samseaver commented 6 years ago

I've been using a "null" text string for the TSV schema, I'm ambivalent as to whether we should maintain using a "null" string or a true null value.

JamesJeffryes commented 6 years ago

Great! This is how the checks are supposed to work.