SysBioChalmers / Yarrowia_lipolytica_W29-GEM

Genome-scale model of Yarrowia lipolytica.
http://www.nature.com/articles/npjsba20165
Creative Commons Attribution 4.0 International
8 stars 8 forks source link

Fix sbml errors #5

Closed ChristianLieven closed 6 years ago

ChristianLieven commented 6 years ago

In this branch I've tried to fix the three main errors that I saw when validating the current Yali model on http://sbml.org/Facilities/Validator/:

(SBML Validation Rule #fbc-20603) A <fluxObjective> object must have the required attributes 'fbc:reaction' and 'fbc:coefficient', and may have the optional attributes 'fbc:id' and 'fbc:name'. No other attributes from the SBML Level 3 Flux Balance Constraints namespace are permitted on a <fluxObjective> object. Reference: L3V1 Fbc V2, Section 3.7 Fbc attribute 'reaction' is missing.
(SBML Validation Rule #fbc-21203) A <GeneProduct> object must have the required attributes 'fbc:id' and 'fbc:label' may have the optional attributes 'fbc:name' and 'fbc:associatedSpecies'. No other attributes from the SBML Level 3 Flux Balance Constraints namespace are permitted on a <GeneProduct> object. Reference: L3V1 Fbc V2 Section 3.5 Fbc attribute 'label' is missing from 'geneProduct' object.
 (SBML Validation Rule #fbc-20303) The value of attribute 'fbc:chemicalFormula' on the SBML <species> object must be set to a string consisting only of atomic names or user defined compounds and their occurrence. Reference: L3V1 Fbc V2 Section 3.4 Encountered '(' when expecting a capital letter.

In addition, I added a model ID (simply "iYali4") so that the SBML can be imported with cobrapy. I was able to remove the parenthesis from 33 metabolite formulae using the formulae defined for these metabolites in yeast-GEM (https://github.com/SysBioChalmers/yeast-GEM). 16 metabolites with formulae remain, which, technically still means that the SBML is invalid, but at least in cobrapy these errors are only treated as warnings and hence I/O is now possible.

The remaining metabolites are:

[('(1-3)-beta-D-glucan', '(C6H10O5)n'),
 ('(1-3)-beta-D-glucan', '(C6H10O5)n'),
 ('(1-6)-beta-D-glucan', 'C12H22O11(C6H10O5)n'),
 ('alpha-D-mannosyl-beta-D-mannosyldiacetylchitobiosyldiphosphodolichol',
  'C38H68N2O27P2(C5H8)n'),
 ('beta-D-mannosyldiacetylchitobiosyldiphosphodolichol',
  'C32H58N2O22P2(C5H8)n'),
 ('beta-D-mannosyldiacetylchitobiosyldiphosphodolichol',
  'C32H58N2O22P2(C5H8)n'),
 ('chitin', 'H2O(C8H13NO5)n'),
 ('chitosan', 'H2O(C6H12NO4)n'),
 ('dolichol', 'C20H36O(C5H8)n'),
 ('dolichyl D-mannosyl phosphate', 'C26H47O9P(C5H8)n'),
 ('dolichyl phosphate', 'C20H37O4P(C5H8)n'),
 ('dolichyl phosphate', 'C20H37O4P(C5H8)n'),
 ('lipoylprotein', 'C12H18N2O4S2R2(C2H2NOR)n'),
 ("N,N'-diacetylchitobiosyldiphosphodolichol", 'C36H64N2O17P2(C5H8)n'),
 ('pectin', 'H2O(C6H8O6)n'),
 ('(1-3)-beta-D-glucan', '(C6H10O5)n')]
edkerk commented 6 years ago

Thanks @ChristianLieven, will go through these changes. A few remarks:

ChristianLieven commented 6 years ago

Thanks for your comments Ed!

model.id was missing in SBML as it was specified as iYali4.1, with the invalid . character. Similar to yeast-GEM, we'll move away from using model version in filenames.

This would break import capability with Cobrapy as it requires models to have an ID. At least, with the current SBML parser.

The labels that are added to genes in your PR are S. cerevisiae gene identifiers.

I actually didn't change existing labels. I merely used the gene metaids as label wherever a label was missing. When I checked it seemed to me that most metaid values are the same as the actual gene IDs matching the following regex: YALI[0-9][a-z][0-9]*g. Can you please point out which genes/ reactions you are refering to specifically? Is it possible that a handful of genes use the S. cerevisiae identifiers in the base Yali Model that I started with?

Also, as the default routine is to prepare all model files using the exportForGit function in RAVEN, it is crucial that these labels are correctly exported by those functions, as it would otherwise be lost in the next I/O cycle.

I'm not sure I understand you here. I'll take a look at that function, but I think it would help me if you could point out what output you expected to see. As you know I'm using cobrapy to manipulate the model, hence if you think there is information missing now that used to be there, I believe it may be an issue with the cobrapy parser.

Regarding the polymers that you cite, wouldn't it be fair to represent them as monomers, because that is how they are actually included in the model.

I hadn't actually checked that, nor the mass-charge-balance of the reactions yet. My primary goal with this PR was to get the SBML into a shape that allowed I/O with cobrapy. I totally agree and think its very reasonable if that is how they are represented anyways. Ultimately, I left them out since that's also how they were handeld in yeast-GEM, I'm happy to fix those up too though.

edkerk commented 6 years ago

This would break import capability with Cobrapy as it requires models to have an ID. At least, with the current SBML parser.

I didn't mean getting rid of model ID, just getting rid of mentioning of versions in filenames and model IDs, having the additional effect that no illegal characters will be used for model ID. The model.id will be iYali.

I actually didn't change existing labels. I merely used the gene metaids as label wherever a label was missing. When I checked it seemed to me that most metaid values are the same as the actual gene IDs matching the following regex: YALI[0-9][a-z][0-9]*g. Can you please point out which genes/ reactions you are refering to specifically? Is it possible that a handful of genes use the S. cerevisiae identifiers in the base Yali Model that I started with?

You indeed correctly named the new fbc_label, but the existing ones were incorrect. Now I found out that for the majority of the genes the fbc_label incorrectly contained the S. cerevisiae gene name (this is probably an effect of using yeast-GEM as templaten when generating the iYali model):

      <fbc:geneProduct metaid="YALI0A04675g" sboTerm="SBO:0000252" fbc:id="YALI0A04675g" fbc:label="YAL012W"/>

Good that we find out this way!

I'm not sure I understand you here. I'll take a look at that function, but I think it would help me if you could point out what output you expected to see. As you know I'm using cobrapy to manipulate the model, hence if you think there is information missing now that used to be there, I believe it may be an issue with the cobrapy parser.

My point is that the development is done here using RAVEN, and the geneProduct errors that you highlighted are linked to bugs in RAVEN. Fixing the XML in the repository alone will not help next time I take it through an I/O-cycle with RAVEN, if not those bugs are fixed. We're now working on this.

I hadn't actually checked that, nor the mass-charge-balance of the reactions yet. My primary goal with this PR was to get the SBML into a shape that allowed I/O with cobrapy. I totally agree and think its very reasonable if that is how they are represented anyways. Ultimately, I left them out since that's also how they were handeld in yeast-GEM, I'm happy to fix those up too though.

Will take those also into account. I suggest we fix the RAVEN export functions here and I'll make a new PR with those changes and ask you to review to confirm that it is compatible with cobrapy.

ChristianLieven commented 6 years ago

Alright! Again thanks for your detailed response.

I'll close this PR then and wait for you guys to fix up RAVEN.

Let me know if you need help with the gene label situation. Shouldn't be too difficult mapping them from the YALI specific gene identifiers and the genome file and/or Uniprot?