SBRG / bigg_models

The BiGG Models website server
http://bigg.ucsd.edu

Some new issues #157

Open draeger opened 9 years ago

draeger commented 9 years ago

Chemical formula errors:

Gene association errors:

Elements in BiGG whose ids cannot be found in the corresponding model:

Annotation errors:

zakandrewking commented 9 years ago

Some of these have been or are being fixed:

Not fixed:

To help keep everyone up-to-date, I have shared a Dropbox folder with the latest database dumps. There is a link on the BiGG2 README. It's here: https://www.dropbox.com/sh/yayfmcrsrtrcypw/AACDoew92pCYlSJa8vCs5rSMa?dl=0

aebrahim commented 9 years ago

As far as _SBMLDOT goes, this is done by cobrapy because NCBI gene ids are allowed to contain periods and SBML ids are not (and we seem to have standardized on NCBI gene ids). The period is the only disallowed character, so it's the only one that should be getting escaped.

zakandrewking commented 9 years ago

After discussion: we need to find out what characters are allowed in NCBI/RefSeq locus ids. If it is just [a-zA-Z0-9._], then I am fine with supporting DOT as the official way to deal with gene ids that have dots in them when exporting to SBML.
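A minimal sketch of the escaping discussed above, assuming the character set [a-zA-Z0-9._] proposed in this comment. The token spelling `__SBML_DOT__` and the function names are assumptions for illustration; the thread only refers to the token as _SBMLDOT or DOT.

```python
import re

# Assumption: exact spelling of the escape token; pass another one if needed.
DOT_TOKEN = "__SBML_DOT__"

# Character set proposed in the comment above for NCBI/RefSeq locus ids.
ALLOWED_LOCUS_ID = re.compile(r"[a-zA-Z0-9._]+")


def escape_gene_id(gene_id, token=DOT_TOKEN):
    """Return an SBML-safe id by replacing every dot with the token."""
    if not ALLOWED_LOCUS_ID.fullmatch(gene_id):
        raise ValueError(f"unexpected character in gene id: {gene_id!r}")
    return gene_id.replace(".", token)


def unescape_gene_id(sbml_id, token=DOT_TOKEN):
    """Reverse escape_gene_id when reading the SBML file back in."""
    return sbml_id.replace(token, ".")
```

The round trip is only unambiguous if no original gene id already contains the token itself, which is one more reason to document the replacement rule.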


pillmill commented 9 years ago

Looks like: http://identifiers.org/biocyc/S-ADENOSYL-4-METHYLTHIO-2-OXOBUTANOATE

should be: http://identifiers.org/biocyc/ECOLI:S-ADENOSYL-4-METHYLTHIO-2-OXOBUTANOATE

On Tue, Aug 25, 2015 at 1:12 AM, Andreas Dräger notifications@github.com wrote:

  • For reaction 'R_GULN3D' in model 'iAB_RBC_283' the geneProductAssociation in BiGG is 'Cryl1.' whereas in the model it is 'G_CRYL1'. The problem is that the identifier in BiGG is lower case and in the model it is upper case.
  • The next problem with this geneProductAssociation is that it ends with a dot. First, while internal dots usually seem to be replaced with '_SBMLDOT', this needs to be documented somewhere. Second, the trailing dot needs to be dropped here. I wonder if the entry is just wrong in BiGG?
  • The following link out does not work: http://identifiers.org/biocyc/S-ADENOSYL-4-METHYLTHIO-2-OXOBUTANOATE


draeger commented 9 years ago

Thanks for this update!

Square brackets around chemical formulas can be easily removed, so that's not a big problem (once discovered).

R in formulas is actually acceptable and justified, so I would request that it be added to the FBC specification. The other symbols (c, e, p, X, m, hp, %FULLR% as in CO2FULLR, etc.), however, should be looked into.
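The two checks above (removing the square brackets, then spotting lower-case entries such as c, e, p, or hp) could be sketched as follows. This is a syntactic check only, and the function names are hypothetical; it assumes stored values look like `['C6H12O6']`.

```python
import re

# Shape only: capitalized symbols with optional counts, e.g. C6H12O6.
# Assumption: symbols are not checked against the periodic table.
FORMULA_SHAPE = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def strip_brackets(raw):
    """Turn a stored value such as "['C6H12O6']" into "C6H12O6"."""
    return raw.strip().strip("[]").strip("'\"")


def looks_like_formula(raw):
    """True if the cleaned value has the shape of a chemical formula."""
    return FORMULA_SHAPE.fullmatch(strip_brackets(raw)) is not None
```

Because only the shape is tested, R passes as intended, but placeholders written in upper case (X, or FULLR in CO2FULLR) pass as well; catching those would need an explicit list of accepted symbols.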

Further gene ids from geneProductAssociations in BiGG that are not in the corresponding model:

There are some geneProductAssociations that cannot be parsed because they contain 'or or':

SELECT mr.gene_reaction_rule
FROM   model_reaction mr, reaction r, model m
WHERE  r.id = mr.reaction_id AND
       m.id = mr.model_id AND
       mr.gene_reaction_rule IS NOT NULL AND
       mr.gene_reaction_rule like '%or  or%';

GeneProductAssociation (((EcHS_A1008 and EcHS_A3302) and EcHS_A2736) or (EcHS_A1008 and EcHS_A3302) is missing a closing parenthesis and can therefore not be parsed.
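Both kinds of malformed rule mentioned here (a repeated operator and an unclosed parenthesis) can be detected before parsing. The following is a rough sketch, not the parser used by BiGG or ModelPolisher; `check_gene_rule` is a hypothetical helper.

```python
import re

OPERATORS = {"and", "or"}
# A token is a single parenthesis or a run of non-space, non-parenthesis characters.
TOKEN = re.compile(r"[()]|[^\s()]+")


def check_gene_rule(rule):
    """Return a list of problems found in a gene reaction rule string."""
    problems = []
    depth = 0
    previous = None
    for token in TOKEN.findall(rule):
        word = token.lower()
        if word == "(":
            depth += 1
        elif word == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched closing parenthesis")
                depth = 0
        elif word in OPERATORS:
            # An operator needs a gene id or a closed group on its left.
            if previous is None or previous == "(" or previous in OPERATORS:
                problems.append("operator without left operand")
        previous = word
    if depth > 0:
        problems.append("missing closing parenthesis")
    if previous in OPERATORS:
        problems.append("rule ends with an operator")
    return problems
```

For the rule quoted above, `check_gene_rule` reports a missing closing parenthesis; for a fragment such as `(Pde3b) or or (Pde2a)` it reports an operator without a left operand.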

I am also fine with replacing dots with __DOT__ or anything else as long as we document this somewhere. If more special cases are found, please keep me updated on the replacement rules.

About the annotation error, thanks @pillmill for looking up the correct link. The external id should be updated in BiGG.

draeger commented 9 years ago

The following SQL statement shows the lower-case chemical formulas:

SELECT bigg_id, name, TRIM(LTRIM(m.formula, '['''), ''']') AS formula
FROM   metabolite m, component c
WHERE  c.id = m.id AND length(formula) = 1 AND formula ~ '[^[:upper:]]'
ORDER BY 1, 2, 3;

pillmill commented 9 years ago

There don't appear to be any instances of 'or or' in the gene_reaction_rule in the current database. The query returns no rows.

On Tue, Aug 25, 2015 at 9:31 PM, Andreas Dräger notifications@github.com wrote:

Further gene ids from geneProductAssociations in BiGG that are not in the corresponding model:

  • GbraG and G__livF in iJN678
  • G_t and G_Z in iETEC_1333
  • G_S and G_d in iMM1415
  • G_Y and G_W in iSBO_1134

Please note that those genes are used in other models as well. I list them only once here.


draeger commented 9 years ago

@pillmill Thanks! This is great news. I am now switching over to Zak's new dump.

draeger commented 9 years ago

I have now checked all raw models against the latest version of BiGG (thanks to @jslu9 and @zakandrewking). Several issues still need to be fixed on the ModelPolisher side, and I am going through those one by one. Here is what I found in the BiGG database:

Note that gene-product-rules are only taken from BiGG and what is already in the model is replaced. Rationale: BiGG is the knowledgebase that is continuously curated and therefore most reliable.

nel3 commented 9 years ago

http://lewislab.ucsd.edu/

On Tue, Sep 1, 2015 at 12:39 PM, Andreas Dräger notifications@github.com wrote:

I have now checked all raw models against the latest version of BiGG (thanks to @jslu9 and @zakandrewking). Several issues still need to be fixed on the ModelPolisher side, and I am going through those one by one. Here is what I found in the BiGG database:

  • Model iAB_RBC_283 does not have the gene products G_Cryl1 (used in reaction R_GULN3D), G_Prps1l1_AT2 (reaction R_PRPPS), G_S (reaction R_SBTD_D2), G_d_AT1 (reaction R_SBTD_D2).

Check with Aarash to make sure he didn't remove those intentionally (or because of his algorithm)

  • Found chemical formulas for species in lower case (c, e, p, etc.), see above, used in several models.
  • Model iEcHS_1320 uses gene-product association (((EcHS_A1008 and EcHS_A3302) and EcHS_A2736) or (EcHS_A1008 and EcHS_A3302), which cannot be parsed (missing closing parenthesis).
  • Model iECP_1309 makes use of the following gene products that are not in the model: G_t, G_Z (reactions R_DMSOR1pp and R_TMAOR2pp). The same reaction and genes also cause this problem in models iETEC_1333, iSFV_1184, and iSDY_1059.
  • Model iJN678 uses gene products GbraG and G__livF in reaction R_UREAabcpp, but neither is defined in the model.
  • Model iMM1415 contains two gene-product associations that cannot be parsed because of or or: (Pde3a) or (Pde4b) or (Pde8a) or (Pde1b) or (Pde7a) or (Pde4c) or (Pde7a) or (Pde1c) or (Pde4d) or (Pde1a) or (Pde3b) or (Pde8b) or (Pde10a) or (Pde8a) or (Pde3b) or or (Pde2a) or (Pde1a) or (Pde8a) or (Pde11a) or (Pde7b) or (Pde8a) or (Pde8a) and also (Pde3a) or (Pde6c) or or (Pde5a) or or (Pde1a) or or (Pde11a) or (Pde5a) or (Pde10a) or or (Pde6a and Pde6b and Pde6h) or (Pde3b) or or or (Pde1c) or (Pde1a) or (Pde2a) or or (Pde1b) or (Pde5a) or or or or or or or or (Pde3b) or or or (Pde6a and Pde6c and Pde6d and Pde6b and Pde6g) or and (Uap1) or
  • Model iMM1415 also uses gene products G_S and G_d in reaction R_SBTD_D2 without declaring them.
  • Model iMM1415 makes repeated use of identifier MGI for external database MGI, but this is not correct.

Check with ines or neema about this

  • Model iSBO_1134 uses the following gene products without declaration G_t, G_Z, G_Y (reaction R_DMSOR1pp), G_n, G_W (reaction R_NHFRBO), G_t, GY, G_t, G_Z (reaction R_TMAOR2pp)

Are these Jon's models?

  • Model iYO844 has a null identifier for some odd database names; I need to check how this happens. The database names read something like BG10073, CAB11787, and so on, which themselves look like ids.


draeger commented 9 years ago

The following statement yields 22,060 genes whose link out to the Mouse Genome Database (MGI) is the invalid identifier MGI:

SELECT DISTINCT gr.bigg_id AS bigg_id
FROM   data_source d, synonym s, genome_region gr
WHERE  d.id = s.synonym_data_source_id AND
       s.ome_id = gr.id AND
       d.name = 'MGI' AND
       s.synonym = 'MGI'
ORDER BY bigg_id;

The pattern for this database is ^MGI:\d+$. I found all uses of genes with this identifier in model iMM1415.

There are some new external data source names: EnsemblGenomes-Gn, EnsemblGenomes-Tr. What is the difference between those and the already existing EnsemblGenomes?

pillmill commented 9 years ago

For the Mouse Genome Database, the following format appears to work correctly: identifiers.org/mgd/MGI:#

For example, http://identifiers.org/mgd/MGI:97485
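Combining the pattern ^MGI:\d+$ quoted earlier in the thread with the link format above, a link-out could be built only for well-formed identifiers. This is a small sketch; `mgi_link` is a hypothetical helper, not part of BiGG.

```python
import re

# Pattern for Mouse Genome Database identifiers, as quoted in the thread.
MGI_PATTERN = re.compile(r"^MGI:\d+$")
# URL prefix taken from the example link above.
MGD_PREFIX = "http://identifiers.org/mgd/"


def mgi_link(identifier):
    """Return the identifiers.org URL, or None if the id is malformed."""
    if MGI_PATTERN.fullmatch(identifier) is None:
        return None
    return MGD_PREFIX + identifier
```

With this check, the bare synonym MGI found in iMM1415 yields no link instead of a broken one.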

zakandrewking commented 9 years ago

TODO