Open draeger opened 9 years ago
Some of these have been or are being fixed:
- [ and ]. We need to test before we deploy, because the change was paired with a number of other database fixes.
- Cryl1. is now CRYL.

Not fixed:
- G_Prps1_AT1__SBML_DOT__1, G_S, G_d_AT1, G_n, and G_V should be looked into

To help keep everyone up-to-date, I have shared a Dropbox folder with the latest database dumps. There is a link on the BiGG2 README. It's here: https://www.dropbox.com/sh/yayfmcrsrtrcypw/AACDoew92pCYlSJa8vCs5rSMa?dl=0
As far as __SBML_DOT__ goes, this is done by cobrapy because NCBI gene ids are allowed to have periods, and SBML ids are not (and we seem to have standardized on NCBI gene ids). This is the only non-allowed character, so it's the only one which should be getting escaped.
After discussion: we need to find out what characters are allowed in NCBI/RefSeq locus ids. If it is just [a-zA-Z0-9._], then I am fine with supporting DOT as the official way to deal with gene ids that have dots in them when exporting to SBML.
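A minimal Python sketch of the escaping rule discussed above, assuming gene ids really only contain [a-zA-Z0-9._]. The helper names and the handling of the G_ prefix are illustrative assumptions, not cobrapy's actual API:

```python
import re

# Assumption from the discussion: locus ids consist only of these characters.
ALLOWED_GENE_ID = re.compile(r"[a-zA-Z0-9._]+")
SBML_DOT = "__SBML_DOT__"


def escape_gene_id(gene_id, prefix="G_"):
    """Build an SBML-safe id by replacing dots (hypothetical helper)."""
    if not ALLOWED_GENE_ID.fullmatch(gene_id):
        raise ValueError(f"unexpected character in gene id: {gene_id!r}")
    return prefix + gene_id.replace(".", SBML_DOT)


def unescape_gene_id(sbml_id, prefix="G_"):
    """Reverse escape_gene_id: drop the prefix and restore the dots."""
    if sbml_id.startswith(prefix):
        sbml_id = sbml_id[len(prefix):]
    return sbml_id.replace(SBML_DOT, ".")
```

With these helpers, escape_gene_id("Prps1_AT1.1") gives G_Prps1_AT1__SBML_DOT__1, the form seen in the exported models. The round trip is only unambiguous as long as no gene id already contains the literal string __SBML_DOT__.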
Looks like: http://identifiers.org/biocyc/S-ADENOSYL-4-METHYLTHIO-2-OXOBUTANOATE
should be: http://identifiers.org/biocyc/ECOLI:S-ADENOSYL-4-METHYLTHIO-2-OXOBUTANOATE
On Tue, Aug 25, 2015 at 1:12 AM, Andreas Dräger notifications@github.com wrote:
- For reaction 'R_GULN3D' in model 'iAB_RBC_283' the geneProductAssociation in BiGG is 'Cryl1.' whereas in the model it is 'G_CRYL1'. The problem is that the identifier in BiGG is lower case and in the model it is upper case.
- The next problem with this geneProductAssociation is that it ends with a dot. While internal dots seem to be usually replaced with '__SBML_DOT__', this needs to be documented somewhere. Second, the ending dot needs to be dropped here. I wonder if the entry is just wrong in BiGG?
- The following link out does not work: http://identifiers.org/biocyc/S-ADENOSYL-4-METHYLTHIO-2-OXOBUTANOATE
Thanks for this update!
Square brackets around chemical formulas can be easily removed, so that's not a big problem (once discovered).
R in formulas is actually also acceptable and justified. I would therefore request this to be added to the specification of FBC. The other elements (c, e, p, X, %FULLR%, e.g. in CO2FULLR, m, hp, etc.) however, should be looked into.
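One way to flag such symbols is to tokenise each formula and compare the tokens against the periodic table. The sketch below is a rough illustration only: the function name and the choice to accept R as a placeholder are assumptions, not part of BiGG or ModelPolisher.

```python
import re

# Symbols of the periodic table (elements 1 to 118).
ELEMENTS = frozenset("""
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu
Zn Ga Ge As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe Cs
Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir Pt Au Hg Tl
Pb Bi Po At Rn Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf Db Sg Bh
Hs Mt Ds Rg Cn Nh Fl Mc Lv Ts Og
""".split())

# One capital letter, an optional lower-case letter, an optional count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def unknown_symbols(formula, allowed_extra=frozenset({"R"})):
    """Return the symbols in a formula that are neither elements nor allowed extras."""
    unknown = []
    pos = 0
    while pos < len(formula):
        match = TOKEN.match(formula, pos)
        if match is None:
            # Anything that is not an element-like token (e.g. lower-case 'c').
            unknown.append(formula[pos])
            pos += 1
            continue
        symbol = match.group(1)
        if symbol not in ELEMENTS and symbol not in allowed_extra:
            unknown.append(symbol)
        pos = match.end()
    return unknown
```

An empty list means every symbol was recognised. Formulas such as C15H27N2O9PRS pass once R is allowed, while XH2, c, or anything containing FULLR are reported.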
Further gene ids from geneProductAssociations in BiGG that are not in the corresponding model:
- G_braG_ and G__livF in iJN678
- G_t and G_Z in iETEC_1333
- G_S and G_d in iMM1415
- G_Y and G_W in iSBO_1134

Please note that those genes are also used in other models as well. I list them only once here.

There are some geneProductAssociations which cannot be parsed because they contain "or or":
SELECT mr.gene_reaction_rule
FROM model_reaction mr, reaction r, model m
WHERE r.id = mr.reaction_id AND
m.id = mr.model_id AND
mr.gene_reaction_rule IS NOT NULL AND
mr.gene_reaction_rule like '%or or%';
GeneProductAssociation (((EcHS_A1008 and EcHS_A3302) and EcHS_A2736) or (EcHS_A1008 and EcHS_A3302) is missing a closing parenthesis and can therefore not be parsed.
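Both kinds of problem (repeated operators and unbalanced parentheses) can be detected before handing a rule to a parser. The following is a small sketch of such a pre-check; the function name and the messages are made up for illustration and this is not how ModelPolisher does it:

```python
import re

OPERATORS = {"and", "or"}


def find_rule_problems(rule):
    """Return a list of syntax problems found in a gene reaction rule."""
    problems = []

    # Track nesting depth to spot unbalanced parentheses.
    depth = 0
    for char in rule:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without opening one")
                depth = 0
    if depth > 0:
        problems.append(f"{depth} unclosed parenthesis")

    # Split into parentheses and words, then look at neighbouring operators.
    tokens = [t.lower() for t in re.findall(r"[()]|[^\s()]+", rule)]
    for left, right in zip(tokens, tokens[1:]):
        if left in OPERATORS and right in OPERATORS:
            problems.append(f"consecutive operators: '{left} {right}'")
            break
    if tokens and tokens[-1] in OPERATORS:
        problems.append("rule ends with an operator")
    return problems
```

For the iEcHS_1320 rule above this reports one unclosed parenthesis, and a rule like "(Pde3b) or or (Pde2a)" is reported for its consecutive operators. A well-formed rule yields an empty list.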
I am also fine with replacing dots with __DOT__ or anything else as long as we document this somewhere. If more special cases are found, please keep me updated on the replacement rules.
About the annotation error, thanks @pillmill for looking up the correct link. The external id should be updated in BiGG.
The following SQL statement shows the lower-case chemical formulas:
SELECT bigg_id, name, TRIM(LTRIM(m.formula, '['''), ''']') AS formula
FROM metabolite m, component c
WHERE c.id = m.id AND length(formula) = 1 AND formula ~ '[^[:upper:]]'
ORDER BY 1, 2, 3;
There don't appear to be any instances of 'or or' in the gene_reaction_rule in the current database. The query produces a null result.
@pillmill Thanks! This is great news. I am now switching over to Zak's new dump.
I have now checked all raw models with the latest version of BiGG (thanks to @jslu9 and @zakandrewking). Several issues still need to be fixed on the ModelPolisher side and I am going through those one by one. Here is what I found in the BiGG database:
- Model iAB_RBC_283 does not have the gene products G_Cryl1 (used in reaction R_GULN3D), G_Prps1l1_AT2 (reaction R_PRPPS), G_S (reaction R_SBTD_D2), G_d_AT1 (reaction R_SBTD_D2).
- Found chemical formulas for species in lower case (c, e, p etc.), see above, used in several models.
- Model iEcHS_1320 uses gene-product association (((EcHS_A1008 and EcHS_A3302) and EcHS_A2736) or (EcHS_A1008 and EcHS_A3302), which cannot be parsed (missing closing parenthesis).
- Model iECP_1309 makes use of the following gene products that are not in the model: G_t, G_Z (reactions R_DMSOR1pp and R_TMAOR2pp). The same reaction and genes also cause this problem in models iETEC_1333, iSFV_1184, and iSDY_1059.
- Model iJN678 uses gene products G_braG_ and G__livF in reaction R_UREAabcpp, but both are not defined in the model.
- Model iMM1415 contains two gene-product associations that cannot be parsed because of "or or":
  (Pde3a) or (Pde4b) or (Pde8a) or (Pde1b) or (Pde7a) or (Pde4c) or (Pde7a) or (Pde1c) or (Pde4d) or (Pde1a) or (Pde3b) or (Pde8b) or (Pde10a) or (Pde8a) or (Pde3b) or or (Pde2a) or (Pde1a) or (Pde8a) or (Pde11a) or (Pde7b) or (Pde8a) or (Pde8a)
  and
  (Pde3a) or (Pde6c) or or (Pde5a) or or (Pde1a) or or (Pde11a) or (Pde5a) or (Pde10a) or or (Pde6a and Pde6b and Pde6h) or (Pde3b) or or or (Pde1c) or (Pde1a) or (Pde2a) or or (Pde1b) or (Pde5a) or or or or or or or or (Pde3b) or or or (Pde6a and Pde6c and Pde6d and Pde6b and Pde6g) or
  and also
  (Uap1) or
- Model iMM1415 also uses gene products G_S and G_d in reaction R_SBTD_D2 without their declaration.
- Model iMM1415 makes repeated use of identifier MGI for external database MGI, but this is not correct.
- Model iSBO_1134 uses the following gene products without declaration: G_t, G_Z, G_Y (reaction R_DMSOR1pp), G_n, G_W (reaction R_NHFRBO), G_t, GY, G_t, G_Z (reaction R_TMAOR2pp)
- There is a null identifier for weird databases in model iYO844, where I need to check how this happens. Databases read something like BG10073, CAB11787, and so on. These themselves look like ids.

Note that gene-product-rules are only taken from BiGG and what is already in the model is replaced. Rationale: BiGG is the knowledgebase that is continuously curated and therefore most reliable.
On Tue, Sep 1, 2015 at 12:39 PM, Andreas Dräger notifications@github.com wrote:
I checked now all raw models with the latest version of BiGG (thanks to @jslu9 and @zakandrewking). Several issues still need to be fixed on the ModelPolisher side and I am going through those one by one. Here is now what I found in BiGG database:
- Model iAB_RBC_283 does not have the gene products G_Cryl1 (used in reaction R_GULN3D), G_Prps1l1_AT2 (reaction R_PRPPS), G_S (reaction R_SBTD_D2), G_d_AT1 (reaction R_SBTD_D2).
Check with Aarash to make sure he didn't remove those intentionally (or because of his algorithm)
- Found chemical formulas for species in lower case (c, e, p etc.), see above, used in several models.
- Model iEcHS_1320 uses gene-product association (((EcHS_A1008 and EcHS_A3302) and EcHS_A2736) or (EcHS_A1008 and EcHS_A3302), which cannot be parsed (missing closing parenthesis).
- Model iECP_1309 makes use of the following gene products that are not in the model: G_t, G_Z (reactions R_DMSOR1pp and R_TMAOR2pp). The same reaction and genes also cause this problem in models iETEC_1333, iSFV_1184, and iSDY_1059.
- Model iJN678 uses gene products G_braG_ and G__livF in reaction R_UREAabcpp, but both are not defined in the model.
- Model iMM1415 contains two gene-product associations that cannot be parsed because of or or: (Pde3a) or (Pde4b) or (Pde8a) or (Pde1b) or (Pde7a) or (Pde4c) or (Pde7a) or (Pde1c) or (Pde4d) or (Pde1a) or (Pde3b) or (Pde8b) or (Pde10a) or (Pde8a) or (Pde3b) or or (Pde2a) or (Pde1a) or (Pde8a) or (Pde11a) or (Pde7b) or (Pde8a) or (Pde8a) and also (Pde3a) or (Pde6c) or or (Pde5a) or or (Pde1a) or or (Pde11a) or (Pde5a) or (Pde10a) or or (Pde6a and Pde6b and Pde6h) or (Pde3b) or or or (Pde1c) or (Pde1a) or (Pde2a) or or (Pde1b) or (Pde5a) or or or or or or or or (Pde3b) or or or (Pde6a and Pde6c and Pde6d and Pde6b and Pde6g) or and (Uap1) or
- Model iMM1415 also uses gene products G_S and G_d in reaction R_SBTD_D2 without their declaration.
- Model iMM1415 makes repeated use of identifier MGI for external database MGI, but this is not correct.
Check with ines or neema about this
- Model iSBO_1134 uses the following gene products without declaration G_t, G_Z, G_Y (reaction R_DMSOR1pp), G_n, G_W (reaction R_NHFRBO), G_t, GY, G_t, G_Z (reaction R_TMAOR2pp)
Are these Jon's models?
- There is a null identifier for weird databases in model iYO844, where I need to check how this happens. Databases read something like BG10073, CAB11787, and so on. These look themselves like ids.
Note that gene-product-rules are only taken from BiGG and what is already in the model is replaced. Rationale: BiGG is the knowledgebase that is continuously curated and therefore most reliable.
The following statement yields 22,060 genes whose link out to the Mouse Genome Database (MGI) is the invalid identifier MGI:
SELECT DISTINCT gr.bigg_id AS bigg_id
FROM data_source d, synonym s, genome_region gr
WHERE d.id = s.synonym_data_source_id AND
s.ome_id = gr.id AND
d.name = 'MGI' AND
s.synonym = 'MGI'
ORDER BY bigg_id;
The pattern for this database is ^MGI:\d+$. I found all uses of genes with this identifier in model iMM1415.
There are some new external data source names: EnsemblGenomes-Gn, EnsemblGenomes-Tr. What is the difference between those and the already existing EnsemblGenomes?
For the Mouse Genome Database, the following format appears to work correctly: identifiers.org/mgd/MGI:#
For example, http://identifiers.org/mgd/MGI:97485
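As a quick illustration, the pattern and URL format mentioned above could be combined like this. The function name is hypothetical; only the regular expression and the identifiers.org/mgd/ prefix come from this thread:

```python
import re

# Pattern for MGI identifiers as quoted in this thread.
MGI_PATTERN = re.compile(r"^MGI:\d+$")


def mgi_link(identifier):
    """Build an identifiers.org link for a valid MGI id, else raise ValueError."""
    if not MGI_PATTERN.fullmatch(identifier):
        raise ValueError(f"not a valid MGI identifier: {identifier!r}")
    return "http://identifiers.org/mgd/" + identifier
```

mgi_link("MGI:97485") returns the example URL above, whereas the bare string MGI (as currently stored for the iMM1415 genes) is rejected.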
['
and the suffix ']
TODO
Chemical formula errors:
- Chemical formulas start with the prefix [' and end with the suffix ']. I have trimmed those strings, but maybe these have some relevance? Should those be removed in BiGG?
- C15H27N2O9PRS, C23H41N2O9PRS, C25H45N2O9PRS, C27H49N2O9PRS, C29H53N2O9PRS, C23H43N2O9PRS, C21H39N2O9PRS, C3H6NOR and many more contain R, which is not in the periodic table. Assuming it means "rest" or "residue", should it be acceptable?
- XH, XH2, XC16H30O1, etc. are not valid chemical formulas because X is not in the periodic table.
- c, e, p are not in the periodic table and it is unclear why these are chemical formulas in models iAPECO1_1312, iEC55989_1330, iEC042_1314, and many more.
- C11H23O4NPFULLRCO2 in model iB21_1397
- C10H21O5NPFULLR, C11H23O4NPFULLRCO2, C11H23O4NPFULLRCO2 in model iAT_PLT_636

Gene association errors:
- For reaction R_GULN3D in model 'iAB_RBC_283' the geneProductAssociation in BiGG is Cryl1. whereas in the model it is G_CRYL1. The problem is that the identifier in BiGG is lower case and in the model it is upper case.
- The next problem with this geneProductAssociation is that it ends with a dot. While internal dots seem to be usually replaced with __SBML_DOT__, this needs to be documented somewhere. Second, the ending dot needs to be dropped here. I wonder if the entry is just wrong in BiGG?

Elements in BiGG whose ids cannot be found in the corresponding model:
- G_Cryl1 (see above), G_Prps1_AT1__SBML_DOT__1, G_S and G_d_AT1 in iAB_RBC_283
- G_n, G_V in model ic_1306

These unknown elements are stored in the tables in BiGG that represent geneProductAssociations, but in the corresponding model these elements are called differently.

Annotation errors: