Closed pecholleyc closed 3 years ago
Sounds like a valid request.
A new annotation file humanGEMGeneAssoc.JSON
could be added with following fields to begin with:
Ensembl Gene ID
UniProt SwissProt ID
Gene symbol
NCBI Entrez ID
I guess we would want to format the gene annotation structure in the same way that we have done for the reaction and metabolite annotation files. Therefore, I suggest we have a genes
field that is identical to the genes
field in the model structure (just as we had a rxns
and mets
field in the reaction and metabolite files, respectively). We could still keep the Ensembl Gene ID
field, even though it would be the same as the genes
field.
Also, to maintain consistency with the other annotation structures and with COBRA, here are my suggestions for the field names:
geneEnsemblID
(or geneENSGID
)
geneUniProtID
geneNames
geneEntrezID
The reason I'm leaning toward geneENSGID
instead of geneEnsemblID
is that it makes it easier to also add geneENSPID
and geneENSTID
fields if we want.
This issue was resolved in #204.
Hi, so this issue was reopened for, I guess, 2 main reasons that should be discussed in details:
Also what do you think about having the annotation effort be externalized and put in a Third-party pipeline as mentioned in #224? Please help me to understand @Hao-Chalmers, @mihai-sysbio.
@pecholleyc exactly as you summarised.
@mihai-sysbio The reopening of this issue is mainly due to additional (or removal of) gene-related fields requested by Metabolic Atlas
. Once this is settled, the get_ensembl_ID_mapping.py can be adjusted for routine updates.
@pecholleyc some more clarifications on point 1: Metabolic Atlas should presently only data that is coming from the source repositories, without any changes. At the moment, the data shown is a superset of the data in the repository. The main immediate focus is to get these data fully aligned. Of course, it it desirable both for Metabolic Atlas but also for the community at large to have the Human-GEM very well annotated.
It looks like the new fields added through PR #244 are
name
alternate_name
aliases
function
ec
catalytic_activity
At a quick glance, it doesn't seem like these fields come from get_ensembl_ID_mapping.py.
Now there are 5 existing annotation items for genes:
geneENSTID
geneENSPID
geneUniProtID
geneNames
geneEntrezID
As far as I know, all are required except geneENSTID
and geneENSPID
, which have never been used anywhere (@JonathanRob let me know if I were wrong). @mihai-sysbio does Met Atlas need these two fields?
@mihai-sysbio among the six fields you listed, are they all required? can they be modified, such as merging alternate_name
with aliases
? Is the name
field identical to the existing geneNames
field?
I assume all these info can be extracted from Ensembl through a modified get_ensembl_ID_mapping.py script. What do you think about this implementation? @pecholleyc @mihai-sysbio
The column names in the genes.tsv
annotation file were chosen to be consistent with naming of model fields used by COBRA (i.e., gene related fields start with gene
). So if possible, I would recommend using this format for column names, but I suppose it's not a strict requirement.
Regardless, if new columns are added to any of the annotation files, then the annotateModel
function will need to be updated to map that column to the identifiers.org prefix (unless they don't need to be mapped to identifiers.org). Specifically, lines 68-85 of annotateModel
:
id2miriam = {%reactions
'rxnKEGGID' 'kegg.reaction'
'rxnBiGGID' 'bigg.reaction'
'rxnREACTOMEID' 'reactome'
'rxnMetaNetXID' 'metanetx.reaction'
'rxnTCDBID' 'tcdb'
% metabolites
'metBiGGID' 'bigg.metabolite'
'metKEGGID' 'kegg.compound'
'metHMDBID' 'hmdb'
'metChEBIID' 'chebi'
'metPubChemID' 'pubchem.compound'
'metLipidMapsID' 'lipidmaps'
'metMetaNetXID' 'metanetx.chemical'
% genes
'geneNames' 'hgnc.symbol'
'geneEnsemblID' 'ensembl'
'geneEntrezID' 'ncbigene'
'geneUniProtID' 'uniprot'};
@Hao-Chalmers No, I haven't really used the geneENSTID
or geneENSPID
fields myself, but if they're not causing problems, then I would just keep them.
As @Hao-Chalmers notes, all of these fields can be regenerated from a modified get_ensembl_ID_mapping.py script, since they are not manually curated like some of the reaction and metabolite fields.
Of the new columns added, the last 3 (function
, ec
, and catalytic_activity
) are completely empty, and therefore should probably be removed. As @Hao-Chalmers notes, the name
column is redundant with the geneNames
column, but outdated (e.g., MARCH2 is now MARCHF2 - looking at you, Microsoft Excel), so I would remove that as well. Technically, these are gene symbols, not gene names (though this is often dependent on source), so this is something we should probably address at some point; i.e., we should probably rename the geneNames
column to geneSymbols
.
This leaves the alternate_name
and aliases
columns, which I think could be nice to add. These fields are also outdated, and therefore should be re-retrieved before merging. Also, the alternate_name
column is actually just the gene names (not symbols), and could probably be re-named as geneNames
. Finally, the aliases
column could be renamed to geneAliases
for clarity.
So to summarize, it would look something like this:
genes | geneENSTID | geneENSPID | geneUniProtID | geneSymbols | geneEntrezID | geneNames | geneAliases |
---|---|---|---|---|---|---|---|
ENSG00000000419 | ENST00000371582;... | ENSP00000394921;… | O60762 | DPM1 | 8813 | dolichyl-phosphate mannosyltransferase subunit 1, catalytic | CDGIE; … |
ENSG00000000938 | ENST00000399173;… | ENSP00000382126;… | P09769 | FGR | 2268 | FGR proto-oncogene, Src family tyrosine kinase | c-fgr; … |
ENSG00000001036 | ENST00000451668;… | ENSP00000002165;… | Q9BTY2 | FUCA2 | 2519 | alpha-L-fucosidase 2 | dJ20N2.5; … |
ENSG00000001084 | ENST00000509541;… | ENSP00000421908;… | P48506 | GCLC | 2729 | glutamate-cysteine ligase catalytic subunit | GCS; … |
ENSG00000001630 | ENST00000003100;… | ENSP00000406757;… | Q16850 | CYP51A1 | 1595 | cytochrome P450 family 51 subfamily A member 1 | CP51; … |
ENSG00000002549 | ENST00000618908;… | ENSP00000226299;… | P28838 | LAP3 | 51056 | leucine aminopeptidase 3 | LAP; … |
I think focusing on adding only these 2 new columns is a good decision and I can modify the get_ensembl_ID_mapping.py
file for that purpose. I have some comments regarding what has been decided/discussed:
1) altername_name
is indeed the full name of the gene, but it corresponds to the description field in the DB and on the Ensembl gene page. geneNames
(or geneName
:p) is fine in my opinion. I want to highlight that the content of this column is truncated, the full string contains also the source, under bracket. It is visible on the Ensembl gene page. I think it is fine to remove that part.
2) To me, the empty columns function
, ec
, and catalytic_activity
correspond to enzyme/protein annotations. Adding these annotations was in concordance with the fact the gene page of MA was initially called the enzyme page. If MA is supposed to present only the content of the integrated model, then I would avoid adding them for now (even if none empty). If added in the model, I think it would also require a more in-depth-verification; for instance I would expect the gene/enzyme EC codes and catalytic activity description to (partially) match the EC codes and the used metabolites of reactions having these genes in the GPR.
3) Regarding the names of the columns, technically the column headers describe the content of a cell in the columns. Since genes
(geneID?), geneNames
and geneSymbols
always contain only one value, I would remove the plural. It is not really important though.
4) I was wondering if the intermediate file ensembl_ID_mapping.tsv could be removed by generating genes.tsv
directly by the get_ensembl_ID_mapping.py
script. Something to think about, I would not do in that PR.
How do you want to process, should the change be made in the same PR or a new branch/PR? I'm asking this because it is not clear to me if contributing to genes annotations by changing the genes.tsv
is the correct way and should be encouraged. If I understand the process correctly, whatever information directly added to that file (from direct contribution like in the current PR), is likely to be erased when that file is regenerated from Ensembl. If true, maybe it should be specified somewhere.
Based on all the feedback, it looks like the work that needs to be done is to:
get_ensembl_ID_mapping.py
, possibly calling it from a GH Action, possibly also removing ensembl_ID_mapping.tsv
These could be done in one go, or as separate PRs. In this case, what matters most to me is time, ie how much time it will take for the improved genes.tsv
to show in develop
, which is why I am in favour of separate PRs. In the meantime, I've marked #244 as wip
.
@JonathanRob's feedback should now be applied to PR #244. I think this also in line with @pecholleyc's points 1 and 2 @pecholleyc's point 3 is not applied yet, even though it totally makes sense to me. Any thoughts?
if contributing to genes annotations by changing the
genes.tsv
is the correct way
Maybe it's acceptable to change genes.tsv
manually once to set it up and see how the output of the script would diff against this?
Thanks for the quick fix, @mihai-sysbio.
@pecholleyc, I've responded to some of your suggestions/concerns:
Since
genes
(geneID?),geneNames
andgeneSymbols
always contain only one value, I would remove the plural.
Yep, I completely agree. However, since COBRA and RAVEN use the plural form (genes
and geneNames
), I suggest we keep with that standard. Though since gene symbols aren't really a commonly used field, then we could of course choose to make it singular (geneSymbol
). I don't have a strong preference here.
I was wondering if the intermediate file ensembl_ID_mapping.tsv could be removed by generating genes.tsv directly by the get_ensembl_ID_mapping.py script.
Yes, I fully support this. That ensembl_ID_mapping.tsv
file has essentially been made redundant by the creation of the genes.tsv
file, so it should be removed. I will update the translateGrRules
function which uses that ensembl_ID_mapping.tsv
to instead use genes.tsv
file. We should also confirm that no other function depends on that file.
If I understand the process correctly, whatever information directly added to [
genes.tsv
] (from direct contribution like in the current PR), is likely to be erased when that file is regenerated from Ensembl. If true, maybe it should be specified somewhere.
Yep, we should note somewhere that the genes.tsv
file should not be modified directly, since it will be overwritten when updated. And I think we can implement these changes following @mihai-sysbio's suggestion to first change the file manually in #244 (a quicker update), and then in a later PR update the get_ensembl_ID_mapping.py
script to generate the genes.tsv
file directly.
Description of the issue:
The repo offers associations/annotations for reactions (
humanGEMMetAssoc.JSON
) and metabolites (humanGEMRxnAssoc.JSON
) but there is no such json file for genes. Some gene's external identifiers can be found in theensembl_ID_mapping.tsv
file but this is a different format, and I think the list of Ensembl IDs in that file is not only bounds to the Human-GEM model.Expected feature/value/output:
It would be useful to have a
humanGEMGeneAssoc.JSON
file that can be parsed similarly to the other association files.Current feature/value/output:
The file does not exist.
Reproducing these results:
Yes
I hereby confirm that I have:
master
branch of the repositorylink to #75