Annotation file for genes

pecholleyc commented 4 years ago

Description of the issue:

The repo offers associations/annotations for reactions (humanGEMMetAssoc.JSON) and metabolites (humanGEMRxnAssoc.JSON) but there is no such json file for genes. Some gene's external identifiers can be found in the ensembl_ID_mapping.tsv file but this is a different format, and I think the list of Ensembl IDs in that file is not only bounds to the Human-GEM model.

Expected feature/value/output:

It would be useful to have a humanGEMGeneAssoc.JSON file that can be parsed similarly to the other association files.

Current feature/value/output:

The file does not exist.

Reproducing these results:

Yes

I hereby confirm that I have:

[X] Tested my code on my own computer for running the model
[X] Done this analysis in the master branch of the repository
[X] Checked that a similar issue does not exist already

link to #75

haowang-bioinfo commented 4 years ago

Sounds like a valid request.

haowang-bioinfo commented 4 years ago

A new annotation file humanGEMGeneAssoc.JSON could be added with following fields to begin with: Ensembl Gene ID UniProt SwissProt ID
Gene symbol NCBI Entrez ID

JonathanRob commented 4 years ago

I guess we would want to format the gene annotation structure in the same way that we have done for the reaction and metabolite annotation files. Therefore, I suggest we have a genes field that is identical to the genes field in the model structure (just as we had a rxns and mets field in the reaction and metabolite files, respectively). We could still keep the Ensembl Gene ID field, even though it would be the same as the genes field.

Also, to maintain consistency with the other annotation structures and with COBRA, here are my suggestions for the field names: geneEnsemblID (or geneENSGID) geneUniProtID geneNames geneEntrezID

The reason I'm leaning toward geneENSGID instead of geneEnsemblID is that it makes it easier to also add geneENSPID and geneENSTID fields if we want.

JonathanRob commented 3 years ago

This issue was resolved in #204.

pecholleyc commented 3 years ago

Hi, so this issue was reopened for, I guess, 2 main reasons that should be discussed in details:

Provide additional annotations for genes in the repository.
- Metabolic Atlas wants to fetch more annotations directly from the model repositories. what annotations exactly? from which database? (maybe something that is (or planned) for GEMs adopting .standard-GEM)
- Other demands for annotations?
The annotations format has changed and the updated process needs to be changed as well (see #224, #203).
- How the genes.tsv is getting updated when the model change?
- Previously it was only external IDs (and the name) from Ensembl, should (other) information be fetched from other sources? And should annotations of the corresponding Proteins/Enzymes be added?

Also what do you think about having the annotation effort be externalized and put in a Third-party pipeline as mentioned in #224? Please help me to understand @Hao-Chalmers, @mihai-sysbio.

haowang-bioinfo commented 3 years ago

@pecholleyc exactly as you summarised.

@mihai-sysbio The reopening of this issue is mainly due to additional (or removal of) gene-related fields requested by Metabolic Atlas. Once this is settled, the get_ensembl_ID_mapping.py can be adjusted for routine updates.

mihai-sysbio commented 3 years ago

@pecholleyc some more clarifications on point 1: Metabolic Atlas should presently only data that is coming from the source repositories, without any changes. At the moment, the data shown is a superset of the data in the repository. The main immediate focus is to get these data fully aligned. Of course, it it desirable both for Metabolic Atlas but also for the community at large to have the Human-GEM very well annotated.

mihai-sysbio commented 3 years ago

It looks like the new fields added through PR #244 are

name
alternate_name
aliases
function
ec
catalytic_activity

At a quick glance, it doesn't seem like these fields come from get_ensembl_ID_mapping.py.

haowang-bioinfo commented 3 years ago

Now there are 5 existing annotation items for genes: geneENSTID geneENSPID geneUniProtID geneNames geneEntrezID As far as I know, all are required except geneENSTID and geneENSPID, which have never been used anywhere (@JonathanRob let me know if I were wrong). @mihai-sysbio does Met Atlas need these two fields?

@mihai-sysbio among the six fields you listed, are they all required? can they be modified, such as merging alternate_name with aliases? Is the name field identical to the existing geneNames field?

I assume all these info can be extracted from Ensembl through a modified get_ensembl_ID_mapping.py script. What do you think about this implementation? @pecholleyc @mihai-sysbio

JonathanRob commented 3 years ago

The column names in the genes.tsv annotation file were chosen to be consistent with naming of model fields used by COBRA (i.e., gene related fields start with gene). So if possible, I would recommend using this format for column names, but I suppose it's not a strict requirement.

Regardless, if new columns are added to any of the annotation files, then the annotateModel function will need to be updated to map that column to the identifiers.org prefix (unless they don't need to be mapped to identifiers.org). Specifically, lines 68-85 of annotateModel:

id2miriam = {%reactions
             'rxnKEGGID'        'kegg.reaction'
             'rxnBiGGID'        'bigg.reaction'
             'rxnREACTOMEID'    'reactome'
             'rxnMetaNetXID'    'metanetx.reaction'
             'rxnTCDBID'        'tcdb'
             % metabolites
             'metBiGGID'        'bigg.metabolite'
             'metKEGGID'        'kegg.compound'
             'metHMDBID'        'hmdb'
             'metChEBIID'       'chebi'
             'metPubChemID'     'pubchem.compound'
             'metLipidMapsID'   'lipidmaps'
             'metMetaNetXID'    'metanetx.chemical'
             % genes
             'geneNames'        'hgnc.symbol'
             'geneEnsemblID'    'ensembl'
             'geneEntrezID'     'ncbigene'
             'geneUniProtID'    'uniprot'};

@Hao-Chalmers No, I haven't really used the geneENSTID or geneENSPID fields myself, but if they're not causing problems, then I would just keep them.

As @Hao-Chalmers notes, all of these fields can be regenerated from a modified get_ensembl_ID_mapping.py script, since they are not manually curated like some of the reaction and metabolite fields.

Of the new columns added, the last 3 (function, ec, and catalytic_activity) are completely empty, and therefore should probably be removed. As @Hao-Chalmers notes, the name column is redundant with the geneNames column, but outdated (e.g., MARCH2 is now MARCHF2 - looking at you, Microsoft Excel), so I would remove that as well. Technically, these are gene symbols, not gene names (though this is often dependent on source), so this is something we should probably address at some point; i.e., we should probably rename the geneNames column to geneSymbols.

This leaves the alternate_name and aliases columns, which I think could be nice to add. These fields are also outdated, and therefore should be re-retrieved before merging. Also, the alternate_name column is actually just the gene names (not symbols), and could probably be re-named as geneNames. Finally, the aliases column could be renamed to geneAliases for clarity.

So to summarize, it would look something like this:

genes	geneENSTID	geneENSPID	geneUniProtID	geneSymbols	geneEntrezID	geneNames	geneAliases
ENSG00000000419	ENST00000371582;...	ENSP00000394921;…	O60762	DPM1	8813	dolichyl-phosphate mannosyltransferase subunit 1, catalytic	CDGIE; …
ENSG00000000938	ENST00000399173;…	ENSP00000382126;…	P09769	FGR	2268	FGR proto-oncogene, Src family tyrosine kinase	c-fgr; …
ENSG00000001036	ENST00000451668;…	ENSP00000002165;…	Q9BTY2	FUCA2	2519	alpha-L-fucosidase 2	dJ20N2.5; …
ENSG00000001084	ENST00000509541;…	ENSP00000421908;…	P48506	GCLC	2729	glutamate-cysteine ligase catalytic subunit	GCS; …
ENSG00000001630	ENST00000003100;…	ENSP00000406757;…	Q16850	CYP51A1	1595	cytochrome P450 family 51 subfamily A member 1	CP51; …
ENSG00000002549	ENST00000618908;…	ENSP00000226299;…	P28838	LAP3	51056	leucine aminopeptidase 3	LAP; …

pecholleyc commented 3 years ago

I think focusing on adding only these 2 new columns is a good decision and I can modify the get_ensembl_ID_mapping.py file for that purpose. I have some comments regarding what has been decided/discussed:

1) altername_name is indeed the full name of the gene, but it corresponds to the description field in the DB and on the Ensembl gene page. geneNames (or geneName :p) is fine in my opinion. I want to highlight that the content of this column is truncated, the full string contains also the source, under bracket. It is visible on the Ensembl gene page. I think it is fine to remove that part.

2) To me, the empty columns function, ec, and catalytic_activity correspond to enzyme/protein annotations. Adding these annotations was in concordance with the fact the gene page of MA was initially called the enzyme page. If MA is supposed to present only the content of the integrated model, then I would avoid adding them for now (even if none empty). If added in the model, I think it would also require a more in-depth-verification; for instance I would expect the gene/enzyme EC codes and catalytic activity description to (partially) match the EC codes and the used metabolites of reactions having these genes in the GPR.

3) Regarding the names of the columns, technically the column headers describe the content of a cell in the columns. Since genes (geneID?), geneNames and geneSymbols always contain only one value, I would remove the plural. It is not really important though.

4) I was wondering if the intermediate file ensembl_ID_mapping.tsv could be removed by generating genes.tsv directly by the get_ensembl_ID_mapping.py script. Something to think about, I would not do in that PR.

How do you want to process, should the change be made in the same PR or a new branch/PR? I'm asking this because it is not clear to me if contributing to genes annotations by changing the genes.tsv is the correct way and should be encouraged. If I understand the process correctly, whatever information directly added to that file (from direct contribution like in the current PR), is likely to be erased when that file is regenerated from Ensembl. If true, maybe it should be specified somewhere.

mihai-sysbio commented 3 years ago

Based on all the feedback, it looks like the work that needs to be done is to:

delete and rename some columns in PR #244
update get_ensembl_ID_mapping.py, possibly calling it from a GH Action, possibly also removing ensembl_ID_mapping.tsv

These could be done in one go, or as separate PRs. In this case, what matters most to me is time, ie how much time it will take for the improved genes.tsv to show in develop, which is why I am in favour of separate PRs. In the meantime, I've marked #244 as wip.

mihai-sysbio commented 3 years ago

@JonathanRob's feedback should now be applied to PR #244. I think this also in line with @pecholleyc's points 1 and 2 @pecholleyc's point 3 is not applied yet, even though it totally makes sense to me. Any thoughts?

if contributing to genes annotations by changing the genes.tsv is the correct way

Maybe it's acceptable to change genes.tsv manually once to set it up and see how the output of the script would diff against this?

JonathanRob commented 3 years ago

Thanks for the quick fix, @mihai-sysbio.

@pecholleyc, I've responded to some of your suggestions/concerns:

Since genes (geneID?), geneNames and geneSymbols always contain only one value, I would remove the plural.

Yep, I completely agree. However, since COBRA and RAVEN use the plural form (genes and geneNames), I suggest we keep with that standard. Though since gene symbols aren't really a commonly used field, then we could of course choose to make it singular (geneSymbol). I don't have a strong preference here.

I was wondering if the intermediate file ensembl_ID_mapping.tsv could be removed by generating genes.tsv directly by the get_ensembl_ID_mapping.py script.

Yes, I fully support this. That ensembl_ID_mapping.tsv file has essentially been made redundant by the creation of the genes.tsv file, so it should be removed. I will update the translateGrRules function which uses that ensembl_ID_mapping.tsv to instead use genes.tsv file. We should also confirm that no other function depends on that file.

If I understand the process correctly, whatever information directly added to [genes.tsv] (from direct contribution like in the current PR), is likely to be erased when that file is regenerated from Ensembl. If true, maybe it should be specified somewhere.

Yep, we should note somewhere that the genes.tsv file should not be modified directly, since it will be overwritten when updated. And I think we can implement these changes following @mihai-sysbio's suggestion to first change the file manually in #244 (a quicker update), and then in a later PR update the get_ensembl_ID_mapping.py script to generate the genes.tsv file directly.

SysBioChalmers / Human-GEM