The conflict gene-protein associations

haowang-bioinfo commented 6 years ago

This issue is reported by @IVANDOMENZAIN.

Description of the issue:

During converting from humanGEM (v 0.1.0) to enzyme-constraint human model, some non-unique gene-protein associations were found and caused conflicts in the GECKO conversion process.

Expected feature/value/output:

The mapping between multiple genes to a single protein (not enzyme complex) doesn't make sense, even though mapping between multiple proteins to a single gene is possible. It is expected to figure out these non-unique associations. And then a protein field (maybe other relevant fields) would be good to have and with the correct and unique association between genes (Ensembl id) and proteins (UniProt id).

Current feature/value/output:

A Matlab file gene_protein_conflicts.mat was provided to demonstrate the current status. It has two cell arrays: conflicts_type1 contains all the UniProt ids that were mapped to more than one gene in the model (44 cases). -> These ones are causing conflicts to the GECKO pipeline, and there are a lot of them; conflicts_type2 contains all the genes in the model that were mapped to more than one UniProt id (4 cases).

I hereby confirm that I have:

[X] Tested my code on my own computer for running the model
[X] Done this analysis in the master branch of the repository
[X] Checked that a similar issue does not exist already

JonathanRob commented 6 years ago

The mapping between multiple genes to a single protein (not enzyme complex) doesn't make sense

I thought I should comment on this point. This occurs when a gene has multiple allele variants (haplotypes), which can have different Ensembl IDs but correspond to the same protein (e.g., UniProt) ID. A good example is NEU1, located in the MHC region of chromosome 6, an area of the genome associated with many haplotypes. Although only one of these variants is on the "primary" genome assembly, I don't yet see a reason to remove the non-primary gene variants from the model.

Edit: Forgot to note, the translateGeneRules.m function, when used to convert grRules from Ensembl IDs to UniProt IDs, will simply merge all of the variants into the single protein ID, so it shouldn't cause any problems.

haowang-bioinfo commented 6 years ago

Thanks for clarification @JonathanRob. Probably it is the genes with multiple allele variants that caused the associations from multiple Ensembl gene ids (with ENSG prefix) to a single Uniport id. It would be worth to figure out if this is the only reason.

JonathanRob commented 6 years ago

Other cases include transcript splice variants, where multiple variants can encode the same protein. For example: AMY1A, AMY1B, AMY1C genes.

haowang-bioinfo commented 6 years ago

These three genes (AMY1A, AMY1B, AMY1C ) appear to be not alternative splicing variants @JonathanRob. Even though located in the same chromosome and next to each other, they have different coordinations: AMY1A (104,197,912-104,207,176) AMY1B (104,230,037-104,239,302) AMY1C (104,293,028-104,301,312). Hence, they should be separated genes but paralogues to each other. In principle, they need to be assigned with different UniProt ids.

JonathanRob commented 6 years ago

My apologies, I was quoting the NCBI entry for these genes. Although I agree that this is an odd case, and should seemingly be represented with different UniProt IDs, the fact is that UniProt currently only has one ID representing all three genes (AMY1A, AMY1B, and AMY1C). This is therefore unavoidable.

We manage the conversion between gene and protein IDs through the latest versions of these databases, so it is possible that future versions will separate them as individual UniProt IDs, and we will update accordingly.

It seems to me that the original issue of this thread has been addressed - we will use the current gene-protein associations that the databases provide, even if some are a bit strange. The cases of multiple gene IDs per protein (or vice-versa) are handled automatically in the translateGeneRules.m function, so are not a problem as far as I am aware. If there is disagreement on how these cases are handled in that function, I would be happy to discuss this further.

haowang-bioinfo commented 6 years ago

I agree that this should be an odd case, we probably have to accept it as such and expect modifications from UniPort in the future.

@JonathanRob, can you help to figure out if this is the only odd case or there are actually a group such cases, which might lead to a different scenario that has to be addressed differently.

haowang-bioinfo commented 6 years ago

Here are questions to @IVANDOMENZAIN.

In humanGEM, Ensembl ids (with ENSG prefix) are used in the genes field. Since UniProt provides two types of ids (SwissProt and Trembl) for a gene. According to gene-protein association provided by Ensembl, a total 6226 Ensembl genes have both SwissProt and Trembl ids. Given that unique gene-protein association is expected by the GECKO pipeline, it would be good to clarify which UniProt id types should be used when both are present?

By checking the two cell arrays of conflicts, all the UniProt ids are Swissprot type. It seems GECKO favors the Swissport version. Should the Trembl association be thus abandoned, when there is no Swissprot version?

Another follow-up question is which versions of UniProt ids are used by BRENDA, from which the kcat information is extracted?

JonathanRob commented 6 years ago

@Hao-Chalmers I investigated all cases, and they fell into three categories:

Genes are allele variants (as mentioned above for, e.g., NEU1. This was the majority of cases.
Genes are identical (or near-identical) sequence, but reside in more than one location in the genome. For example: CDY2A and CDY2B.
The case of AMY1 mentioned above, where I'm uncertain of how exactly these genes differ, but they seem to all encode an identical protein.

A summary of these results will be deposited in the proper location.

IVANDOMENZAIN commented 6 years ago

@Hao-Chalmers, In GECKO we recommend the user to upload a protein database file downloaded from Swissprot for their specific organism, due to the manual curation of this database. For the original model purposes I don't think that the TREMBL association should be totally discarded, but for GECKO purposes, the pipeline will just ignore those genes without a swissprot protein ID (which doesn't affect the model structure because the original metabolic reaction is still gonna be kept as part of the ecModel).

Regarding your question about BRENDA, GECKO does not look for Kcat values based on Uniprot -> Kcat associations in BRENDA, rather it looks for the Uniprot ID -> EC number association (if any) available in the Swissprot database and then the Kcat value is searched in BRENDA based on the obtained EC number and the substrate for the given reaction in the model.

haowang-bioinfo commented 6 years ago

Thanks to @IVANDOMENZAIN, now it's clear that only Swissprot type ids are utilized during the GECKO conversion. @JonathanRob, should we apply this rule by adjusting translateGeneRules function so that only Swissport type Uniprot ids are converted into the proteins and other relevant fields? I have to say I'm not sure if this is a good idea.

JonathanRob commented 6 years ago

@Hao-Chalmers Actually, we already use only SwissProt IDs, because the Ensembl database file we use for ID translation contains SwissProt IDs.

IVANDOMENZAIN commented 6 years ago

@JonathanRob thanks for all your help regarding this issue. Reading at your comments and talking to Benjamin, we have come out with a solution for the type-2 conflict (one gene being mapped to multiple proteins) in the GECKO pipeline, this will require the user to specify which (or maybe all) of the multiple proteins for a given gene should be taken into account for the ecModel construction or just discarded.

Regarding the function translateGeneRules.m I found that, when converting to Uniprot IDs, the output called genes shows a gene list with the same length of the original genes and also all of its elements contain a single uniprot ID. So, the fact of finding multiple uniprot IDs for a given gene is probably still being ignored or at least not updated in that specific output.

haowang-bioinfo commented 6 years ago

Okay @JonathanRob, then we have come to an consensus about using Swissprot ids in implementing translateGeneRules.

@IVANDOMENZAIN by checking Ensembl gene <-> Swissport id association, I didn't find the type-2 conflicts, they should have been resolved by translateGeneRules.

SysBioChalmers / Human-GEM