Closed haowang-bioinfo closed 5 years ago
The mapping between multiple genes to a single protein (not enzyme complex) doesn't make sense
I thought I should comment on this point. This occurs when a gene has multiple allele variants (haplotypes), which can have different Ensembl IDs but correspond to the same protein (e.g., UniProt) ID. A good example is NEU1, located in the MHC region of chromosome 6, an area of the genome associated with many haplotypes. Although only one of these variants is on the "primary" genome assembly, I don't yet see a reason to remove the non-primary gene variants from the model.
Edit: Forgot to note, the translateGeneRules.m
function, when used to convert grRules from Ensembl IDs to UniProt IDs, will simply merge all of the variants into the single protein ID, so it shouldn't cause any problems.
Thanks for clarification @JonathanRob. Probably it is the genes with multiple allele variants that caused the associations from multiple Ensembl gene ids (with ENSG prefix) to a single Uniport id. It would be worth to figure out if this is the only reason.
Other cases include transcript splice variants, where multiple variants can encode the same protein. For example: AMY1A, AMY1B, AMY1C genes.
These three genes (AMY1A, AMY1B, AMY1C ) appear to be not alternative splicing variants @JonathanRob. Even though located in the same chromosome and next to each other, they have different coordinations: AMY1A (104,197,912-104,207,176) AMY1B (104,230,037-104,239,302) AMY1C (104,293,028-104,301,312). Hence, they should be separated genes but paralogues to each other. In principle, they need to be assigned with different UniProt ids.
My apologies, I was quoting the NCBI entry for these genes. Although I agree that this is an odd case, and should seemingly be represented with different UniProt IDs, the fact is that UniProt currently only has one ID representing all three genes (AMY1A, AMY1B, and AMY1C). This is therefore unavoidable.
We manage the conversion between gene and protein IDs through the latest versions of these databases, so it is possible that future versions will separate them as individual UniProt IDs, and we will update accordingly.
It seems to me that the original issue of this thread has been addressed - we will use the current gene-protein associations that the databases provide, even if some are a bit strange. The cases of multiple gene IDs per protein (or vice-versa) are handled automatically in the translateGeneRules.m
function, so are not a problem as far as I am aware. If there is disagreement on how these cases are handled in that function, I would be happy to discuss this further.
I agree that this should be an odd case, we probably have to accept it as such and expect modifications from UniPort in the future.
@JonathanRob, can you help to figure out if this is the only odd case or there are actually a group such cases, which might lead to a different scenario that has to be addressed differently.
Here are questions to @IVANDOMENZAIN.
In humanGEM, Ensembl ids (with ENSG prefix) are used in the genes
field. Since UniProt provides two types of ids (SwissProt and Trembl) for a gene. According to gene-protein association provided by Ensembl, a total 6226 Ensembl genes have both SwissProt and Trembl ids. Given that unique gene-protein association is expected by the GECKO pipeline, it would be good to clarify which UniProt id types should be used when both are present?
By checking the two cell arrays of conflicts, all the UniProt ids are Swissprot type. It seems GECKO favors the Swissport version. Should the Trembl association be thus abandoned, when there is no Swissprot version?
Another follow-up question is which versions of UniProt ids are used by BRENDA, from which the kcat information is extracted?
@Hao-Chalmers I investigated all cases, and they fell into three categories:
A summary of these results will be deposited in the proper location.
@Hao-Chalmers, In GECKO we recommend the user to upload a protein database file downloaded from Swissprot for their specific organism, due to the manual curation of this database. For the original model purposes I don't think that the TREMBL association should be totally discarded, but for GECKO purposes, the pipeline will just ignore those genes without a swissprot protein ID (which doesn't affect the model structure because the original metabolic reaction is still gonna be kept as part of the ecModel).
Regarding your question about BRENDA, GECKO does not look for Kcat values based on Uniprot -> Kcat associations in BRENDA, rather it looks for the Uniprot ID -> EC number association (if any) available in the Swissprot database and then the Kcat value is searched in BRENDA based on the obtained EC number and the substrate for the given reaction in the model.
Thanks to @IVANDOMENZAIN, now it's clear that only Swissprot type ids are utilized during the GECKO conversion. @JonathanRob, should we apply this rule by adjusting translateGeneRules
function so that only Swissport type Uniprot ids are converted into the proteins
and other relevant fields? I have to say I'm not sure if this is a good idea.
@Hao-Chalmers Actually, we already use only SwissProt IDs, because the Ensembl database file we use for ID translation contains SwissProt IDs.
@JonathanRob thanks for all your help regarding this issue. Reading at your comments and talking to Benjamin, we have come out with a solution for the type-2 conflict (one gene being mapped to multiple proteins) in the GECKO pipeline, this will require the user to specify which (or maybe all) of the multiple proteins for a given gene should be taken into account for the ecModel construction or just discarded.
Regarding the function translateGeneRules.m
I found that, when converting to Uniprot IDs, the output called genes
shows a gene list with the same length of the original genes and also all of its elements contain a single uniprot ID. So, the fact of finding multiple uniprot IDs for a given gene is probably still being ignored or at least not updated in that specific output.
Okay @JonathanRob, then we have come to an consensus about using Swissprot ids in implementing translateGeneRules
.
@IVANDOMENZAIN by checking Ensembl gene <-> Swissport id association, I didn't find the type-2 conflicts, they should have been resolved by translateGeneRules
.
This issue is reported by @IVANDOMENZAIN.
Description of the issue:
During converting from humanGEM (v 0.1.0) to enzyme-constraint human model, some non-unique gene-protein associations were found and caused conflicts in the GECKO conversion process.
Expected feature/value/output:
The mapping between multiple genes to a single protein (not enzyme complex) doesn't make sense, even though mapping between multiple proteins to a single gene is possible. It is expected to figure out these non-unique associations. And then a
protein
field (maybe other relevant fields) would be good to have and with the correct and unique association between genes (Ensembl id) and proteins (UniProt id).Current feature/value/output:
A Matlab file
gene_protein_conflicts.mat
was provided to demonstrate the current status. It has two cell arrays:conflicts_type1
contains all the UniProt ids that were mapped to more than one gene in the model (44 cases). -> These ones are causing conflicts to the GECKO pipeline, and there are a lot of them;conflicts_type2
contains all the genes in the model that were mapped to more than one UniProt id (4 cases).I hereby confirm that I have:
master
branch of the repository