Gene alleles

JonathanRob commented 5 years ago

Description of the issue:

The model currently contains multiple alleles of many genes, which have identical gene abbreviations, but different Ensembl gene IDs. For example, the VARS gene (ENSG00000204394) has 6 alleles:

ENSG00000231116
ENSG00000226589
ENSG00000096171
ENSG00000224264
ENSG00000227686
ENSG00000231945

All of these IDs are currently present in the model. The ENSG00000204394 ID corresponds to the genotype of the primary assembly of the human genome, and will therefore be the gene ID that is typically present in, e.g., RNA-Seq data.

In my opinion, there is no need for all of these gene variants, and leaving them in the model may lead to confusion. I propose that we remove these allelic gene variants, and keep only the gene IDs corresponding to the primary human genome assembly. This will also help in simplifying many gene-reaction rules, which would be beneficial.

I hereby confirm that I have:

[X] Checked that a similar issue does not exist already

pecholleyc commented 5 years ago

Ensembl refers to them as:

Alternate sequences in human

All genome assemblies in Ensembl are haploid, and for most species there is only a single path through the genome. Currently, human is the only genome assembly where there is more than one path through the genome. The human genome assembly is maintained by the Genome Reference Consortium (GRC). The GRCh37 primary assembly comprises 24 chromosomes plus 39 unplaced scaffolds. In addition to the primary assembly, the GRCh37 major assembly release included 9 alternate loci including 6 haplotypes on the MHC region of chromosome 6. Subsequent minor releases on the GRCh37 assembly introduce additional alternate sequences known as patches.

There are two types of assembly patches:
Novel patches: provide alternate alleles. These regions are colored red in the Chromosome summary page and Region in detail page.
Fix patches: provide improved sequence for known assembly errors. These patches will be incorporated into the primary assembly in the next major assembly release. They are colored green in the Chromosome summary page and Region in detail page.
Minor assembly releases have the following naming convention: GRCh37.p7 for the seventh patch release of GRCh37.

I agree; I think we can discard those "secondary IDs". I used BioMart on the Ensembl website to get the list of IDs on the primary assembly only. The query and results are available here

I also filtered on the scaffold/chromosome names, as explained in this post: https://www.biostars.org/p/109510/

ens_stableID_primary_assembly.txt

Now, one way to correct the gene IDs in the model would be, I think, to get the gene names for the existing model IDs (this can also be done with BioMart), cross-reference them with the file, and replace the IDs where they differ.
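The cross-referencing described above could look roughly like the following minimal sketch. It assumes two lookup tables that are not part of this thread: `id_to_name` (model gene ID to gene name, e.g. exported from BioMart) and `name_to_primary` (gene name to the ID on the primary assembly, e.g. built from the file above). The function name and the placeholder ID are hypothetical, and the sketch assumes gene names are unique on the primary assembly, which may not always hold.

```python
def to_primary_ids(model_ids, id_to_name, name_to_primary):
    """Map model gene IDs to primary-assembly IDs via the gene name.

    Returns (replaced, unresolved):
      replaced   -- dict of model ID -> primary ID, only where they differ
      unresolved -- model IDs with no name or no primary-assembly match
    """
    replaced = {}
    unresolved = []
    for gene_id in model_ids:
        name = id_to_name.get(gene_id)
        primary_id = name_to_primary.get(name) if name is not None else None
        if primary_id is None:
            unresolved.append(gene_id)  # needs manual review
        elif primary_id != gene_id:
            replaced[gene_id] = primary_id
    return replaced, unresolved


# Example with VARS IDs quoted earlier in this issue.
id_to_name = {
    "ENSG00000204394": "VARS",
    "ENSG00000231116": "VARS",
    "ENSG00000096171": "VARS",
}
name_to_primary = {"VARS": "ENSG00000204394"}
model_ids = [
    "ENSG00000204394",
    "ENSG00000231116",
    "ENSG00000096171",
    "ENSG_PLACEHOLDER",  # made-up ID standing in for a gene with no match
]
replaced, unresolved = to_primary_ids(model_ids, id_to_name, name_to_primary)
```

IDs that end up in `unresolved` are deliberately left untouched, so that they can be checked by hand rather than silently dropped.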

Update:

include missing chromosomes X,Y and MT in the query
add gene position information in the result file

JonathanRob commented 5 years ago

@pecholleyc Great, thank you. We can plan to update the model with this gene information in the near future.

haowang-bioinfo commented 5 years ago

It should be alright to remove (or merge) allelic gene variants that are (nearly) identical to each other and clearly have the same function.

For the allelic variants with different functions (e.g. associated with varied or different phenotypes), they should be kept and somehow properly organized (not sure if there is any of these in the model).

pecholleyc commented 5 years ago

I discussed this with the Ensembl helpdesk: the best way to get the primary assembly IDs is to use the MySQL database (or the BioPerl API...). I have already made a python-mysql script, but I need your feedback before finalizing it. Here is what I suggest:

The script will use as input:

the version of Ensembl (current version 95)
the version of the human genome (current version 38) (these 2 parameters are required to connect to a database, e.g. homo_sapiens_core_95_38)
A one-column file with all the Ensembl gene IDs in the model

The script will output (or update the input file) with 2 columns per line:

ID                   primary assembly ID
 ENSG00000285332      ENSG00000284826
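As a rough illustration of this two-column output, here is a minimal sketch. It is not the actual script: the Ensembl queries are left out, and `primary_lookup` is assumed to be a dict already retrieved from the database. The function name, the `NA` marker for genes without a match, and the placeholder ID are all assumptions made for the example.

```python
import csv
import io


def write_mapping(model_ids, primary_lookup, handle, missing="NA"):
    """Write one 'ID <tab> primary assembly ID' line per model gene.

    Genes absent from primary_lookup get the `missing` marker and are
    also returned, so they can be handled separately.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["ID", "primary assembly ID"])
    unmapped = []
    for gene_id in model_ids:
        primary_id = primary_lookup.get(gene_id)
        if primary_id is None:
            unmapped.append(gene_id)
            primary_id = missing
        writer.writerow([gene_id, primary_id])
    return unmapped


# The pair below is the example given above; the second ID is made up.
primary_lookup = {"ENSG00000285332": "ENSG00000284826"}
buffer = io.StringIO()
unmapped = write_mapping(
    ["ENSG00000285332", "ENSG_PLACEHOLDER"], primary_lookup, buffer
)
```

Writing an explicit marker is only one option for genes without a match; they could equally be skipped or written to a separate file.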

There are some points to consider:

How should we deal with genes in the model that cannot be associated with any gene on the primary assembly?
When should the list of IDs be updated? FYI, when a new version of the genome is integrated into Ensembl, only the new database is updated (i.e. homo_sapiens_core_95_37 is no longer updated).
Since this solution sends queries directly to the Ensembl database, any change in their database model might break the queries (which could be avoided by using the Perl API). But I think this is very unlikely to happen, as the queries target only the core tables.
We can make use of this connection to the Ensembl database to retrieve the transcript IDs as well, in a separate script.

Someone can work on a sort of wrapper in Matlab to call the script and insert the new data into the model. But if you have a MySQL connector in Matlab, that would be even better; I can provide the SQL queries.

Note: alternative gene alleles/primary assembly have also been mentioned under issue #15

haowang-bioinfo commented 5 years ago

yeah, it would be nice if we can come up with a Matlab-SQL solution.

JonathanRob commented 5 years ago

For now, the proposed plan is to filter the Ensembl database file in the repo (ensembl_ID_mapping_20180903.txt) such that non-primary gene IDs are removed. This filtered database file can then be used to update humanGEM to remove non-primary genes from GPRs.

We can think of more automated methods for the future, but for now this is the simplest solution.
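To make the GPR update concrete, here is a minimal sketch of removing non-primary gene IDs from a gene-reaction rule. It assumes the rule is a plain string of IDs joined by " or " and that `primary_ids` is the set of IDs kept after filtering the database file. Rules containing "and" or parentheses are rejected rather than guessed at. The function name and the example rule are hypothetical, not taken from humanGEM.

```python
def drop_non_primary(rule, primary_ids):
    """Keep only primary-assembly gene IDs in a flat 'or' rule.

    Duplicates are removed and the original order is preserved.
    """
    if " and " in rule or "(" in rule or ")" in rule:
        raise ValueError("this sketch only handles flat 'or' rules")
    kept = []
    for gene_id in rule.split(" or "):
        gene_id = gene_id.strip()
        if gene_id in primary_ids and gene_id not in kept:
            kept.append(gene_id)
    return " or ".join(kept)


# Hypothetical rule built from the VARS IDs listed at the top of this issue.
primary_ids = {"ENSG00000204394"}
rule = "ENSG00000204394 or ENSG00000231116 or ENSG00000226589"
simplified = drop_non_primary(rule, primary_ids)
```

A rule that contains no primary ID at all comes back as an empty string, which would need to be reviewed before the reaction is left without any gene association.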

pecholleyc commented 5 years ago

I finalized the python/MySQL script. It is a bit different from my previous description:

Input:

the version of Ensembl (current version 95)
the version of the human genome (current version 38) (these 2 parameters are required to connect to a database, e.g. homo_sapiens_core_95_38)
the name/path of the output file to be created

The output is written to the file specified as the 3rd argument; the structure of the file is similar to ensembl_ID_mapping_20180903.txt. It contains the IDs of all human genes mapped to the primary assembly.
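For readers who want a picture of this interface, the three inputs could be handled along the following lines. This is only a sketch of the command-line handling, not the script itself: the argument names and both helper functions are assumptions, and the database connection and SQL queries are omitted. The database name follows the homo_sapiens_core_95_38 pattern given above.

```python
import argparse


def core_database_name(ensembl_version, genome_version, species="homo_sapiens"):
    """Build a core database name such as homo_sapiens_core_95_38."""
    return f"{species}_core_{ensembl_version}_{genome_version}"


def parse_args(argv=None):
    """Parse the three positional inputs described above."""
    parser = argparse.ArgumentParser(
        description="Retrieve primary-assembly gene IDs (sketch only)."
    )
    parser.add_argument("ensembl_version", type=int, help="e.g. 95")
    parser.add_argument("genome_version", type=int, help="e.g. 38")
    parser.add_argument("output_path", help="file to be created")
    return parser.parse_args(argv)


# Example invocation with the versions mentioned in this thread.
args = parse_args(["95", "38", "ensembl_ID_mapping.tsv"])
database = core_database_name(args.ensembl_version, args.genome_version)
```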

SQL queries have been provided by Ensembl's helpdesk (tickets #320660 and #323315)

haowang-bioinfo commented 5 years ago

@pecholleyc thanks for the explanation, which clarified the source and content of ensembl_ID_mapping_20190207.tsv that was uploaded by 0f6f9e8.

This Python script should probably be included in humanGEM, even though it's in Python. A possible location could be ~/ComplementaryScripts/GPRs.

haowang-bioinfo commented 5 years ago

I noticed that the current naming style of the gene-ID tsv file will cause problems when updating to new versions. One solution is to use a constant name, instead of one with a variable date string, so that there is always just one file (e.g. ensembl_ID_mapping.tsv), preferably with one description line at the top giving the Ensembl and genome versions. This would also avoid other associated functions/scripts calling the wrong file.

JonathanRob commented 5 years ago

@Hao-Chalmers Yes, I agree completely. The original intent was to keep track of the date when it was retrieved, but I can see now that it actually causes a lot of issues with having to update functions/scripts accordingly. We can now instead store that type of information in a header or some other kind of metadata in the .tsv file itself.
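One way such a header could work is sketched below. The layout is an assumption, not an agreed format: a single first line starting with `# ` that holds tab-separated `key=value` pairs, followed by the usual tab-separated rows. Both function names and the example row are hypothetical.

```python
import io


def write_with_header(handle, ensembl_version, genome_version, rows):
    """Write one metadata line, then the tab-separated data rows."""
    handle.write(
        f"# ensembl_version={ensembl_version}\tgenome_version={genome_version}\n"
    )
    for row in rows:
        handle.write("\t".join(row) + "\n")


def read_with_header(handle):
    """Return (metadata dict, data rows) from a file written as above."""
    first = handle.readline().rstrip("\n")
    if not first.startswith("# "):
        raise ValueError("missing metadata header line")
    metadata = dict(field.split("=", 1) for field in first[2:].split("\t"))
    rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    return metadata, rows


# Round trip through an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_with_header(buffer, 95, 38, [["ENSG00000204394", "VARS"]])
buffer.seek(0)
metadata, rows = read_with_header(buffer)
```

With the versions stored inside the file, scripts can keep referring to a constant file name while still being able to check which Ensembl release the IDs came from.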

SysBioChalmers / Human-GEM

Gene alleles #68
