SysBioChalmers / Human-GEM

The generic genome-scale metabolic model of Homo sapiens
https://sysbiochalmers.github.io/Human-GEM-guide/
Creative Commons Attribution 4.0 International

Gene alleles #68

Closed JonathanRob closed 5 years ago

JonathanRob commented 5 years ago

Description of the issue:

The model currently contains multiple alleles of many genes, which have identical gene abbreviations, but different Ensembl gene IDs. For example, the VARS gene (ENSG00000204394) has 6 alleles:

  1. ENSG00000231116
  2. ENSG00000226589
  3. ENSG00000096171
  4. ENSG00000224264
  5. ENSG00000227686
  6. ENSG00000231945

All of these IDs are currently present in the model. The ENSG00000204394 ID corresponds to the genotype of the primary assembly of the human genome, and will therefore be the gene ID that is typically present in, e.g., RNA-Seq data.

In my opinion, there is no need for all of these gene variants, and leaving them in the model may lead to confusion. I propose that we remove these allelic gene variants, and keep only the gene IDs corresponding to the primary human genome assembly. This will also help in simplifying many gene-reaction rules, which would be beneficial.


pecholleyc commented 5 years ago

Ensembl refers to them as:

Alternate sequences in human

All genome assemblies in Ensembl are haploid, and for most species there is only a single path through the genome. Currently, human is the only genome assembly where there is more than one path through the genome. The human genome assembly is maintained by the Genome Reference Consortium (GRC). The GRCh37 primary assembly comprises 24 chromosomes plus 39 unplaced scaffolds. In addition to the primary assembly, the GRCh37 major assembly release included 9 alternate loci including 6 haplotypes on the MHC region of chromosome 6. Subsequent minor releases on the GRCh37 assembly introduce additional alternate sequences known as patches.

There are two types of assembly patches:

Novel patches: provide alternate alleles. These regions are colored red in the Chromosome summary page and Region in detail page.
Fix patches: provide improved sequence for known assembly errors. These patches will be incorporated into the primary assembly in the next major assembly release. They are colored green in the Chromosome summary page and Region in detail page.

Minor assembly releases have the following naming convention: GRCh37.p7 for the seventh patch release of GRCh37.

I agree; I think we can discard those "secondary IDs". I used BioMart on the Ensembl website to get the list of IDs on the primary assembly only. The query and results are available here

I filtered on the scaffold/chromosome names, as explained in this post: https://www.biostars.org/p/109510/

ens_stableID_primary_assembly.txt
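The chromosome-name filter described above can be sketched in Python as follows. This is only an illustration, not the actual BioMart query: the column names, the `primary_gene_ids` helper, and the scaffold name in the example data are assumptions, and the filter keeps only chromosomes 1-22, X, Y and MT, so it would also drop the unplaced scaffolds that belong to the primary assembly.

```python
import csv
import io

# Assumption: primary-assembly chromosomes are named 1-22, X, Y and MT.
PRIMARY_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y", "MT"}


def primary_gene_ids(tsv_text, id_col="Gene stable ID",
                     chrom_col="Chromosome/scaffold name"):
    """Return the gene IDs whose chromosome name is in PRIMARY_CHROMOSOMES.

    The column names are hypothetical defaults; adjust them to the export.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_col] for row in reader
            if row[chrom_col] in PRIMARY_CHROMOSOMES}


# Tiny made-up export: the scaffold name on the second row is a placeholder,
# not the real location of that allele.
example = (
    "Gene stable ID\tChromosome/scaffold name\n"
    "ENSG00000204394\t6\n"
    "ENSG00000231116\tHYPOTHETICAL_ALT_SCAFFOLD\n"
)
print(primary_gene_ids(example))
```

An allow-list of chromosome names is used here instead of a prefix match on alternate scaffolds, because scaffold naming differs between assembly releases.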

Now, I think one way to correct the gene IDs in the model would be to get the gene names for the existing model IDs (this can be done with BioMart as well), cross-reference them with the file, and replace each ID that differs from the primary assembly ID.
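A minimal sketch of that cross-referencing step, assuming the two lookups (ID to gene name, and gene name to primary assembly ID) have already been retrieved. The function name `remap_to_primary` and the placeholder ID `ENSG00000000000` are hypothetical; the VARS IDs are the ones quoted at the top of this issue.

```python
def remap_to_primary(model_ids, id_to_name, name_to_primary_id):
    """Map each model gene ID to the primary-assembly ID sharing its name."""
    remapped = {}
    for gene_id in model_ids:
        name = id_to_name.get(gene_id)
        # Keep the original ID when the name or its primary ID is unknown.
        remapped[gene_id] = name_to_primary_id.get(name, gene_id)
    return remapped


# VARS example from this issue; the last model ID is a made-up placeholder
# standing in for a gene that is missing from the lookup tables.
id_to_name = {"ENSG00000231116": "VARS", "ENSG00000204394": "VARS"}
name_to_primary_id = {"VARS": "ENSG00000204394"}
model_ids = ["ENSG00000231116", "ENSG00000204394", "ENSG00000000000"]

result = remap_to_primary(model_ids, id_to_name, name_to_primary_id)
for old_id, new_id in result.items():
    print(old_id, "->", new_id)
```

IDs without a match are left unchanged rather than dropped, so they can be reviewed manually.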

Update:

JonathanRob commented 5 years ago

@pecholleyc Great, thank you. We can plan to update the model with this gene information in the near future.

haowang-bioinfo commented 5 years ago

It should be alright to remove (or merge) allelic gene variants that are (nearly) identical to each other and clearly have the same function.

Allelic variants with different functions (e.g. associated with varied or different phenotypes) should be kept and properly organized in some way (I am not sure whether the model contains any of these).

pecholleyc commented 5 years ago

I discussed this with the Ensembl helpdesk: the best way to get the primary assembly IDs is to use the MySQL database (or the BioPerl API... ). I have already written a python-mysql script, but I need your feedback before finalizing it. Here is what I suggest:

The script will use as input:

The script will output (or update the input file) with 2 columns per line:

ID                   primary assembly ID
ENSG00000285332      ENSG00000284826

There are some points to consider:

Someone could write a wrapper in Matlab to call the script and insert the new data into the model. If you have a MySQL connector in Matlab, that would be even better; I can provide the SQL queries.

Note: the alternative gene alleles/primary assembly topic has also been mentioned under issue #15

haowang-bioinfo commented 5 years ago

Yeah, it would be nice if we could come up with a Matlab-SQL solution.

JonathanRob commented 5 years ago

For now, the proposed plan is to filter the Ensembl database file in the repo (ensembl_ID_mapping_20180903.txt) such that non-primary gene IDs are removed. This filtered database file can then be used to update humanGEM to remove non-primary genes from GPRs.

We can think of more automated methods for the future, but for now this is the simplest solution.
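As a rough illustration of the last step (removing non-primary genes from GPRs), here is a Python sketch that assumes gene-reaction rules are stored as strings of gene IDs joined by "or" and "and". It handles only flat "or" rules and leaves anything with "and" or parentheses untouched; the helper name `drop_non_primary` is hypothetical and this is not the code used in the repository.

```python
def drop_non_primary(rule, primary_ids):
    """Remove non-primary gene IDs from a flat 'A or B or C' rule."""
    if " and " in rule or "(" in rule:
        # Nested rules need a real boolean-expression parser; skip them here.
        return rule
    genes = [g.strip() for g in rule.split(" or ") if g.strip()]
    kept = [g for g in genes if g in primary_ids]
    return " or ".join(kept)


# VARS primary ID and two of its alleles, as listed in this issue.
primary_ids = {"ENSG00000204394"}
rule = "ENSG00000204394 or ENSG00000231116 or ENSG00000226589"
print(drop_non_primary(rule, primary_ids))
```

A rule that contains only non-primary IDs would come back empty, so in practice those IDs should first be mapped to their primary counterparts rather than simply removed.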

pecholleyc commented 5 years ago

I finalized the python/MySQL script. It is a bit different from my previous description:

Input:

The output is written to the file specified as the 3rd argument. The structure of the file is similar to that of ensembl_ID_mapping_20180903.txt. It contains the IDs of all human genes mapped to the primary assembly genome.

SQL queries have been provided by Ensembl's helpdesk (tickets #320660 and #323315)

haowang-bioinfo commented 5 years ago

@pecholleyc thanks for the explanation, which clarified the source and content of ensembl_ID_mapping_20190207.tsv that was uploaded by 0f6f9e8.

This script should probably be included in humanGEM, even though it's in Python. A possible location could be ~/ComplementaryScripts/GPRs.

haowang-bioinfo commented 5 years ago

I noticed that the current naming style of the gene-ID tsv file would cause problems when updating to new versions. One solution is to use a constant name, instead of one with a variable date string, so that there is always just one file (e.g. ensembl_ID_mapping.tsv), preferably with one description line in the header that includes the Ensembl and genome versions. This would also avoid mis-calling by other associated functions/scripts.

JonathanRob commented 5 years ago

@Hao-Chalmers Yes, I agree completely. The original intent was to keep track of the date when it was retrieved, but I can see now that it actually causes a lot of issues with having to update functions/scripts accordingly. We can now instead store that type of information in a header or some other kind of metadata in the .tsv file itself.
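One possible way to keep that metadata inside a constant-named file is a leading comment line, sketched below in Python. The `#` prefix convention, the wording of the metadata line, and the `read_mapping` helper are assumptions for illustration; the column names and the ID pair are taken from the two-column example earlier in this thread.

```python
import csv


def read_mapping(text):
    """Split leading '#' metadata lines from the tab-separated table."""
    lines = text.splitlines()
    metadata = [ln[1:].strip() for ln in lines if ln.startswith("#")]
    body = [ln for ln in lines if ln and not ln.startswith("#")]
    rows = list(csv.DictReader(body, delimiter="\t"))
    return metadata, rows


# Hypothetical file content: one description line, then the table.
example = (
    "# source: Ensembl, release and genome version go here\n"
    "ID\tprimary assembly ID\n"
    "ENSG00000285332\tENSG00000284826\n"
)
metadata, rows = read_mapping(example)
print(metadata)
print(rows[0]["primary assembly ID"])
```

With this layout the file name never changes, and scripts that only need the table can skip the lines starting with `#`.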