IGS / gEAR

The gEAR Portal was created as a data archive and viewer for gene expression data including microarrays, bulk RNA-Seq, single-cell RNA-Seq and more.
https://umgear.org
GNU Affero General Public License v3.0

Gene Mapping System Used in Inner Ear Organoid (Steinhart) Dataset #653

Closed gear-portal-team closed 2 months ago

gear-portal-team commented 3 months ago

From: Toby Clark

Email: trc43@cam.ac.uk

Server IP: 10.142.0.16

Msg: Hello, Sorry as this is not directly related to your program, but I've seen that gEAR can provide the ENSGID for any of the gene names from the study, and I haven't been able to find any other way to do this. Would you be able to tell me what gene symbol-ensgID mapping you use/how I could use it myself?

Thanks and best wishes, Toby

Tags: ['RNAseq']

Screenshot: None

jorvis commented 3 months ago

@toby-clark4 - could you clarify: are you interested in how we map gene symbols to Ensembl IDs in general for datasets which initially don't have them, or do you want to download this individual dataset, which already has its genes mapped to Ensembl IDs?

toby-clark4 commented 3 months ago

@jorvis - thanks for the reply. I'm more interested in the first point - I currently have the dataset in RDS format with gene symbols but no ensembl IDs, but can't figure out the mapping used to connect the gene symbols and ensembl IDs, which I need to tokenize the data. Searching the symbols with the gEAR dataset gives a link to the ensembl page for each gene, so I was wondering what mapping system you use for this?

jorvis commented 2 months ago

Got it. So the general strategy we use has the following steps:

  1. Load the full annotation from several Ensembl releases for each organism in gEAR (mouse, human, etc.)
  2. If we don't know which release the input file is for, check the gene symbol pool against all releases loaded to determine which has the best overlap. That is, your input file may be mouse, but if you don't know whether it's based on mouse release 88, 94, 101, etc, see which gene set best overlaps those that exist in each release.
  3. Once you have the release number, use the loaded annotation to add Ensembl IDs for the gene symbols which are present (and save a separate file of those which didn't map).
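
As a rough illustration of steps 2 and 3 (not the gEAR scripts themselves), here is a minimal Python sketch. It assumes each release's annotation has already been loaded into a plain dict of gene symbol to Ensembl ID; the function names `best_release` and `add_ensembl_ids`, and the placeholder IDs in the toy data, are hypothetical and exist only for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed shape: {release_number: {gene_symbol: ensembl_id}}
Releases = Dict[int, Dict[str, str]]


def best_release(symbols: Iterable[str], releases: Releases) -> int:
    """Return the release whose symbol set overlaps the input symbols most."""
    pool = set(symbols)
    # Count how many input symbols exist in each release and keep the maximum.
    return max(releases, key=lambda rel: len(pool.intersection(releases[rel])))


def add_ensembl_ids(
    symbols: Iterable[str], annotation: Dict[str, str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split symbols into (symbol, ensembl_id) pairs and a list of unmapped symbols."""
    mapped: List[Tuple[str, str]] = []
    unmapped: List[str] = []
    for symbol in symbols:
        if symbol in annotation:
            mapped.append((symbol, annotation[symbol]))
        else:
            unmapped.append(symbol)
    return mapped, unmapped


# Toy data: the IDs are placeholders, not real Ensembl identifiers.
toy_releases: Releases = {
    88: {"Sox2": "ID_A", "Pax2": "ID_B"},
    94: {"Sox2": "ID_A", "Pax2": "ID_B", "Atoh1": "ID_C"},
}
toy_symbols = ["Sox2", "Atoh1", "Pax2", "NotAGene"]

release = best_release(toy_symbols, toy_releases)
mapped, unmapped = add_ensembl_ids(toy_symbols, toy_releases[release])
```

In practice the unmapped list would be written to its own file, as described in step 3, so the symbols that failed to map can be inspected separately.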

These steps are performed with the following scripts:

  1. https://github.com/IGS/gEAR/blob/main/bin/load_ensembl_gbk_annotations.py
  2. https://github.com/IGS/gEAR/blob/main/bin/find_best_ensembl_release_match.py
  3. https://github.com/IGS/gEAR/blob/main/bin/add_ensembl_ids_to_tab_file.py

And a prerequisite of #1 is that you've created a database using our schema file before loading:

https://github.com/IGS/gEAR/blob/main/create_schema.sql

(although only a subset of that schema is used for this purpose)

It's a lot, I know, but it's what supports the gEAR overall. It wasn't written as a stand-alone mapping utility!

Alternatively, tools like BioMart should allow you to do this.
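
If you go the BioMart route, one possible workflow is to export a tab-separated table of gene IDs and gene names from the BioMart web interface and build the lookup from that file. The sketch below assumes the export has a header row with columns named `Gene stable ID` and `Gene name`; check your own file and adjust the names if they differ. The function `load_symbol_map` is a hypothetical helper written for this example, not part of BioMart or gEAR.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO


def load_symbol_map(
    handle: TextIO,
    symbol_col: str = "Gene name",    # assumed export column name
    id_col: str = "Gene stable ID",   # assumed export column name
) -> Dict[str, List[str]]:
    """Read a tab-separated export and return {gene_symbol: [ensembl_id, ...]}."""
    symbol_to_ids: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        symbol = row[symbol_col].strip()
        # Some genes have no symbol; skip those rows.
        if symbol and row[id_col] not in symbol_to_ids[symbol]:
            symbol_to_ids[symbol].append(row[id_col])
    return dict(symbol_to_ids)


# Toy stand-in for an exported file; the IDs are placeholders, not real ones.
toy_export = io.StringIO(
    "Gene stable ID\tGene name\n"
    "ID_A\tSox2\n"
    "ID_B\tPax2\n"
    "ID_C\tPax2\n"
    "ID_D\t\n"
)
symbol_map = load_symbol_map(toy_export)
```

A list is kept per symbol because a gene symbol can correspond to more than one Ensembl ID in a given release, and it is safer to see those cases than to silently pick one.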

jorvis commented 2 months ago

Closing. Please re-open if there are more questions.