Open rbutleriii opened 11 months ago
Per the GSEA guidance, it does not recommend filtering genes prior to running the analysis (we already filter 0 count genes prior to DE). At this point it can just be run it on the (shrunken)L2FC
"First, we identified 19 biological domains that capture the preponderance of AD-associated endophenotypes and defined them using an exhaustive set of Gene Ontology (GO) terms, with the intent to keep each domain siloed in a biologically coherent fashion."
Next steps:
biomaRt
? Is there a script that generated the original reference/biodomains/TREAT-AD.Mouse.Genes.txt
I can see?Ideally we would use the synapse files we use for the GSEA plots. Those have all the information we need to map genes to BioDomains. I think they were first downloaded to the /labs/flongo/TREAT-AD/Biodomain_anntoations
folder. There are a few scripts in there that make use of them, but I don't think they were in a single file. Also, I grabbed several other sets of genes like those in igraphs
and those in the year2 hypothesis
gene sets. I will take a look to see if I did that in a different folder. I have to figure out which project I first did the biodomain enrichment on.
@cyouh95 , I think this one is already done in the dashboard
& biodomain_enrichment.R
scripts?
The parent script is in /labs/flongo/TREAT-AD/TIP60/all_gene_lists.R
. all_gene_lists.R
.
This would probably be a good time to move it to the /labs/flongo/reference/biodomains
folder and update the gene lists (as well as give it a script header and reformat for correct R syntax).
We can also review its conversion setup to make sure it is the best option.
Script added to reference/biodomains/all_gene_lists.R
Notes:
ensembl_ver
defined at top for fetching conversion data from biomaRt
(i.e., human gene symbol to Ensembl ID, human Ensembl ID to mouse Ensembl ID)gene_biotype
and chromosome_name
? Currently only being done if using gene symbol to Ensembl ID conversion (need to fix to be consistent). Alternatively, save full genes list here without filtering and filter later as needed (for example, we filter the list in genes_info.R
for use in dashboards)Sources for reference sets:
reference/biodomains/
reference/biodomains/
)For TREAT-AD, they have updated their list of annotated mouse biodomains, so we can simplify the process and avoid build errors by not executing a human look-up at all see.
In terms of filtering, we do need to be consistent, the multiple filtering steps generally occur as a catch-all as each lookup can produce duplicates. This is especially the case for genes that exist on patch chromosomes (hence the chromosome
filter), as there are duplicate gene IDs for a gene that is not a true 1:many ortholog
. Although theoretically there exist some genes that are only on patches and haven't been fully integrated into the genome.
The biotype
filter is more vague, as it is Ensembl specific, and certain genes (like Clec7a
) have been classified as pseudogenes when they are very much expressed. Clec7a
has been corrected since I think version 110. To some degree though this is as has to be, as plenty of genuine pseudo or TEC genes are not really biologically reliable. That they were included in a reference set is great and all, but we should probably make sure they are filtered out for all sets, human and mouse.
One caution though, is any ensembl lookup step will introduce duplicates (the *:many
issue). I am looking into this again, as at some point we could adopt a 1:1 only
policy to simplify our lives, but that will come at the expense of excluding biologically important genes because...its hard? 🤷 In any case, we would want to implement that as a broad policy, so not quite there yet.
Subsequent filtering in later scripts is good to keep, as in certain analysis types you may only examine protein_coding
genes but that can be specified there.
So, to do:
Based on a bunch of checking, plus the persistent errors with the biomaRt queries, it seems like the most stable solution is to use gprofiler's gorth tool. It updates regularly from biomaRt, and offers a "most popular" ortholog for the 1:many
issues. I put in an contact to them on how it resolves that, but for instance with the mouse Mt1
to 10 human MT1
genes, it returns MT1F
.
It also returns the mouse symbol, mouse ENSID, human symbol, and human ENSID in one go, making lookups a single query that is much faster:
library(gprofiler2)
mGenes = c(
"Hjurp", "6430706D22Rik", "A730008H23Rik", "Arl4c", "Sh3bp4", "Mt1"
)
# gprofiler output
y <- data.table(
gorth(
query = mGenes,
source_organism = "mmusculus",
target_organism = "hsapiens",
mthreshold = 1,
filter_na = FALSE
)
)
filter_na = FALSE
would actually keep the empty rows, making joins relatively easy (though data.table handles it fine with a left join all.x = TRUE
).
Will modify the lookup script to simplify the mouse to human, human to mouse exchanges in all_gene_lists.R
Hmm, they got back to me and noted that the gorth
ordering is arbitrary, so not a great one to use. At this point we can utilize the updated biomaRt and follow their methodology for high-confidence orthologies modified to select the highest ranking orthology per one2many gene. While we are at it, we can standardize the function to do a couple things:
Will have to return to generate the new filtered gene sets, but the function is implemented. Perhaps in the future refactor and/or relocate to a function script in scripts
Still missing the resilience proteomics set in our 01-all_gene_lists.R
file. Will have to consolidate.
All have been added, now need to double check enrichment/correlation scripts in scripts
to make sure they are updated.
Also updated in reference/biodomains
Need to demo these on an enrichment set. Ideally not a full single-cell one....
Testing with PS19_C31_stim
Returning to testing, it is going to be simpler to rely on 1:1
orthologs for now as the most easily to publish without the assumption that top
match is better. Perhaps long term we can develop that as an independent analysis.
Now has updated biodomains in medRxiv paper for I think 18-19 domains.