Update TREAT-AD biodomains in enrichment/correlation plot

rbutleriii commented 11 months ago

Now has updated biodomains in medRxiv paper for I think 18-19 domains.

rbutleriii commented 9 months ago

Per the GSEA guidance, it does not recommend filtering genes prior to running the analysis (we already filter 0 count genes prior to DE). At this point it can just be run it on the (shrunken)L2FC

cyouh95 commented 9 months ago

"First, we identified 19 biological domains that capture the preponderance of AD-associated endophenotypes and defined them using an exhaustive set of Gene Ontology (GO) terms, with the intent to keep each domain siloed in a biologically coherent fashion."

Supplemental table S2 contains GO term to biodomain mapping

Next steps:

Using table S2, create individual gene to biodomain mapping
- Use biomaRt? Is there a script that generated the original reference/biodomains/TREAT-AD.Mouse.Genes.txt I can see?
Update biodomains in enrichment/correlation plots + GSEA plots

rbutleriii commented 9 months ago

Ideally we would use the synapse files we use for the GSEA plots. Those have all the information we need to map genes to BioDomains. I think they were first downloaded to the /labs/flongo/TREAT-AD/Biodomain_anntoations folder. There are a few scripts in there that make use of them, but I don't think they were in a single file. Also, I grabbed several other sets of genes like those in igraphs and those in the year2 hypothesis gene sets. I will take a look to see if I did that in a different folder. I have to figure out which project I first did the biodomain enrichment on.

rbutleriii commented 5 months ago

@cyouh95 , I think this one is already done in the dashboard & biodomain_enrichment.R scripts?

rbutleriii commented 5 months ago

The parent script is in /labs/flongo/TREAT-AD/TIP60/all_gene_lists.R. all_gene_lists.R.

This would probably be a good time to move it to the /labs/flongo/reference/biodomains folder and update the gene lists (as well as give it a script header and reformat for correct R syntax).

We can also review its conversion setup to make sure it is the best option.

cyouh95 commented 5 months ago

Script added to reference/biodomains/all_gene_lists.R

Notes:

ensembl_ver defined at top for fetching conversion data from biomaRt (i.e., human gene symbol to Ensembl ID, human Ensembl ID to mouse Ensembl ID)
Do we want to filter the genes list by gene_biotype and chromosome_name? Currently only being done if using gene symbol to Ensembl ID conversion (need to fix to be consistent). Alternatively, save full genes list here without filtering and filter later as needed (for example, we filter the list in genes_info.R for use in dashboards)

Sources for reference sets:

TreatAD:
- syn25428992 (annotated_biodomains_hsap.RDS)
Mostafavi:
- Table S3 (NIHMS962042-supplement-Sup_Table_3.xlsx)
Milind:
- syn23660902 (rosmap_submodule_sig_genes.RDS)
- syn23660898 (mayo_submodule_sig_genes.RDS)
- syn23660900 (mssm_submodule_sig_genes.RDS)
- Also used Milind_etal.group_annots.txt to filter modules, which was already existing inside reference/biodomains/
Wan:
- AD_modules_full.tsv (not sure of source, file already existed inside reference/biodomains/)
- Also used Figure S1 legend to define consensus clusters
TmtAD:
- Table S4 (41593_2021_999_MOESM3_ESM.xlsx, 4.Consensus.kMEs sheet)

rbutleriii commented 3 months ago

For TREAT-AD, they have updated their list of annotated mouse biodomains, so we can simplify the process and avoid build errors by not executing a human look-up at all see.

In terms of filtering, we do need to be consistent, the multiple filtering steps generally occur as a catch-all as each lookup can produce duplicates. This is especially the case for genes that exist on patch chromosomes (hence the chromosome filter), as there are duplicate gene IDs for a gene that is not a true 1:many ortholog. Although theoretically there exist some genes that are only on patches and haven't been fully integrated into the genome.

The biotype filter is more vague, as it is Ensembl specific, and certain genes (like Clec7a) have been classified as pseudogenes when they are very much expressed. Clec7a has been corrected since I think version 110. To some degree though this is as has to be, as plenty of genuine pseudo or TEC genes are not really biologically reliable. That they were included in a reference set is great and all, but we should probably make sure they are filtered out for all sets, human and mouse.

One caution though, is any ensembl lookup step will introduce duplicates (the *:many issue). I am looking into this again, as at some point we could adopt a 1:1 only policy to simplify our lives, but that will come at the expense of excluding biologically important genes because...its hard? 🤷 In any case, we would want to implement that as a broad policy, so not quite there yet.

Subsequent filtering in later scripts is good to keep, as in certain analysis types you may only examine protein_coding genes but that can be specified there.

So, to do:

[x] download mouse TREAT-AD (or explore on the fly fetching from synapse)
[x] add new resilience proteomics set
[x] generate filtered set for human and mouse for each

rbutleriii commented 3 months ago

Linked to https://github.com/Longo-Lab/PS19_C31_snRNAseq/issues/122

rbutleriii commented 3 months ago

Based on a bunch of checking, plus the persistent errors with the biomaRt queries, it seems like the most stable solution is to use gprofiler's gorth tool. It updates regularly from biomaRt, and offers a "most popular" ortholog for the 1:many issues. I put in an contact to them on how it resolves that, but for instance with the mouse Mt1 to 10 human MT1 genes, it returns MT1F.

It also returns the mouse symbol, mouse ENSID, human symbol, and human ENSID in one go, making lookups a single query that is much faster:

library(gprofiler2)

mGenes = c(
  "Hjurp", "6430706D22Rik", "A730008H23Rik", "Arl4c", "Sh3bp4", "Mt1"
)

# gprofiler output
y <- data.table(
  gorth(
    query = mGenes, 
    source_organism = "mmusculus", 
    target_organism = "hsapiens", 
    mthreshold = 1,
    filter_na = FALSE
  )
)

filter_na = FALSE would actually keep the empty rows, making joins relatively easy (though data.table handles it fine with a left join all.x = TRUE).

Will modify the lookup script to simplify the mouse to human, human to mouse exchanges in all_gene_lists.R

rbutleriii commented 2 months ago

Hmm, they got back to me and noted that the gorth ordering is arbitrary, so not a great one to use. At this point we can utilize the updated biomaRt and follow their methodology for high-confidence orthologies modified to select the highest ranking orthology per one2many gene. While we are at it, we can standardize the function to do a couple things:

[x] go both directions mouse -> human or human -> mouse
[x] specify whether the in/out is symbols or ids (the ensembl homologs page includes this information)
[x] order and select top ortholog (has to resolve: one2many, many2one, many2many)
[x] stores a local version of the table while the script is running to avoid multiple lookups (avoid dodgy ensembl connection).

rbutleriii commented 2 months ago

Also link https://github.com/Longo-Lab/scg_backup/issues/8

rbutleriii commented 2 months ago

Will have to return to generate the new filtered gene sets, but the function is implemented. Perhaps in the future refactor and/or relocate to a function script in scripts

rbutleriii commented 1 month ago

Still missing the resilience proteomics set in our 01-all_gene_lists.R file. Will have to consolidate.

rbutleriii commented 1 month ago

All have been added, now need to double check enrichment/correlation scripts in scripts to make sure they are updated.

[x] biodomain_TREAT-AD.R
[x] biodomain_Wan_etal.R
[x] biodomain_correlation.R
[x] biodomain_enrichment_functions.R
[x] biodomain_enrichment.R
[x] biodomain_Milind_etal.R
[x] biodomain_Mostafavi_etal.R
[x] biodomain_TMT-AD.R
[x] biodomain_TMT-Resilience.R

Also updated in reference/biodomains

[x] gene_info.R --> 02-gene_info.R

581f78f

rbutleriii commented 1 month ago

Need to demo these on an enrichment set. Ideally not a full single-cell one....

rbutleriii commented 3 weeks ago

Testing with PS19_C31_stim

rbutleriii commented 2 weeks ago

Returning to testing, it is going to be simpler to rely on 1:1 orthologs for now as the most easily to publish without the assumption that top match is better. Perhaps long term we can develop that as an independent analysis.

Longo-Lab / de_dashboards

Update TREAT-AD biodomains in enrichment/correlation plot #12