AzSaied / Az_MAGMA_Benchmarking

This is a repo for my MAGMA benchmarking project with Imperial

Define a method for choosing GWAS for our 'Truth matrix' #3

Open AzSaied opened 8 months ago

AzSaied commented 7 months ago

Here, I want to decide which GWAS to include. Broadly speaking, I want GWAS in which I would expect a strong signal in one cell type, and for which I could (somehow) be confident of low or no signal in all other cell types.

Failing that - I would settle for a predictable signal from the other celltypes.

AzSaied commented 7 months ago

Contenders / things I need to read more about:

AzSaied commented 7 months ago

Cell type specific annotations:

Roadmap Epigenomics consortium generated an epigenomic map across diverse tissues and cell types. Integrating this resource with GWAS results enables the prioritisation of cell types relevant to specific phenotypes.

Paper: Roadmap Epigenomics Consortium et al. (2015) Integrative analysis of 111 reference human epigenomes. Nature, 518(7539), 317-330.

Image giving an example Truth matrix below. Note - this image is a subset of a larger truth matrix they made. This includes cell types but also less specific tissue types

AzSaied commented 7 months ago

Image

Epigenomic enrichments of genetic variants associated with diverse traits. Tissue-specific H3K4me1 peak enrichment significance (-log10 P value) for genetic variants associated with diverse traits.

AzSaied commented 7 months ago

RegulomeDB queries any given variant by intersecting its position with the genomic intervals that were identified to be functionally active regions from the computational analysis outputs of functional genomic assays such as TF ChIP-seq and DNase-seq (from the ENCODE database) as well as those overlapping the footprints and QTL data.

All the source data used in RegulomeDB v2.1 can be found on the ENCODE website

https://regulomedb.org/GWAS

-> compile a new truth matrix.

AzSaied commented 7 months ago

Epigenomic annotation of genetic variants using the Roadmap Epigenome Browser

http://epigenomegateway.wustl.edu/browser/roadmap/

The browser takes advantage of the over 10,000 epigenomic data sets it currently hosts, including 346 'complete epigenomes', defined as tissues and cell types for which we have collected a complete set of DNA methylation, histone modification, open chromatin and other genomic data sets. Data from both the NIH Roadmap Epigenomics and ENCODE resources are seamlessly integrated in the browser using a new Data Hub Cluster framework.

Investigators can specify any number of single nucleotide polymorphism (SNP)-associated regions and any type of epigenomic data, for which the browser automatically creates virtual data hubs through a shared hierarchical metadata annotation, retrieves the data and performs real-time clustering analysis.

AzSaied commented 7 months ago

Image

AzSaied commented 7 months ago

Image

AzSaied commented 7 months ago

"Large genomic consortia (for example, the Encyclopedia of DNA Elements (ENCODE)) are generating an unprecedented volume of data on the function of genetic variation.

The Genotype-Tissue Expression (GTEx) Project is a US National Institutes of Health (NIH) Common Fund project that aims to collect a comprehensive set of tissues from 900 deceased donors (for a total of about 20,000 samples) and to provide the scientific community with a database of genetic associations with molecular traits such as mRNA levels (see the main report on GTEx for Phase 1 data).

Other large-scale transcriptome data sets include Genetic European Variation in Health and Disease (GEUVADIS; 460 lymphoblastoid cell lines (LCLs)), Depression Genes and Networks (DGN; 922 whole-blood samples) and Braineac (130 individuals with multiple brain region samples).

Yet, effective methods that harness these reference transcriptome data sets for disease mapping are lacking."

Gamazon, E., Wheeler, H., Shah, K. et al. A gene-based association method for mapping traits using reference transcriptome data. Nat Genet 47, 1091–1098 (2015). https://doi.org/10.1038/ng.3367

AzSaied commented 6 months ago

Another option:

MiXcan: a framework for cell-type-aware transcriptome-wide association studies https://www.nature.com/articles/s41467-023-35888-4

MiXcan is a cell-type-aware transcriptome-wide association study approach that predicts cell-type-level expression, identifies disease-associated genes via combination of cell-type-level association signals for multiple cell types, and provides insight into the disease-critical cell type

AzSaied commented 6 months ago

Meeting with Brian

Advised using the Monarch initiative (https://previous.monarchinitiative.org/about/monarch)

The Monarch Initiative integrates, aligns, and re-distributes cross-species gene, genotype, variant, disease, and phenotype data.

Monarch lead the development of the Human Phenotype Ontology, which is used across the world for genomic diagnostics in genetic disease and other areas.

Monarch are a Driver Project for the Global Alliance for Genomics and Health (GA4GH), and are major contributors to the development of genomics standards within GA4GH.

Additionally, Monarch have developed Mondo, a unified disease ontology that represents the most comprehensive integration of disease entities ever achieved.

AzSaied commented 6 months ago

Specifically - Brian has downloaded a knowledge graph from Monarch

monarch_kg_cells.csv

The Monarch knowledge graph uses data from:

Alliance - a subset of model organism data from member databases that is harmonized to the same model

Bgee - Bgee is a database for retrieval and comparison of gene expression patterns across multiple animal species, produced from multiple data types (bulk RNA-Seq, single-cell RNA-Seq, Affymetrix, in situ hybridization, and EST data) and from multiple data sets (including GTEx data).

CTD - Comparative Toxicogenomics Database is a robust, publicly available database that aims to advance understanding about how environmental exposures affect human health

GOA - The Gene Ontology Annotation Database compiles high-quality Gene Ontology (GO) annotations to proteins in the UniProt Knowledgebase (UniProtKB), RNA molecules from RNACentral and protein complexes from the Complex Portal.

HGNC - (HUGO Gene Nomenclature Committee) The HGNC is responsible for approving unique symbols and names for human loci, including protein coding genes, ncRNA genes and pseudogenes, to allow unambiguous scientific communication

HPOA - The Human Phenotype Ontology group curates and assembles over 115,000 annotations to hereditary diseases using the HPO ontology. Here we create Biolink associations between diseases and phenotypic features, together with their evidence, and age of onset and frequency (if known). There are four HPOA ingests - 'disease-to-phenotype', 'disease-to-mode-of-inheritance', 'gene-to-disease' and 'gene-to-phenotype' - that parse out records from the HPO Annotation File.

NCBI - National Center for Biotechnology Information integrates information from a wide range of species. A record may include nomenclature, Reference Sequences (RefSeqs), maps, pathways, variations, phenotypes, and links to genome-, phenotype-, and locus-specific resources worldwide.

Panther - PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System. Panther Gene Orthology analyses generate testable hypotheses about gene function and biological processes using experimental results from other (especially highly studied so-called 'model') species, using protein (and sometimes, simply nucleic acid level) alignments of genomic sequences.

Phenio - PHENIO is an ontology for accessing and comparing knowledge concerning phenotypes across species and genetic backgrounds.

Pombase - PomBase is a comprehensive database for the fission yeast Schizosaccharomyces pombe

Reactome - Reactome is a free, open-source, curated and peer reviewed pathway database.

String - STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.

ZFIN - ZFIN is the Zebrafish Model Organism Database.

AzSaied commented 6 months ago

Using the Monarch knowledge graph - I can quickly and easily create a 'Truth Matrix' with just binary values (Most affected cell type = 1, all other cell types = 0). This is a start but will not provide sufficient granularity / nuance to distinguish between closely matched MAGMA_Celltyping experiments.

Image
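The binary matrix described above could be built along these lines in base R. This is only a sketch: the data frame `kg` and its columns `disease` and `cell_type` are hypothetical stand-ins, and the real column names in monarch_kg_cells.csv may well differ.

```r
# Hypothetical stand-in for the Monarch export: one row per
# disease / most-affected-cell-type pair (column names are assumptions).
kg <- data.frame(
  disease   = c("type 1 diabetes", "squamous cell carcinoma", "cardiomyopathy"),
  cell_type = c("pancreatic beta cell", "squamous epithelial cell", "cardiomyocyte"),
  stringsAsFactors = FALSE
)

# Cross-tabulate diseases against cell types, then cap the counts at 1 so that
# listed pairs are 1 and every other disease / cell type combination is 0.
truth <- unclass(table(kg$disease, kg$cell_type))
truth[truth > 0] <- 1L

print(truth)
```

Rows are diseases and columns are cell types, so `truth["type 1 diabetes", ]` gives the expected pattern for a single trait.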

This is all well and good, but I would need to: 1) tie each 'disease' to the phenotype of a published GWAS, and 2) map each cell type to the cell types used in MAGMA Celltyping.

Before doing either of those - I need to see if I can interrogate the Monarch data to find other celltypes which are affected for a given disease.

AzSaied commented 6 months ago

The "semantic backbone" of the Monarch Knowledge Graph comes from PHENIO (https://monarch-initiative.github.io/monarch-ingest/Sources/phenio/)

Designed as an application ontology, PHENIO integrates a variety of ontological concepts, in particular the "core entities" in the Monarch Knowledge Graph (KG), including diseases, phenotypes and anatomical entities.

PHENIO integrates several different types of hierarchical relationships from a variety of sources.

These include:

Chemical entities and relationships from CHEBI

Disease entities and relationships from MONDO

Abnormal phenotypes of humans (HPO), mouse and other mammalian species (MPO), the nematode worm Caenorhabditis elegans (WBBT), and zebrafish (ZFA).

A full list of files used in the construction of PHENIO is available here.

AzSaied commented 6 months ago

Discussion with Alan:

Alan made the point that even if Monarch / MONDO can provide some data on cell type associations beyond the most prominent one, we cannot take that information as a ground truth. If the MONDO disease ontology is constantly being updated with newly discovered cell type associations, we cannot assume its knowledge is complete, and it is unlikely to be updated with confirmed negative associations between various cell types and phenotypes.

This represents a fairly major pivot in my direction of travel. Until now - I thought the problem I was trying to solve could be reduced down to finding a source of data into which I could plug in some input (be it a phenotype, SNPs, genes or genesets) and get a reliable cell type association...and so long as my means of establishing this association was distinct from the MAGMA Celltyping method, could be used to independently verify / measure MAGMA Celltyping performance.

I now understand, after talking to Alan, that even with the best maintained data source in the world, I will not be able to assume complete enough knowledge to establish a 'ground truth'.

Alan and I discussed alternative approaches.

One would be to look only at the one cell type which we can be confident is affected in a given phenotype. I'm paraphrasing here, but in effect we would use these to measure the sensitivity of MAGMA Celltyping. An example of this could be to look at cancers. I can be very confident that a squamous cell carcinoma of the skin would affect cutaneous squamous cells. I suppose the point here is that by choosing not to look at any other cell types we gain certainty in our measure, at the cost of a lot of precision / resolution; but if we include a large number of cancer GWAS we can partially mitigate this.
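As a rough sketch of this sensitivity idea: across many GWAS, take only the one cell type we are confident about and ask how often it is called significant. The table `results`, its columns, the `expected` lookup and the 0.05 cut-off are all assumptions for illustration, not the actual MAGMA Celltyping output format.

```r
# Hypothetical results table: one row per GWAS / cell type tested.
# Column names and values are made up for illustration.
results <- data.frame(
  gwas      = c("gwas_a", "gwas_a", "gwas_b", "gwas_b"),
  cell_type = c("squamous_cell", "melanocyte", "cardiomyocyte", "fibroblast"),
  q         = c(0.001, 0.40, 0.20, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: the one cell type we are confident is affected per GWAS.
expected <- c(gwas_a = "squamous_cell", gwas_b = "cardiomyocyte")

# Keep only the rows for the expected cell type of each GWAS and ignore
# every other cell type, as described above.
is_expected <- results$cell_type == expected[results$gwas]

# Sensitivity: fraction of GWAS whose expected cell type passes the
# (assumed) significance threshold.
detected <- results$q[is_expected] < 0.05
sensitivity <- mean(detected)

print(sensitivity)
```

Because each GWAS contributes a single yes/no, the estimate only becomes informative once a large number of GWAS are included.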

We didn't discuss it in so many words, but what would be nice would be if we could use the same logic to create a test with sufficient precision/resolution to measure specificity. This would be more tricky as selecting celltypes which we can be very confident are not associated with a given phenotype does not have an obvious 'cancer' analogy. On the one hand, this approach might just feel like kicking the can down the road, but on the other - maybe this approach of taking one (/ a few) cell types from a large number of studies might simplify the problem somewhat.

AzSaied commented 5 months ago

I am compiling (two) list(s) of candidate GWAS on the basis of the above idea. The first (1) to measure sensitivity, with a small number of celltypes known to be affected for a given trait. The second (2) to measure specificity where I have to get a list of celltypes that I have high confidence are not affected for a given trait. This is more tricky.

My idea for (1) is to select traits where we would have a very high confidence of a specific celltype being affected - eg Cancers, non ischaemic cardiomyopathy.

Depending on how many of those there are, and of what size and quality, we could stray into some where the celltype is less certain but we remain confident based on 'textbook' knowledge, eg. Type 1 diabetes and pancreatic beta cells, AIDS susceptibility and CD4+ T cells, etc.

My idea for (2) - there are GWAS where the trait in question is not a disease per se - but the expression level of a gene or protein, eg:

There are hundreds upon hundreds of these - which come up if you put 'protein' in the GWAS catalogue.

For the gene expression GWAS - I could use things like RNA-seq data in GTEx, Illumina, BioGPS, and SAGE (Serial Analysis of Gene Expression)...

For the protein expression GWAS - I could use ProteomicsDB, PaxDb, MaxQB, and MOPED data...

...to find celltypes that have no association or expression of the gene or protein in question.

AzSaied commented 4 months ago

Nathan's comments after my presentations: To consider if QTLs could be used. I will read this paper: https://www.nature.com/articles/s41588-021-00913-z

And to read this paper on 'Plasma proteomic associations with genetics and health in the UK Biobank' https://www.nature.com/articles/s41586-023-06592-6#Sec4

I will look at tools ProteomicsDB, PaxDb, MaxQB, and MOPED to see if I can identify any example proteins which are defining characteristics of otherwise closely related cells in a given tissue (I will start with brain)

AzSaied commented 1 month ago

In the process of choosing which GWAS to use in the matrix. It makes sense to pair proteomic genes with the cancer GWAS too - eg if looking at basal cell carcinoma of the skin as one of my cancer GWAS studies - I should think about skin when choosing my proteins too. Specifically - by looking at all of the cell types in the skin, and choosing proteins which are highly expressed in some skin cells, and not expressed at all in others.

...etc
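The selection rule in this comment (high expression in some cell types of a tissue, none at all in others) can be sketched as a simple filter over a gene-by-cell-type expression matrix. The gene names, values and the `max_expr >= 50` cut-off below are made up for illustration and are not taken from any atlas.

```r
# Made-up expression values (genes x cell types) for one tissue.
expr <- matrix(
  c( 0,  0, 120, 95,
    15, 20,  18, 22,
     0, 40,  35, 60),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("GENE_A", "GENE_B", "GENE_C"),
    c("b_cell", "erythroid", "t_cell", "macrophage")
  )
)

# For each gene: how many cell types show zero expression, which ones,
# and the highest expression seen in any cell type.
candidates <- data.frame(
  gene     = rownames(expr),
  n_zero   = rowSums(expr == 0),
  max_expr = apply(expr, 1, max),
  zero_in  = apply(expr == 0, 1, function(z) paste(colnames(expr)[z], collapse = ", ")),
  stringsAsFactors = FALSE
)

# Keep genes with at least one zero cell type and a high maximum elsewhere
# (the threshold of 50 is an arbitrary assumption), most zeros first.
shortlist <- candidates[candidates$n_zero > 0 & candidates$max_expr >= 50, ]
shortlist <- shortlist[order(-shortlist$n_zero), ]

print(shortlist)
```

The `n_zero` and `zero_in` columns correspond to the "3x0 [b-cell, erythroid cell, plasma cell]" style of note used in the shortlists in this thread.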

AzSaied commented 1 month ago

Diffuse large B cell lymphoma + chronic lymphoid leukaemia -> bone marrow +/- pbmc (peripheral blood mononuclear cell)

Bone marrow:

SEMG1 - 3x0 [b-cell, erythroid cell, plasma cell]. v.v.high max expression in t and macro, none in b, eryth or plasma. All 0 locally

LALBA - 3x0 [b-cell, erythroid cell, plasma cell]. high t and macroph, low in others. All 0 locally

PATE4 - 3x0 [b-cell, erythroid cell, plasma cell]. lots in t cells and macrophages

HTN3 - 1x0 [erythroid]. v high - none in eryth. All 0 locally though

SEMG2 - 3x0 [b-cell, erythroid cell, plasma cell]. v.high max expression in t and macro, none in b, eryth or plasma. All 0 locally

HTN1 - 1x0 [erythroid]. v.v high - none in eryth. All 0 locally though

KIR2DL4 - 1x0 [erythroid]. mostly in t-cells. DECENT LOCAL READS

MARCO - super high macro, zero erythroid. DECENT LOCAL READS

PAEP - 1x0, high t and macroph, low in others. All 0 locally

S100A7 - 1x0, high t and macroph, low in others. All 0 locally

PIP - 1x0, high in everything but 0 in erythroid. All 0 locally

SMR3B - 1x0, v.high in everything but 0 in erythroid. All 0 locally

SFTPC - 1x0, high t, b, macro. low plasma, 0 erythroid. All 0 locally

CCL18 - 1x0, never expressed in erythroid cells, high max expression in macrophages

MRC1 - 1x0, never expressed in erythroid cells, high max expression in macrophages. LOCAL READS

PRB4 - 1x0, not in erythroid, lots in macrophages, locally all 0 though

CXCL8 - 1x0, super high macro, 0 erythroid, tiny LOCAL READS

DUSP4 - 1x0, high in everything but 0 in erythroid. SOME LOCAL READS!

IER3 - 1x0, high in everything but 0 in erythroid. SOME LOCAL READS!

TNFSF9 - 1x0, high in everything but 0 in erythroid. SOME LOCAL READS!

AzSaied commented 1 month ago

Shortlist for pbmc

SEMG1 - 5x0 [nk cells, monocytes, b-cells, dendritic, platelets]

APCS - 5x0, [macrophages, nk cells, monocytes, dendritic cells, platelets], t cell - small

HTN3 - 4x0, [nk, mono, platelet, dendritic], macro+++, t,b ++. All 0 locally

PRB4 - 4x0 [nk cells, monocytes, dendritic, platelets]. Macro +++, t, b ++

HTN1 - 3x0, [monocytes, dendritic cells, platelets]. macro,t,b +++

TMEM40 - [dendritic] , platelets +. Local reads present.

CLEC1B - 1x0 [dendritic] , platelets +. Local reads present.

AzSaied commented 1 month ago

Shortlist of proteins for skin:

HTN3 - [langerhans, melanocytes], massive macrophages, sm, t,

GPC3 - [melanocytes], fibroblasts, sm+++, LOADS OF LOCAL

SFRP4 - [melanocytes], fibroblasts +++, LOADS OF LOCAL

SLURP1 - [b-cells], suprabasal keratinocytes+++

PRB4 - [langerhans, melanocytes], all 0 locally

SEMG1 - [langerhans, basal keratinocytes, melanocytes, b, granulocytes]

MARCO - [melanocytes], macrophage ++, SOME LOCAL

PCSK2 - [granulocytes] melanocytes +, ALL LOCAL

CRABP1 - [granulocytes] everything else - +, ALL LOCAL

S100A7 - [melanocytes], basal keratin +++

AzSaied commented 2 weeks ago

Here is the magic code Alan wrote which concatenated all of the separate gz files into a single one to be munged…

# Quick check that a single chromosome file reads correctly
data.table::fread("discovery_chr1_KRT6C:P48668:OID30169:v1:Cardiometabolic_II.gz")

# Read each autosome's file into a named list
chrs <- paste0("chr", 1:22)
all_chrs_dat <- vector(mode = "list", length = length(chrs))
names(all_chrs_dat) <- chrs
for (chr_i in chrs) {
  pth_i <- paste0("discovery_", chr_i, "_KRT6C:P48668:OID30169:v1:Cardiometabolic_II.gz")
  tmp <- data.table::fread(pth_i)
  all_chrs_dat[[chr_i]] <- tmp
}

# Stack the per-chromosome tables into a single table ready to be munged
dat <- data.table::rbindlist(all_chrs_dat)

AzSaied commented 2 weeks ago

This website seems to have a lot of transcriptomic data, which I think could be useful in cross-referencing / validating the information from the Human Protein Atlas (when it says XXXX protein is not expressed in cell type 'abcd').

https://maayanlab.cloud/Harmonizome/gene/MSLNL

AzSaied commented 2 weeks ago

If the transcriptomic data doesn't align perfectly with the Human Protein Atlas data, we could take a stringent view, and define a true negative only as a cell type for which both data sources report zero expression of the protein.
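A minimal sketch of that stringent rule, assuming each source can be reduced to one expression value per cell type for a given protein. The vectors `protein` and `rna` and all of their values are made up; they stand in for the Human Protein Atlas and the transcriptomic source respectively.

```r
# Made-up expression values for one protein across three cell types.
cells   <- c("neuron", "astrocyte", "hepatocyte")
protein <- c(neuron = 0, astrocyte = 0, hepatocyte = 12)  # stand-in for protein-level data
rna     <- c(neuron = 0, astrocyte = 3, hepatocyte = 30)  # stand-in for transcript-level data

# Stringent definition: a cell type is a true negative only when BOTH
# sources report zero expression.
true_negative <- protein[cells] == 0 & rna[cells] == 0

# Cell types that qualify as true negatives under this rule.
print(names(true_negative)[true_negative])
```

Here the astrocyte is excluded because the two sources disagree, which is the conservative behaviour this comment is asking for.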

AzSaied commented 2 weeks ago

Shortlist of proteins to try:

ALPI - 427 zeros. No expression in brain. High expression where it is expressed.

FSHB - 503 zeros

CABP2 - 499 zeros

GAGE2A - 431 zeros. Highly expressed in testes. Present in endothelial cells

MSLNL - 429 zeros. No expression in the brain (according to the Human Protein Atlas), which is pretty rare

ZNRF4 - 430 zeros. High expression where it is expressed.

AzSaied commented 2 weeks ago

This is a reminder to do some more brain proteins