to do list - Githubissues

bfairkun commented 2 years ago

[ ] Finish molQTL plotting scripts and plot coverage data for a bunch of colocalized QTL effects
[ ] Add more GWAS. Need to edit some scripts to accommodate binary outcomes for some gwas... see (https://rdrr.io/github/jrs95/hyprcoloc/f/vignettes/hyprcoloc.Rmd), and how to convert summary stats to beta and se for coloc: https://stats.stackexchange.com/questions/327666/convert-or-to-beta-and-find-standard-errors-from-confidence-interval
[ ] Process TF QTLs from Tehranchi... get TF QTL SNPs, with delta PWMs for those that are over known motifs.
[ ] ...consider using delta PWM (for TF or for splice sites) as instrumental variable for estimating effect of TF-->chromatin, or TF-->splicing, or splicing --> chromatin. For this, I may want to get a list of SNPs to consider, (eg sQTL SNPs in splice sites) and quantify all the features I need for that analysis: a deltaPWM, sQTL beta, and chromatin beta based on signal within a window +/- 200bp (or some other window size I will decide based on manual inspection) of the splice site.
[ ] gene-wise or exon-wise, intron-wise meta plots to summarise assays... for example, to verify H3K36me3 is over gene bodies and also enriched over included exons.
[ ] publication quality figures for some figures to describe how chRNA is different... Carlos' heat map... How much enrichment for NMD junctions, how much more coverage over introns, etc.
[ ] re-run gwas colocalizations with either the full Geuvadis (to ask whether the unique chRNA colocalizations can also be picked up with larger RNA-seq sample size), or with different Geuvadis sub-populations (to ask to what extent the different LD structures in the YRI versus the European GWAS affect colocalization... technically the colocalization model assumes same LD structure).
[ ] 5' and 3' PWM --> Splicing: Each intron limited to -3/+6 nt around the 5' or 3' splice site (separate runs for each). Run QTLTools in permutation cis mode with window = 1, to find sQTLs that fall exactly in the splice sites. Explore leafcutter and IR approach.
[ ] Pick a few interesting GWAS/molQTL colocalizations for follow-up experiments. Perhaps a unique chRNA colocalization for which we have a very high fine-mapping posterior from hyprcoloc, or that disrupt a splice site or something. Let's get a list of all those potential loci/candidate SNPs, and visually inspect the underlying data for a bunch of them, and then pick promising ones to consider for those experiments.
[ ] Fix normalization of chRNA expression of different RNA types. Merge featureCounts and get RPKM together. Then split.
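
For the binary-outcome GWAS item above, here is a minimal sketch of the odds-ratio conversion, assuming the summary stats report an OR with a 95% confidence interval that is symmetric on the log scale. The function name and the example numbers are made up for illustration, not taken from any of our GWAS files.

```python
import math

def or_to_beta_se(odds_ratio, ci_lower, ci_upper, z=1.959964):
    """Convert an odds ratio and its confidence interval to a log-odds beta and SE.

    Assumes a Wald-type interval, i.e. symmetric around log(OR);
    z is the normal quantile for the interval level (about 1.96 for 95%).
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_upper) - math.log(ci_lower)) / (2 * z)
    return beta, se

# Hypothetical summary-stat row: OR = 1.20, 95% CI 1.10 to 1.31
beta, se = or_to_beta_se(1.20, 1.10, 1.31)
```

If a GWAS reports only an OR and a p-value, this will not work as written; the SE would have to be backed out from the z-score instead.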

bfairkun commented 2 years ago

For the 5' and 3'ss PWM and sQTL identification... Here is more detail on what I envision:

Eventually I envision using those sQTL beta estimates along with hQTL beta estimates for the signal around say 200bp flanking the splice site. To reiterate, the goal I envision is to estimate a genome-wide beta effect for splicing-->chromatin. I'm not exactly sure, but I think for that analysis it might be ideal to have, for each sQTL splice site, a single sQTL beta and a single hQTL beta. This means that we will probably want to quantify splicing slightly differently. I envision the following protocol:

Get vcf of variants around observed splice sites (including un-annotated introns... can use the existing leafcutter count tables to get these)
Quantify splicing in a splice-site centric way... Start with leafcutter chRNA intron excision ratios and for each 5'ss with a variant, sum up the intron excision ratio for all introns that use that 5'ss.
It would also be useful to quantify splicing using intron retention estimates, since this is somewhat orthogonal and in chRNA it might give us access to more introns that might not have apparent/strong effects using splice junction reads alone.
Do same phenotype normalization procedure we have been using. Use QTLtools to estimate beta with cis-window=0. Estimate FDR. Consider keeping only betas for QTLs with FDR<10%... I'm not actually certain if keeping only FDR<10% is a good idea, since performing testing and also using the betas is sort of double-dipping the data, and might lead to over-estimates of effect sizes (winner's curse effect), which may bias the estimate of splicing-->chromatin. Granted, even if we get an "unbiased" beta, the meaning of the magnitude will be hard for any reader to really grasp, so I will be very satisfied just to know the sign of beta, so I am also open to just using FDR<10% sQTLs, but I think we should make that decision after exploring the data... For example, looking at the strength of the PWM vs sQTL beta correlation.
Then for each chromatin mark (h3k36me3, h3k4me3, h3k27ac, h3k4me1) quantify signal in a window of about 200bp around the splice site (let's pick a reasonable window based on the distribution of counts in various window sizes, eg 100/200/500bp, but keeping in mind that I expect the splicing effects on chromatin to be limited to a few hundred bases).
Use the delta-PWM, sQTL beta, and chromatin beta to do an instrumental variable estimation. Here is the super simple explanation with intuition that basically sums up my knowledge of how we would do it... I think it would be super useful to test whether our splicing-->chromatin estimate significantly deviates from 0 and in which direction.
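
As a rough illustration of that last step, here is a sketch of one common summary-statistic IV estimator: the fixed-effect inverse-variance-weighted (IVW) estimate used in two-sample Mendelian randomization. This is an assumption about how the analysis could be done, not necessarily the method in the linked explanation. It only uses the per-SNP sQTL and hQTL betas; how the delta-PWM enters (for example, to select or orient SNPs) is left open. All names and numbers are hypothetical.

```python
import math

def ivw_estimate(exposure_betas, outcome_betas, outcome_ses):
    """Pooled IV estimate of exposure --> outcome from per-SNP summary statistics.

    Weighted regression of outcome betas on exposure betas through the origin,
    with weights 1 / se_outcome^2. Noise in the exposure betas is ignored.
    Returns (estimate, standard error, z-score).
    """
    num = 0.0
    den = 0.0
    for b_x, b_y, se_y in zip(exposure_betas, outcome_betas, outcome_ses):
        w = 1.0 / se_y ** 2
        num += w * b_x * b_y
        den += w * b_x ** 2
    estimate = num / den
    se = math.sqrt(1.0 / den)
    return estimate, se, estimate / se

# Toy data (made up): hQTL betas are roughly 0.5 x the sQTL betas
sqtl_beta = [0.8, -0.5, 1.2, -0.9]
hqtl_beta = [0.41, -0.24, 0.62, -0.44]
hqtl_se = [0.1, 0.1, 0.1, 0.1]
est, est_se, z = ivw_estimate(sqtl_beta, hqtl_beta, hqtl_se)
```

The sign of `est` and whether `z` is far from 0 address the question above: does the splicing-->chromatin estimate deviate from 0, and in which direction.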

bfairkun commented 2 years ago

Process TF QTLs from Tehranchi... get TF QTL SNPs, with delta PWMs for those that are over known motifs.

For this I meant to eventually do a similar instrumental variable estimation procedure to test TF-->splicing and TF-->chromatin effects. Tehranchi et al published a supplemental table with effect sizes and p-values for TF-QTLs, which we will have to lift over to hg38. We then identify the TF-QTLs that we believe have direct effects on TF binding because they intersect the TF motif. Here is the resource I was thinking of grabbing motifs from; then I can take the TF-QTLs and search the underlying genome sequence (both ref and alt alleles) for motif matches (fimo software from meme-suite can do this, or the biopython motifs library can do it as well, in which case the pysam python library might be useful for efficiently getting reference genome sequence by position). Consider matches for further analysis to get TF beta, and also get sQTL beta (for introns/splice sites within some reasonably sized window), and chromatin beta, and do a similar IV analysis.
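
To make the ref/alt motif comparison concrete, here is a small pure-Python sketch of a delta-PWM calculation. It is not fimo or biopython; it assumes the motif is already a log-odds matrix (one dict per position), handles single-nucleotide variants only, and scans the forward strand only. The function names and the toy matrix are made up.

```python
def best_motif_score(seq, pwm, snp_index):
    """Best log-odds score among forward-strand windows that overlap snp_index."""
    width = len(pwm)
    best = None
    first = max(0, snp_index - width + 1)
    last = min(snp_index, len(seq) - width)
    for start in range(first, last + 1):
        score = sum(pwm[i][seq[start + i]] for i in range(width))
        if best is None or score > best:
            best = score
    return best

def delta_pwm(ref_seq, snp_index, alt_allele, pwm):
    """Alt-minus-ref difference in best motif score for a single-base change."""
    alt_seq = ref_seq[:snp_index] + alt_allele + ref_seq[snp_index + 1:]
    return best_motif_score(alt_seq, pwm, snp_index) - best_motif_score(ref_seq, pwm, snp_index)

# Toy 3-bp log-odds matrix (made up) that prefers "TGA"
toy_pwm = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 1.5},
    {"A": -1.0, "C": -1.0, "G": 1.5, "T": -1.0},
    {"A": 1.5, "C": -1.0, "G": -1.0, "T": -1.0},
]
# Reference context "CCTGACC"; a G>C change at index 3 weakens the "TGA" match
d = delta_pwm("CCTGACC", 3, "C", toy_pwm)
```

A negative `d` means the alt allele weakens the best match. A real run would also need the reverse strand and a score threshold to decide what counts as a motif match in the first place.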

bfairkun commented 2 years ago

publication quality figures for some figures to describe how chRNA is different... Carlos' heat map... How much enrichment for NMD junctions, how much more coverage over introns, etc.

I like your heat map exactly as it was, except that we need to remake it with the new data. I think the colors should be mapped to something interpretable like logRPKM so we can compare expression across classes of genes (eg ncRNA), perhaps with a separate color scale for each class if that is more visually appealing, since the different classes may have very different expression levels.
I have a bunch of other simple plots that I have previously made that I will remake with the new chRNA data: Quantifying the fraction of intronic reads, the fraction of splice junction reads mapping uniquely to an annotated NMD isoform (to demonstrate that chRNA captures NMD isoforms more efficiently)

bfairkun commented 2 years ago

gene-wise or exon-wise, intron-wise meta plots to summarise assays... for example, to verify H3K36me3 is over gene bodies and also enriched over included exons.

I was envisioning giving readers a general sense of what these assays measure by making plots with deepTools, to show some characteristic patterns.

bfairkun commented 2 years ago

Pick a few interesting GWAS/molQTL colocalizations for follow-up experiments. Perhaps a unique chRNA colocalization for which we have a very high fine-mapping posterior from hyprcoloc, or that disrupt a splice site or something. Let's get a list of all those potential loci/candidate SNPs, and visually inspect the underlying data for a bunch of them, and then pick promising ones to consider for those experiments.

The hyprcoloc results table has a column for the top fine-mapped SNP, and another column for the fraction of variability explained by that SNP (which according to the hyprcoloc vignette is like a multi-trait fine-mapping probability). Once we add more gwas to the coloc analyses, let's sift through these variables and pick some candidates for potentially following up. I think the ideal SNP would have a high fine-mapping probability, and have a molecular effect we can easily measure in a directed assay (eg qPCR to measure expression, as opposed to splicing for a rare junction that might be harder to measure). Also ideal would be one that demonstrates some of the new things we measure with chRNA (eg ncRNA) and has an interesting story (eg, maybe a well-mapped effect within a ncRNA that also colocalizes with an eQTL (but not enhancer QTL) for a nearby gene, suggesting cis-regulation of the gene by a ncRNA).

cfbuenabadn commented 2 years ago

Get vcf of variants around observed splice sites (including un-annotated introns... can use the existing leafcutter count tables to get these)

This can be with SNPs falling -3 to +6 bp from the splicing site. An alternative way to do it could be to determine this region in the phenotype bed file we use for QTLTools. E.g., if we have an intron chr10:1000-2000:+, we set the location chr10:990-1010 in the bed file, with a cis window size of 0. We do the same for 3'ss. In general, I think it'd be more tractable than modifying the vcf file.

We will do it this way, without modifying the VCF files.
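
A minimal sketch of how those phenotype coordinates could be derived per intron, taking strand into account. It assumes 0-based, half-open intron coordinates and the -3/+6 offsets mentioned above; the function name is made up, and the offsets are parameters so other window sizes can be tried.

```python
def splice_site_window(start, end, strand, site, upstream=3, downstream=6):
    """Return (window_start, window_end) around the 5'ss or 3'ss of one intron.

    Assumes 0-based, half-open intron coordinates (BED-style). `upstream` and
    `downstream` are counted in the direction of transcription.
    """
    if site not in ("5ss", "3ss"):
        raise ValueError("site must be '5ss' or '3ss'")
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    # The 5'ss sits at the intron start on '+' and at the intron end on '-';
    # the 3'ss is the opposite.
    at_start = (site == "5ss") == (strand == "+")
    boundary = start if at_start else end
    if strand == "+":
        return boundary - upstream, boundary + downstream
    return boundary - downstream, boundary + upstream

# Intron chr10:1000-2000 on each strand
plus_5ss = splice_site_window(1000, 2000, "+", "5ss")
minus_5ss = splice_site_window(1000, 2000, "-", "5ss")
```

The returned pair would go in the start/end columns of the phenotype bed row for that splice site, with the 3'ss handled in a separate run.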

Quantify splicing in a splice-site centric way... Start with leafcutter chRNA intron excision ratios and for each 5'ss with a variant, sum up the intron excision ratio for all introns that use that 5'ss. It would also be useful to quantify splicing using intron retention estimates, since this is somewhat orthogonal and in chRNA it might give us access to more introns that might not have apparent/strong effects using splice junction reads alone.

We will explore two approaches:

Summarize 5'ss (repeat for 3'ss) for each intron cluster from leafcutter. I.e., for each 5'ss, sum the percent usage of all the introns in the cluster that use that 5'ss. In case of a cluster that only has alternative 3'ss usage, we'll consider it a constitutive 5'ss and not include it in the analysis.
Same analysis in the proximity of 5'ss/3'ss, but with (IR RPKM / gene RPKM) data, instead of splice junction usage.
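
The first approach could be sketched roughly like this for a single sample. The input layout (tuples of cluster, chrom, start, end, strand, ratio) is a hypothetical simplification of the leafcutter tables, and the function name is made up.

```python
from collections import defaultdict

def five_prime_usage(introns):
    """Sum intron excision ratios per 5'ss within each cluster, for one sample.

    `introns` is an iterable of (cluster, chrom, start, end, strand, ratio).
    Clusters in which every intron shares a single 5'ss are dropped, since
    that 5'ss is constitutive within the cluster.
    """
    usage = defaultdict(float)
    sites_per_cluster = defaultdict(set)
    for cluster, chrom, start, end, strand, ratio in introns:
        # The 5'ss is the intron start on '+' and the intron end on '-'
        donor = start if strand == "+" else end
        usage[(cluster, chrom, donor, strand)] += ratio
        sites_per_cluster[cluster].add(donor)
    return {k: v for k, v in usage.items() if len(sites_per_cluster[k[0]]) > 1}

# Toy clusters (made up): clu_1 has two 5'ss, clu_2 has only alternative 3'ss
toy_introns = [
    ("clu_1", "chr1", 100, 500, "+", 0.6),
    ("clu_1", "chr1", 100, 700, "+", 0.1),
    ("clu_1", "chr1", 200, 700, "+", 0.3),
    ("clu_2", "chr1", 1000, 1500, "+", 0.8),
    ("clu_2", "chr1", 1000, 1800, "+", 0.2),
]
usage_5ss = five_prime_usage(toy_introns)
```

For the 3'ss version, swap which intron end is used as the site. The IR-based approach would replace `ratio` with IR RPKM / gene RPKM for the intron next to each site.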

bfairkun / ChromatinSplicingQTLs

to do list #6