Closed akotlar closed 4 months ago
Notes on "Add automated report generation for the highly consequential variants (with far fewer annotations"
Needs to be easily digestible - either a saved subset of the full,
This is an ambitious list. If they roll over, they roll over to the next sprint
Deliverable that we're aiming at over next 2 sprints: get the work/results in Erik Johnson's and Thomas Wingo's hands.
Nothing was achieved; all work rolls over. @akotlar will take over until Cristina is back, best effort. Expecting that the initial PRS solution is done by the end of Sprint 11, so a delay of 3 weeks.
PRS excitement is high from Dave Cutler, Elizabeth Leslie's group (potentially, as informed by Julien, her lead bioinformatic analyst), and IBDGC.
Further improvements on hold with the possible exception of migrating from zip file downloads to either tar downloads, an improved/fixed zip download, or individual file downloads rather than zipping
Use our own code more often. Example: initialization scheme turned out to be critical for supervised PPCA; upload system was undercooked; bug was introduced that prevented jobs from being marked failed leading to "stuck" jobs.
as part of this @akotlar will sit with more users
Austin:
Domain adaptation needs complete data, so we need to impute missing values; the 330 TMT/SomaScan data has relatively large missingness. Our Soft Impute CV module does well, giving around 70% variance explained in imputed data.
Nicole (Austin's wife) is a proteomicist; her thesis was on 6 samples. She separated proteins into transmembrane vs non-transmembrane and dropped missing values.
POE & Domain Adaptation: We need matrix methods; deep neural networks aren't the best bet. He has been working on the fact that empirical covariance matrix estimation is not good. Ilha will try 15 or so covariance matrix estimation methods. You take a whole bunch of experiments, create a mapping to a common mean and covariance matrix, then future experiments can also be projected into that space.
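For reference on the imputation note above, a minimal NumPy sketch of Soft-Impute-style matrix completion (iterative soft-thresholded SVD). This is not our Soft Impute CV module: the function name, the hand-picked `lam`, and the toy low-rank data are all assumptions for illustration. A CV version would pick `lam` by holding out observed entries, which is what the toy check at the bottom imitates for a single value.

```python
import numpy as np

def soft_impute(X, lam, n_iters=100, tol=1e-5):
    """Fill NaNs in X by iterating a soft-thresholded SVD (Soft-Impute style)."""
    mask = ~np.isnan(X)                 # True where a value was observed
    Z = np.where(mask, X, 0.0)          # start with zeros in the missing cells
    for _ in range(n_iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        s = np.maximum(s - lam, 0.0)    # soft-threshold the singular values
        low_rank = (U * s) @ Vt
        Z_new = np.where(mask, X, low_rank)   # keep observed, update missing
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return Z

# Toy check on synthetic rank-5 data (hypothetical, not the TMT/SomaScan data).
rng = np.random.default_rng(0)
truth = rng.normal(size=(60, 5)) @ rng.normal(size=(5, 40))
X = truth.copy()
held_out = rng.random(X.shape) < 0.2    # hide roughly 20% of the entries
X[held_out] = np.nan

imputed = soft_impute(X, lam=10.0)
resid = truth[held_out] - imputed[held_out]
var_explained = 1.0 - resid.var() / truth[held_out].var()
print(f"variance explained on held-out entries: {var_explained:.2f}")
```

Observed entries are left untouched; only the missing cells are replaced by the low-rank reconstruction.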
He is also looking at singular value shrinkage methods.
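A rough sketch of the "map every experiment to a common mean and covariance" idea, assuming the common target is mean zero and identity covariance (whitening). That target, the function names, and the ridge term are assumptions; the empirical covariance used here is exactly the weak estimator discussed above and would be swapped for a better one.

```python
import numpy as np

def fit_domain_map(X, ridge=1e-3):
    """Estimate the mean and a whitening matrix for one experiment (rows = samples)."""
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + ridge * np.eye(X.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # symmetric inverse square root
    return mu, W

def to_common_space(X, mu, W):
    """Project samples into the shared space (mean 0, covariance close to identity)."""
    return (X - mu) @ W

# Toy experiment with its own offsets and scales (synthetic data).
rng = np.random.default_rng(0)
batch = rng.normal(size=(200, 5)) * np.array([1, 2, 3, 4, 5]) + np.array([10, -3, 0, 2, 7])
mu, W = fit_domain_map(batch)
mapped = to_common_space(batch, mu, W)

# New samples from the same experiment reuse the fitted map.
new_samples = rng.normal(size=(3, 5)) * np.array([1, 2, 3, 4, 5]) + np.array([10, -3, 0, 2, 7])
new_mapped = to_common_space(new_samples, mu, W)
```

Each experiment would get its own `(mu, W)`, so all of them land in the same space and can be compared or pooled there.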
Why he did PPCA: You could put in an option to either plot to the first 2 principal components or the first 2 that have nothing to do with race. This would be useful for Erik Johnson's denoising work.
Domain Adaptation:
ProteomicsPipelineDemonstration.ipynb.zip
Dennis got blocked by annotator installation (to create dev instance); running into installation issues, which are being documented and fixed.
Ilha is working this week on covariance estimation methods:
Alex - on track for proteomics data; initial analysis on 300 sample CSF TMT + SomaScan, then 400 and 900 sample datasets that Thomas/Nick shared.
Austin - will share the Jupyter notebook demonstrating SPPCA on neuroscience data.
Austin/POE: People have created hypothesis testing for detecting spikes in isotropic covariance matrices. We whiten homozygotes, apply to heterozygotes. We will implement a hypothesis test for detecting a single spike; we know that after you whiten heterozygotes, your covariance matrix will be isotropic with a single spike. This will result in a call and p-value. Then we will focus on singular value shrinkers that give good estimates.
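A minimal sketch of what the single-spike test could look like, assuming a Monte Carlo null instead of an analytic one: whiten with the reference group's covariance, take the top eigenvalue of the whitened sample covariance, and compare it to top eigenvalues simulated from isotropic Gaussian data of the same shape. Function names and the toy data are hypothetical, and the simulated null ignores the error in the estimated reference covariance, so treat the p-value as approximate.

```python
import numpy as np

def top_eigenvalue(X):
    """Largest eigenvalue of the sample covariance of X (rows are samples)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    return s[0] ** 2 / (X.shape[0] - 1)

def whiten_with_reference(X, X_ref, ridge=1e-6):
    """Whiten X using the mean and covariance estimated from the reference group."""
    mu = X_ref.mean(axis=0)
    cov = np.cov(X_ref, rowvar=False) + ridge * np.eye(X_ref.shape[1])
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T
    return (X - mu) @ W

def spike_test(X_white, n_null=500, seed=0):
    """Return (top eigenvalue, Monte Carlo p-value) against an isotropic null."""
    rng = np.random.default_rng(seed)
    n, p = X_white.shape
    observed = top_eigenvalue(X_white)
    null = np.array([top_eigenvalue(rng.normal(size=(n, p))) for _ in range(n_null)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_null)
    return observed, p_value

# Synthetic stand-ins for the two groups; the second has one extra direction of variance.
rng = np.random.default_rng(1)
p = 20
scales = np.linspace(0.5, 2.0, p)
hom = rng.normal(size=(500, p)) * scales
het = rng.normal(size=(100, p)) * scales
het[:, 0] += 5.0 * rng.normal(size=100)      # the "spike"

het_white = whiten_with_reference(het, hom)
top, p_value = spike_test(het_white)
print(f"top eigenvalue {top:.1f}, p-value {p_value:.4f}")
```

The call would be "spike present" when the p-value falls below a chosen threshold; the smallest reportable p-value is 1 / (n_null + 1).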
Austin is trying to prove rare variant analysis is inherently impossible outside Mendelian traits. He is showing that if you have many rare variants and bound their effects (in terms of P(Disease|variant)), then when having any mutation has a tiny effect, the population variance in having disease goes to 0, which is to say everyone has identical risk of disease.
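One way to see the shape of that argument, under a simplified additive model that is an assumption here and not necessarily Austin's setup: let individual risk be $r = r_0 + \sum_i \delta_i x_i$, where $x_i \in \{0, 1\}$ are independent carrier indicators with frequency $f_i$, and every effect is bounded by $|\delta_i| \le \delta$.

```latex
\operatorname{Var}(r)
  = \sum_i \delta_i^2 \operatorname{Var}(x_i)
  = \sum_i \delta_i^2 f_i (1 - f_i)
  \le \delta^2 \sum_i f_i
```

So if the total carrier frequency $\sum_i f_i$ stays bounded while the effect bound $\delta$ shrinks, $\operatorname{Var}(r) \to 0$ and everyone's risk collapses to roughly the same value.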
Austin - working on NeurIPS paper
Dennis - wrapping up installation guide
Cristina - PR'ing PRS today
Ilha - close to completing the singular value shrinkers; working on operator norm shrinker that is well suited for large n; genotyping data will use the non-negative covariance matrix estimator
Alex - Got the 300 sample data decompressed (required 7zip to avoid the "corruption" and refusal to decompress). In contact with Eric Dammers, who has explained what the files mean (same naming scheme as the olink/tmt/somascan paper)
Singular value shrinkers - still WIP; working on a version that handles any sample size
PRS - on track
Proteomics - behind a few days but will come back on track
Infrastructure - CVXPY & scikit-allel in particular presented issues during install on Arm Mac; need to follow up and find a solution (according to https://github.com/cvxpy/cvxpy/issues/2075 this is now resolved)
Due date for Sprint 11 - May 16th.
General
Proteomics
Datasets: https://www.synapse.org/#!Synapse:syn53420674.1/datasets/, https://www.synapse.org/#!Synapse:syn31822992/wiki/617907
PRS
Goal for Sprint 11: Have a PRS C+T running through the webapp (with display of results in webapp potentially in sprint 12)
Provisional for Sprint 13: Have this deployed to IBDGC (we'll need information from them for what they'll find useful in terms of GWAS summary stats)
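For anyone new to C+T (clumping and thresholding), a toy sketch of the idea: keep variants below a p-value threshold, greedily drop any variant in high LD with an already-kept, more significant one, then score each individual as the beta-weighted sum of dosages. This is not the code going into the PR; the function names, thresholds, and the dense LD matrix are assumptions (real clumping works within genomic windows against an LD reference).

```python
import numpy as np

def clump_and_threshold(pvals, ld_r2, p_thresh=5e-8, r2_thresh=0.1):
    """Return indices of variants kept after p-value thresholding and greedy LD clumping."""
    # Visit variants from most to least significant, skipping those above the threshold.
    candidates = [i for i in np.argsort(pvals) if pvals[i] <= p_thresh]
    kept = []
    for i in candidates:
        # Keep the variant only if it is not in LD with any already-kept index variant.
        if all(ld_r2[i, j] < r2_thresh for j in kept):
            kept.append(i)
    return np.array(kept, dtype=int)

def prs_scores(dosages, betas, kept):
    """Score = sum over kept variants of beta * dosage, one value per individual."""
    return dosages[:, kept] @ betas[kept]

# Four hypothetical variants; variants 0 and 1 are in strong LD.
pvals = np.array([1e-10, 1e-9, 0.2, 1e-8])
betas = np.array([0.5, 0.4, 0.1, -0.2])
ld_r2 = np.eye(4)
ld_r2[0, 1] = ld_r2[1, 0] = 0.9
dosages = np.array([[2.0, 1.0, 0.0, 1.0],
                    [0.0, 2.0, 2.0, 0.0]])

kept = clump_and_threshold(pvals, ld_r2)
scores = prs_scores(dosages, betas, kept)
```

Because scoring is a plain matrix product over individuals, the dosage matrix can be processed in row batches without changing the result.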
[x] Add back in AD GWAS summary statistics for hg19 - 2024/05/06 for PR
[x] Add in LD map for clumping in hg38 - @cristinaetrv - 2024/05/06 for PR
[x] Optimizing PRS C+T for performance - @cristinaetrv - 2024/05/06 for PR
[ ] (sprint 12) Need annotation for ancestry in AD stat summary - @cristinaetrv
[ ] Add batch processing for PRS C+T workflow with dosage matrix for memory issues - @cristinaetrv - 2024/05/08 for PR
[ ] Automatically launch PRS after ancestry from API server - @akotlar - 2024/05/16 for PR
[ ] (stretch) Take in ancestry PCs as PRS-CS covariates @austinTalbot7241993 @akotlar - April 24
[ ] (stretch) Take in top hit from ancestry, convert to superpop, connect to LD map for corresponding pop for LD clump / expectation is that we will at least have this in progress in sprint 11 @cristinaetrv - 2024/05/16
[ ] (stretch) Add readme for AD GWAS sum stats @cristinaetrv - 2024/05/16
[ ] (stretch) Display basic PRS results in webapp (table with individuals and their score) - @akotlar - 2024/05/16
[ ] (sprint 12) Finish PRS-CS standard way without Langevin Dynamics @austinTalbot7241993
[ ] (sprint 12) Weight PRS scores by gnomAD allele frequencies for specific ancestries and the corresponding ancestry probability:
Beta*dosage - 2*( sum_over_superpop_ancestry { maf_gnomad_in_ancestry * p_ancestry } )
- TBD. Important to IBDGC (and likely other consortiums).
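A small sketch of the ancestry-weighted expression in the sprint 12 item above. It reads the expression as centering each dosage by its expected value, i.e. beta * (dosage - 2 * sum over superpops of maf * p_ancestry); that parenthesization is an assumption, since the note as written applies beta to the dosage only. It also assumes the gnomAD frequency refers to the same allele the dosage counts. All names and numbers are hypothetical.

```python
import numpy as np

def ancestry_centered_prs(dosages, betas, maf_by_pop, ancestry_probs):
    """
    dosages:        (n_samples, n_snps) allele dosages in [0, 2]
    betas:          (n_snps,) effect sizes
    maf_by_pop:     (n_pops, n_snps) allele frequency per superpopulation
    ancestry_probs: (n_samples, n_pops) ancestry probabilities, rows sum to 1
    """
    # Expected dosage per individual and variant under their ancestry mixture.
    expected_dosage = 2.0 * ancestry_probs @ maf_by_pop
    return ((dosages - expected_dosage) * betas).sum(axis=1)

# One individual, two variants, two superpopulations.
dosages = np.array([[2.0, 0.0]])
betas = np.array([1.0, 0.5])
maf_by_pop = np.array([[0.1, 0.4],
                       [0.3, 0.2]])
ancestry_probs = np.array([[0.5, 0.5]])

centered_scores = ancestry_centered_prs(dosages, betas, maf_by_pop, ancestry_probs)
```

Here the mixture frequencies are 0.2 and 0.3, so the expected dosages are 0.4 and 0.6 and the score is 1.0 * 1.6 + 0.5 * (-0.6) = 1.3.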
Covariance Matrix Estimation
Overall goal: improve network analysis, regressions, clustering, and anything else that relies on a covariance matrix. The empirical covariance matrix is not a good estimator, especially at small sample sizes.
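To illustrate why the empirical estimator struggles when samples are scarce, a toy comparison against simple linear shrinkage toward a scaled identity. The fixed `alpha` is an assumption chosen by hand for the demo; it is not one of the estimators being evaluated, which would choose the shrinkage from the data.

```python
import numpy as np

def linear_shrinkage(X, alpha):
    """Blend the empirical covariance with a scaled identity: (1 - alpha) * S + alpha * mu * I."""
    S = np.cov(X, rowvar=False)
    mu = np.trace(S) / S.shape[0]      # average variance sets the identity scale
    return (1.0 - alpha) * S + alpha * mu * np.eye(S.shape[0])

# Fewer samples than features, true covariance is the identity (synthetic data).
rng = np.random.default_rng(0)
n, p = 20, 50
X = rng.normal(size=(n, p))

S = np.cov(X, rowvar=False)
shrunk = linear_shrinkage(X, alpha=0.5)

err_emp = np.linalg.norm(S - np.eye(p))          # Frobenius error of the empirical estimate
err_shrunk = np.linalg.norm(shrunk - np.eye(p))  # Frobenius error after shrinkage
min_eig_emp = np.linalg.eigvalsh(S).min()        # near zero: S is rank-deficient when n < p
min_eig_shrunk = np.linalg.eigvalsh(shrunk).min()
print(f"error empirical {err_emp:.2f}, shrunk {err_shrunk:.2f}")
```

With n < p the empirical covariance is singular, so anything that needs its inverse (regressions, partial correlations for networks) breaks, while the shrunk estimate stays positive definite.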
Infrastructure
Post IBDGC Tasks