bystrogenomics / bystro

Natural Language Search and Analysis of High Dimensional Genomic Data
Apache License 2.0

Sprint 11 Task List #456

Closed akotlar closed 4 months ago

akotlar commented 6 months ago

Due date for Sprint 11 - May 16th.

General

Proteomics

Datasets: https://www.synapse.org/#!Synapse:syn53420674.1/datasets/, https://www.synapse.org/#!Synapse:syn31822992/wiki/617907

PRS

Goal for Sprint 11: Have a PRS C+T (clumping and thresholding) running through the webapp, with display of results in the webapp potentially in Sprint 12.

Provisional for Sprint 13: Have this deployed to IBDGC. We'll need information from them on which GWAS summary stats they would find useful.

Important to IBDGC (and likely other consortia).
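For orientation, here is a minimal sketch of what a C+T score computes. It is not Bystro's implementation. The function names, the dictionary-based LD lookup, and the threshold values are all hypothetical choices for illustration. Variants are filtered by GWAS p-value and greedily clumped by pairwise r² against already-kept index variants. The score is then the sum of effect size times allele dosage over the kept variants.

```python
from typing import Dict, List, Tuple


def clump_and_threshold(
    pvalues: Dict[str, float],
    ld_r2: Dict[Tuple[str, str], float],
    p_threshold: float = 5e-8,   # illustrative default, not a recommendation
    r2_threshold: float = 0.1,   # illustrative default, not a recommendation
) -> List[str]:
    """Return the variants kept after p-value thresholding and greedy LD clumping."""
    # Thresholding: keep variants at or below the p-value cutoff, most significant first.
    candidates = sorted(
        (v for v, p in pvalues.items() if p <= p_threshold),
        key=lambda v: pvalues[v],
    )
    kept: List[str] = []
    for variant in candidates:
        # Clumping: drop a variant if it is in LD with any already-kept index variant.
        # Pairs missing from the lookup are assumed to be unlinked (r2 = 0).
        in_ld = any(
            ld_r2.get((variant, k), ld_r2.get((k, variant), 0.0)) >= r2_threshold
            for k in kept
        )
        if not in_ld:
            kept.append(variant)
    return kept


def polygenic_score(
    dosages: Dict[str, float], betas: Dict[str, float], variants: List[str]
) -> float:
    """Sum of effect size times dosage; a missing dosage is treated as 0 (an assumption)."""
    return sum(betas[v] * dosages.get(v, 0.0) for v in variants)


# Toy data (made up for illustration).
pvalues = {"rs1": 1e-10, "rs2": 1e-9, "rs3": 1e-8, "rs4": 0.2}
betas = {"rs1": 0.5, "rs2": 0.4, "rs3": -0.2, "rs4": 0.1}
ld_r2 = {("rs1", "rs2"): 0.8}  # rs2 is in LD with the stronger signal rs1

kept = clump_and_threshold(pvalues, ld_r2)  # rs2 is clumped away, rs4 fails the p cutoff
score = polygenic_score({"rs1": 2, "rs2": 1, "rs3": 1}, betas, kept)
```

A production pipeline would compute LD from a reference panel within a genomic window rather than from a lookup table, and would usually scan several p-value thresholds.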

Covariance Matrix Estimation

Overall goal: improve network analysis, regressions, clustering, and anything else that relies on a covariance matrix. The empirical covariance matrix is not a good estimator, especially at small sample sizes.
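A small numerical sketch of the problem, assuming NumPy. With fewer samples than features the empirical covariance is singular, while even a simple linear shrinkage toward a scaled identity is well conditioned and closer to the truth. The fixed shrinkage weight `alpha` is a hypothetical choice for illustration (data-driven estimators such as Ledoit-Wolf choose it from the data). This is not the estimator being developed in this sprint.

```python
import numpy as np


def empirical_covariance(X: np.ndarray) -> np.ndarray:
    """Sample covariance of the rows of X (samples x features)."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc.T @ Xc / (X.shape[0] - 1)


def linear_shrinkage(S: np.ndarray, alpha: float) -> np.ndarray:
    """Convex combination of S and a scaled identity with the same trace."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target


rng = np.random.default_rng(0)
n, p = 20, 50                      # fewer samples than features
X = rng.standard_normal((n, p))    # true covariance is the identity

S = empirical_covariance(X)
S_shrunk = linear_shrinkage(S, alpha=0.5)  # alpha fixed by hand for illustration

# Frobenius-norm error against the known truth, and smallest eigenvalues.
err_emp = np.linalg.norm(S - np.eye(p), "fro")
err_shrunk = np.linalg.norm(S_shrunk - np.eye(p), "fro")
min_eig_emp = np.linalg.eigvalsh(S).min()
min_eig_shrunk = np.linalg.eigvalsh(S_shrunk).min()
```

Here `min_eig_emp` is numerically zero, so `S` cannot be inverted for regression or network analysis, whereas `S_shrunk` is strictly positive definite.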

Infrastructure

Post IBDGC Tasks

cristinaetrv commented 6 months ago

Notes on "Add automated report generation for the highly consequential variants (with far fewer annotations)"

Needs to be easily digestible - either a saved subset of the full,

akotlar commented 5 months ago

2024-04-19 Sprint 10 Retro

Overall what has been accomplished:

Summary for Sprint 11 Work

Proteomics Statistical Methods

  1. Write up summary of performance of domain adaptation (with soft impute) vs TAMPOR, or domain adaptation followed by TAMPOR.
  2. Run network analysis on TMT and SomaScan data
  3. Run QTL analysis
  4. Explore Stanford technique for improved logistic regression performance via matrix decomposition

This is an ambitious list. Any items that are not finished roll over to the next sprint.

Deliverable that we're aiming at over next 2 sprints: get the work/results in Erik Johnson's and Thomas Wingo's hands.

Proteomics API

  1. SomaScan support (API upload is in)
  2. Improved filtering api function will roll over @akotlar
  3. Re-introduce file labeling @dlin30
  4. Re-introduce FragPipe support @dlin30
  5. Add SomaScan upload support @dlin30
  6. API endpoint support for filtering will roll over @dlin30
    • This involves making a submission plugin for proteomic filtering (and a listener on the Bystro side). Goal: by the end of the sprint, the Bystro protein filtering API can be used from a machine that is not on the cluster, with the API command routed through the Bystro API server @dlin30

PRS

Nothing was achieved; all work rolls over. @akotlar will take over until Cristina is back, on a best-effort basis. We expect the initial PRS solution to be done by the end of Sprint 11, which is a delay of 3 weeks.

PRS excitement is high from Dave Cutler, Elizabeth Leslie's group (potentially, as informed by Julien, her lead bioinformatic analyst), and IBDGC.

Infrastructure and bystro webapp

Further improvements on hold with the possible exception of migrating from zip file downloads to either tar downloads, an improved/fixed zip download, or individual file downloads rather than zipping

What went well

What didn't

What is one thing that we will do differently this sprint?

akotlar commented 5 months ago

2024-04-23

Proteomics Topic Meeting

Austin:

akotlar commented 5 months ago

2024-04-30

Proteomics Topic Meeting

Domain Adaptation:

akotlar commented 5 months ago

2024-05-03

Proteomics topic meeting

ProteomicsPipelineDemonstration.ipynb.zip

akotlar commented 5 months ago

2024-05-07

Proteomics Topic Meeting

Dennis got blocked by annotator installation (to create dev instance); running into installation issues, which are being documented and fixed.

Ilha is working this week on covariance estimation methods:

Alex - on track for proteomics data; initial analysis on 300 sample CSF TMT + SomaScan, then 400 and 900 sample datasets that Thomas/Nick shared.

Austin - will share the Jupyter notebook demonstrating SPPCA on neuroscience data.

Common variant topic meeting

Austin/POE: People have created hypothesis tests for detecting spikes in isotropic covariance matrices. We whiten homozygotes, apply to heterozygotes. We will implement a hypothesis test for detecting a single spike; we know that after you whiten heterozygotes, your covariance matrix will be isotropic with a single spike. This will result in a call and p-value. Then we will focus on singular value shrinkers that give good estimates.
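One way to picture such a test is sketched below, assuming NumPy. It is not the test the team plans to implement. It assumes that "whiten homozygotes, apply to heterozygotes" means estimating a whitening transform on one reference group and applying it to a target group. It also replaces an analytic null distribution with a Monte Carlo null built from isotropic Gaussian matrices of the same shape. All function names are hypothetical.

```python
import numpy as np


def whiten(target: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Whiten `target` using the mean and covariance estimated from `reference`.

    Assumes more reference samples than features, so the covariance is invertible.
    """
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T  # symmetric inverse square root
    return (target - mu) @ W


def top_eigenvalue(X: np.ndarray) -> float:
    """Largest eigenvalue of X^T X / n, computed from the top singular value."""
    s = np.linalg.svd(X, compute_uv=False)
    return float(s[0] ** 2 / X.shape[0])


def single_spike_test(X_white: np.ndarray, n_null: int = 500, seed: int = 0):
    """Monte Carlo test of H0: isotropic covariance, against a leading spike.

    Returns the observed top eigenvalue and a one-sided p-value.
    """
    rng = np.random.default_rng(seed)
    n, p = X_white.shape
    observed = top_eigenvalue(X_white)
    # Null distribution: top eigenvalue of pure isotropic noise of the same shape.
    null = np.array(
        [top_eigenvalue(rng.standard_normal((n, p))) for _ in range(n_null)]
    )
    # Add-one correction keeps the p-value strictly positive.
    p_value = (1 + int(np.sum(null >= observed))) / (n_null + 1)
    return observed, p_value
```

The simulated null ignores the extra noise introduced by estimating the whitening transform, so a real test would need to account for that, for example by simulating the whitening step as well.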

Rare variant topic meeting

Austin is trying to prove rare variant analysis is inherently impossible outside Mendelian traits. He is showing that if you have many rare variants, and bound their effects (in terms of P(Disease|variant))...when having any mutation has a tiny effect, the population variance in having disease goes to 0, which is to say everyone has identical risk for having disease.

akotlar commented 5 months ago

2024-05-08

Austin - working on NeurIPS paper.

Dennis - wrapping up installation guide.

Cristina - PR'ing PRS today.

Ilha - close to completing the singular value shrinkers; working on an operator norm shrinker that is well suited for large n; genotyping data will use the non-negative covariance matrix estimator.

Alex - got the 300 sample data decompressed (required 7zip to avoid the "corruption" and refusal to decompress). In communication with Eric Dammers, who has explained what the files mean (same naming scheme as the Olink/TMT/SomaScan paper).

akotlar commented 5 months ago

2024-05-10 Weekly Meeting

Agenda

Discussion

Singular value shrinkers - still WIP; working on a version that handles any sample size.

PRS - on track.

Proteomics - a few days behind, but expected to get back on track.

Infrastructure - CVXPY and scikit-allel in particular presented issues during install on Arm Mac; need to follow up and find a solution (according to https://github.com/cvxpy/cvxpy/issues/2075 the CVXPY issue is now resolved).