Due date for Sprint 11 - May 16th.

General

[x] Improve installation instructions so that the annotator can be installed without issue on Amazon Linux 2023 - @dlin30 - 2024/05/08

Proteomics

Datasets: https://www.synapse.org/#!Synapse:syn53420674.1/datasets/, https://www.synapse.org/#!Synapse:syn31822992/wiki/617907

[ ] Add support for somascan upload to bystro webapp @dlin30 - April 8/9 - Currently under review by @akotlar
[x] Add support for somascan upload through api - @akotlar - April 5th
[x] Jupyter notebook demonstrating adjusting for batch effects using domain adaptation and imputation on neuroscience data, and in simulation - @austinTalbot7241993
[x] Jupyter notebook demonstrating adjusting for batch effects using Domain adaptation and imputation on ~300 samples TMT + SomaScan data - @akotlar - 2024-05-08
[x] SPPCA version for low sample scenarios - @austinTalbot7241993 Ilha - 2024/05/08
[ ] Generate network analysis results using SPPCA on ~300 sample dataset @akotlar - 2024/05/15
[x] Improve protein abundance filtering using genetic data with the goal of 1) supporting any annotation features that Bystro outputs, (stretch) 2) outputting arrays of structs instead of struct of arrays for multi-field annotations that are requested @akotlar - 2024-05-10
[ ] API endpoint to submit filtering prot jobs @dlin30 - 2024/05/06 for initial PR, 2024/05/08 for merged PR
[x] Finish queue listener code v1 #434 @dlin30 - April 14
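
The "arrays of structs instead of struct of arrays" stretch goal in the filtering task above can be illustrated with a small sketch. The field names (`AF`, `id`) and the helper name are hypothetical and are not taken from Bystro's output schema; this only shows the shape change being described.

```python
def struct_of_arrays_to_array_of_structs(soa):
    """Convert {"field": [v1, v2, ...]} into [{"field": v1}, {"field": v2}, ...].

    Hypothetical helper for illustration; not part of the Bystro API.
    """
    lengths = {len(values) for values in soa.values()}
    if len(lengths) > 1:
        raise ValueError("all fields must have the same number of entries")
    fields = list(soa)
    # zip(*columns) walks the columns in lockstep, yielding one row per entry
    return [dict(zip(fields, row)) for row in zip(*soa.values())]


# Struct of arrays: one list per field (made-up values)
gnomad_soa = {"AF": [0.01, 0.2], "id": ["rs1", "rs2"]}

# Array of structs: one record per entry, keeping related values together
gnomad_aos = struct_of_arrays_to_array_of_structs(gnomad_soa)
```

The array-of-structs form keeps the values that belong to one entry together, which may be easier to filter on when several fields of a multi-field annotation are requested at once.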

PRS

Goal for Sprint 11: Have a PRS C+T running through the webapp (with display of results in webapp potentially in sprint 12).
Provisional for Sprint 13: Have this deployed to IBDGC (we'll need information from them for what they'll find useful in terms of GWAS summary stats).

[x] Add back in AD GWAS summary statistics for hg19 - 2024/05/06 for PR
[x] Add in LD map for clumping in hg38 - @cristinaetrv - 2024/05/06 for PR
[x] Optimizing PRS C+T for performance - @cristinaetrv - 2024/05/06 for PR
[ ] (sprint 12) Need annotation for ancestry in AD stat summary - @cristinaetrv
[ ] Add batch processing for PRS C+T workflow with dosage matrix for memory issues - @cristinaetrv - 2024/05/08 for PR
[ ] Automatically launch PRS after ancestry from API server - @akotlar - 2024/05/16 for PR
[ ] (stretch) Take in ancestry PCs as PRS-CS covariates @austinTalbot7241993 @akotlar - April 24
[ ] (stretch) Take in top hit from ancestry, convert to superpop, connect to LD map for corresponding pop for LD clump / expectation is that we will at least have this in progress in sprint 11 @cristinaetrv - 2024/05/16
[ ] (stretch) Add readme for AD GWAS sum stats @cristinaetrv - 2024/05/16
[ ] (stretch) Display basic PRS results in webapp (table with individuals and their score) - @akotlar - 2024/05/16
[ ] (sprint 12) Finish PRS-CS standard way without Langevin Dynamics @austinTalbot7241993
[ ] (sprint 12) Weigh PRS scores by gnomad allele frequencies for specific ancestries and the corresponding ancestry probability: Beta*dosage - 2*( sum_over_superpop_ancestry { maf_gnomad_in_ancestry * p_ancestry } ) - TBD

Important to IBDGC (and likely other consortiums).
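
A minimal sketch of the ancestry-weighted term from the sprint 12 item above. The formula as written is ambiguous about grouping; this sketch assumes the intended reading is that the dosage is centered by its expected value, `Beta * (dosage - 2 * sum(maf * p_ancestry))`, where the sum runs over superpopulations. Function names, superpopulation labels, and all numbers are hypothetical.

```python
def expected_allele_frequency(maf_by_superpop, ancestry_probs):
    """Ancestry-probability-weighted allele frequency for one variant."""
    return sum(maf_by_superpop[pop] * prob for pop, prob in ancestry_probs.items())


def centered_variant_score(beta, dosage, maf_by_superpop, ancestry_probs):
    """Contribution of one variant to the score under the assumed (centered) reading."""
    # A diploid individual is expected to carry 2 * allele frequency copies
    expected_dosage = 2.0 * expected_allele_frequency(maf_by_superpop, ancestry_probs)
    return beta * (dosage - expected_dosage)


# Made-up inputs: gnomAD-style frequencies per superpopulation and
# the individual's ancestry probabilities (assumed to sum to 1)
maf = {"AFR": 0.1, "EUR": 0.3}
probs = {"AFR": 0.25, "EUR": 0.75}

score = centered_variant_score(beta=0.2, dosage=1.0, maf_by_superpop=maf, ancestry_probs=probs)
```

If the other reading is intended (subtracting the expectation term without multiplying it by Beta), only the last line of `centered_variant_score` would change.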

Covariance Matrix Estimation

Overall goal: improve network analysis, regressions, clustering, and anything else that relies on a covariance matrix. The empirical covariance matrix is not a good estimator, especially with small sample sizes.

[x] Incorporate singular value shrinkers - Ilha - 2024/05/08
[x] Start testing performance of these shrinkers on gaussian data / rank 1 covariance matrices - Ilha - 2024/05/08
[x] Implement a non-negative covariance matrix estimator, which will be useful for genetic methods where we can assume no negative correlations (like rare variant analysis) - Ilha - 2024/05/16
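
To illustrate why the empirical covariance matrix struggles in small samples, here is a hedged sketch on Gaussian data with identity covariance. It uses a fixed linear shrinkage toward a scaled identity (in the spirit of Ledoit and Wolf), which is a stand-in and not the singular value shrinkers or the non-negative estimator listed above. The shrinkage weight `alpha=0.5` is an arbitrary assumption; data-driven methods choose it from the sample.

```python
import numpy as np


def empirical_covariance(X):
    """Sample covariance with rows as samples and columns as features."""
    return np.cov(X, rowvar=False)


def linear_shrinkage(S, alpha):
    """Blend S with a scaled identity that has the same average variance."""
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target


rng = np.random.default_rng(0)
n_samples, n_features = 20, 50  # fewer samples than features
true_cov = np.eye(n_features)
X = rng.normal(size=(n_samples, n_features))

S = empirical_covariance(X)
S_shrunk = linear_shrinkage(S, alpha=0.5)

# Frobenius-norm distance to the true covariance
err_empirical = np.linalg.norm(S - true_cov, "fro")
err_shrunk = np.linalg.norm(S_shrunk - true_cov, "fro")

# With n < p the empirical estimate is rank deficient, so it cannot be inverted
rank_empirical = np.linalg.matrix_rank(S)
```

In this setup the shrunk estimate is closer to the truth and stays invertible, while the empirical estimate has rank at most `n_samples - 1`.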

Infrastructure

[x] Improve Python wheels so that they're deployable on ARM Mac, X86 Mac, Linux - @akotlar - 2024/05/16 for ability to install on Mac.

Post IBDGC Tasks

[x] Improve saving performance - April 5th
[x] (critical) Add ability to whitelist user accounts, accepting only environment-variable specified email addresses
[x] Search as you type (16ms timeout) - "instant search" for more feedback when you don't know what to type
[x] Search for fields in Filters and Aggregations
[ ] (sprint 13+) Re-runnable chunked uploads for both local and from-s3 uploads to handle network instability
[ ] (sprint 13)(high) Add automated report generation for the highly consequential variants (with far fewer annotations). Toggle for less complex view, download that less complex field view 'Clinical report'
[x] (sprint 12) Quick search buttons - "Clinical variants", "Deleterious Mutations" - 2024/05/16 to get initial mockup
[ ] (sprint 13) Add click on table cells to copy query - TBD
[ ] (sprint 14+) Make AMIs fully restartable, including bystro-annotator
[x] Kick tires on data sharing. IBDGC will be relying on Public/Share
[ ] (sprint 13+) Improve data sharing - give permissions for individual samples and variants - @akotlar
[ ] (sprint 13+) Add vcfAlt and vcfRef annotations (the original ref / alt? or after normalization? if the latter, then vcfPos has to change to the normalized position) (Data management)
[x] (stretch) Autocomplete field queries (May be pushed to Sprint 11)
[ ] (After summer) Share custom synonyms (IBDGC really liked this)
[x] Add deletion 'Are you sure' button
[x] Fix refreshing of search page when ancestry isn't completed yet
[x] New upload system - upload files from s3 in the background

Notes on "Add automated report generation for the highly consequential variants (with far fewer annotations)"

Gene name, clinvar, exonicAlleleFunction, siteType, cadd, sampleMaf, gnomad.genomes/exomes.AF, gnomad.genomes/exomes id, amino acid substitution, codon number
VEP-like CONSEQUENCE - add that
- most severe and canonical - just the most severe

Needs to be easily digestible - either a saved subset of the full,

The processing will be done by an analyst; for the end user (Judy), the report will need to be very digestible.

2024-04-19 Sprint 10 Retro

Overall what has been accomplished:

IBDGC deployment has taken most of @akotlar time, resulting in reworked upload system and bug fixes (submissionid / job failure)
@akotlar proteomics tasks will roll over as a result
@cristinaetrv PRS tasks on hold until she is back from sick leave
@austinTalbot7241993 Proteomics: 3 steps that need to happen:
1. Imputing missing values is important (the SomaScan & TMT datasets had many missing values). Normally we mean-impute; we have a better strategy. This has been done but needs testing. It uses Soft Impute (a matrix decomposition, rank-based method): https://github.com/bystrogenomics/bystro/pull/467. We have a cross-validation scheme, and Austin has shown that we can explain 70% of variance on the 300 sample dataset. So we can now impute missing values, and that is a requirement for domain adaptation.
2. Domain adaptation: We want covariates to stay in their original space and we want to project new data in. Austin's solution is to make sure the 1st and 2nd moments align. That means we need to estimate the covariance matrix; it turns out that empirical covariance estimators are bad. He has focused on making respectable covariance estimation in Bystro. It will be used for our FAIRE machine learning method, POE, and domain adaptation. He has shown that if you don't do this covariance estimation, domain adaptation makes things worse; otherwise it makes things better, reducing the discrepancy between datasets by 25%. Future improvements will come from collaboration with Ilha. So we now have a harmonization scheme: https://github.com/bystrogenomics/bystro/pull/465. It may not be as good as TAMPOR, but it will be more interpretable, because the original covariate space is preserved.
3. Remaining (Sprint 11): Try to do outer join on TMT + SomaScan, rather than just inner join on TMT and inner join on SomaScan.
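
For readers unfamiliar with Soft Impute (step 1 above), here is a minimal sketch of the textbook iteration: fill the missing entries with the current low-rank estimate, take an SVD, soft-threshold the singular values, and repeat. This is not the code from the linked PR; the function name, the regularization value `lam`, and the toy data are assumptions, and the cross-validation used to choose `lam` is not shown.

```python
import numpy as np


def soft_impute(X, lam, n_iters=100, tol=1e-5):
    """Fill NaN entries of X using iterative soft-thresholded SVD."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)  # current low-rank estimate of the full matrix
    for _ in range(n_iters):
        # Keep observed values, use the current estimate where data is missing
        filled = np.where(observed, X, Z)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values; this is what encourages low rank
        Z_new = (U * np.maximum(s - lam, 0.0)) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    # Observed entries are returned unchanged; only the gaps are imputed
    return np.where(observed, X, Z)


# Toy rank-1 matrix with roughly 10% of entries removed
rng = np.random.default_rng(0)
truth = rng.normal(size=(20, 1)) @ rng.normal(size=(1, 8))
X = truth.copy()
X[rng.random(X.shape) < 0.1] = np.nan

completed = soft_impute(X, lam=0.1)
```

Larger `lam` gives a lower-rank, more heavily regularized fit. A cross-validation scheme like the one mentioned above would hide some observed entries and pick the `lam` that reconstructs them best.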

Summary for Sprint 11 Work

Proteomics Statistical Methods

Write up summary of performance of domain adaptation (with soft impute) vs TAMPOR, or domain adaptation followed by TAMPOR.
Run network analysis on TMT and SomaScan data
Run QTL analysis
Explore Stanford technique for improved logistic regression performance via matrix decomposition

This is an ambitious list. If they roll over, they roll over to the next sprint

Deliverable that we're aiming at over next 2 sprints: get the work/results in Erik Johnson's and Thomas Wingo's hands.

Proteomics API

SomaScan support (API upload is in)
Improved filtering api function will roll over @akotlar
Re-introduce file labeling @dlin30
Re-introduce FragPipe support @dlin30
Add SomaScan upload support @dlin30
API endpoint support for filtering will roll over @dlin30
- This involves making a submission plugin for proteomic filtering (and a listener on the bystro side). Goal is by end of sprint, you can use the Bystro protein filtering API from a machine that is not on the cluster, routing the API command through the bystro api server @dlin30

PRS

Nothing was achieved, all work rolls over. @akotlar will take over until Cristina is back, best effort. Expecting that initial PRS solution is done by Sprint 11 end; so delay 3 weeks.

PRS excitement is high from Dave Cutler, Elizabeth Leslie's group (potentially, as informed by Julien, her lead bioinformatic analyst), and IBDGC.

Infrastructure and bystro webapp

Further improvements on hold with the possible exception of migrating from zip file downloads to either tar downloads, an improved/fixed zip download, or individual file downloads rather than zipping

We currently have an issue with unzipping the big_daly result, on HGCC (but not Mac, other Linux machines), complaining about a possible zip file "bomb". This may be a result of the zip file being large and the unzip program being compiled on x86 not x86-64. "error: invalid zip file with overlapped components (possible zip bomb)". TBD

What went well

Learned a lot: covariance matrix estimation being finicky in finite samples (finance guys: Ledoit and Wolf)
Learned a lot on deployment and worked through important large upload issues leading to massively improved upload system. As a result of forcing ourselves to deploy our work to IBDGC, we have pushed the project forward by months.
Adding more tests to webapp to make future improvements to upload system less error prone / us more confident in them.
Got SPPCA paper rough draft to Jarvis Chen at Harvard and proved that it outperforms L1 regularization and is competitive with L2 regularization, increasing the value and breadth of people that will be interested in this. Jarvis has also expressed interest in bringing this to a wide range of students at HSPH.
We sat with 2 users (Chris Tasted at IBDGC and Julien at Emory) as they tried to use our product

What didn't

The tests didn't get done quickly enough. A lot of learning on how to write tests for async code in javascript
It's never great to have bugs, and the upload system simply had not been tested enough.
Whenever there is learning, it means things are harder than expected and there are delays.

What is 1 thing that we will do differently this sprint.

Use our own code more often. Example: initialization scheme turned out to be critical for supervised PPCA; upload system was undercooked; bug was introduced that prevented jobs from being marked failed leading to "stuck" jobs.
as part of this @akotlar will sit with more users

2024-04-23

Proteomics Topic Meeting

Austin:

Domain adaptation needs complete data, so we need to impute missing values; the 330 sample TMT/SomaScan data has relatively large missingness. Our Soft Impute CV module does well, giving around 70% variance explained in imputed data.
- The SoftImpute CV module selects a regularization parameter based on the observed data
- This simplifies our lives because this means that any statistical module we make can assume no missing data
- You need >30 samples
Nicole (Austin's wife) is a proteomicist; her thesis was on 6 samples. She separated proteins into transmembrane vs non-transmembrane and dropped missing values.
POE & Domain Adaptation: We need matrix methods; deep neural networks aren't the best bet. He has been working on the fact that empirical covariance matrix estimation is not good. Ilha will estimate 15 or so covariance matrix estimation methods. You take a whole bunch of experiments, create a mapping to a common mean and covariance matrix, then future experiments can also be mapped into that space.
- He is also making methods to characterize performance. Will be useful for diagnoses as well.
He is also looking at singular value shrinkage methods.
Why he did PPCA: You could put in an option to either plot to the first 2 principal components or the first 2 that have nothing to do with race. This would be useful for Erik Johnson's denoising work.

2024-04-30

Proteomics Topic Meeting

Domain Adaptation:

Goal is to learn a function that adjusts 1 dataset to match the mean and covariance matrix of the group
Austin recommends that we make the means and covariances of all batches align, since most outlier detection depends on the first 2 moments. Estimating the variance in high dimensions is difficult, so we need to regularize the covariance matrix. The problem is we don't have enough samples to do the mapping and evaluate performance. There is no way to evaluate on real data.
- What would the minimum size be? Several thousand samples.
We have demonstrated that we get good performance on synthetic data.
The data we have is ~330 samples, the same samples. We should ask them for their 900 TMT dataset. 9000 proteins, all brain. Accessing this data is a bit easier because it is less identifiable. Thomas will email Nick and ask to get this.
Have we compared to TAMPOR?
- There is not a good way to compare to TAMPOR. The way to do this in ML is to use cross validation.
How was TAMPOR evaluated? Not very rigorously; e.g. they look at the differential expression signal, and see whether it seems right.
We could intentionally put some outlier point and see if we can detect it. There are 2 modes of running the mass spec, MS2 and MS3. We have a dataset, where they generated the data that had a mixture of MS2 and MS3 (400 samples, 93 or so were MS3). They re-ran the entire dataset, in just MS2
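
The goal stated at the top of this section (adjust one dataset so its mean and covariance match the group's) can be sketched as a whiten-then-recolor transform. This is an illustrative assumption about the form of the mapping and not the implementation in the Bystro PR; the fixed linear shrinkage with weight `alpha` stands in for whichever regularized covariance estimator is actually used, and all names and numbers are hypothetical.

```python
import numpy as np


def shrunk_covariance(X, alpha):
    """Empirical covariance blended with a scaled identity (alpha=0 is no shrinkage)."""
    S = np.cov(X, rowvar=False)
    p = S.shape[0]
    return (1.0 - alpha) * S + alpha * (np.trace(S) / p) * np.eye(p)


def symmetric_matrix_power(S, power):
    """Power of a symmetric positive definite matrix via its eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(S)
    return (eigvecs * eigvals**power) @ eigvecs.T


def align_moments(batch, target_mean, target_cov, alpha=0.1):
    """Map a batch so its first two moments approximately match the targets."""
    batch = np.asarray(batch, dtype=float)
    centered = batch - batch.mean(axis=0)
    # Whiten with the batch's (regularized) covariance, then recolor with the target's
    whiten = symmetric_matrix_power(shrunk_covariance(batch, alpha), -0.5)
    color = symmetric_matrix_power(np.asarray(target_cov, dtype=float), 0.5)
    return centered @ whiten @ color + np.asarray(target_mean, dtype=float)


# Synthetic batch with plenty of samples, so no shrinkage is needed here
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
batch = rng.normal(size=(200, 3)) @ mixing + 5.0

target_mean = np.array([1.0, 2.0, 3.0])
target_cov = np.array([[1.0, 0.2, 0.0], [0.2, 1.0, 0.1], [0.0, 0.1, 1.0]])

aligned = align_moments(batch, target_mean, target_cov, alpha=0.0)
```

With `alpha=0.0` and many more samples than features, the aligned batch matches the target moments up to numerical error. With few samples the empirical covariance is not invertible, which is where a regularized estimate (`alpha > 0` in this sketch) becomes necessary, at the cost of matching the target covariance only approximately.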

2024-05-03

Proteomics topic meeting

ProteomicsPipelineDemonstration.ipynb.zip

@akotlar is working on adapting this to proteomics data

2024-05-07

Proteomics Topic Meeting

Dennis got blocked by annotator installation (to create dev instance); running into installation issues, which are being documented and fixed.

Ilha is working this week on covariance estimation methods:

This week's work: constraining covariance matrix to have non-negative entries. The expected correlation is -1/sqrt(mutation_rate_product), so slightly negative. This means we introduce many 0's, sparsity.

Alex - on track for proteomics data; initial analysis on 300 sample CSF TMT + SomaScan, then 400 and 900 sample datasets that Thomas/Nick shared.

Austin - will share the Jupyter notebook demonstrating SPPCA on neuroscience data.

Common variant topic meeting

Austin/POE: People have created hypothesis testing for detecting spikes in isotropic covariance matrices. We whiten homozygotes, apply to heterozygotes. We will implement a hypothesis test for detecting a single spike; we know that after you whiten heterozygotes, your covariance matrix will be isotropic with a single spike. This will result in a call and p-value. Then we will focus on singular value shrinkers that give good estimates.

Rare variant topic meeting

Austin is trying to prove rare variant analysis is inherently impossible outside mendelian traits. He is showing that if you have many rare variants, and bound their effects (in terms of P(Disease|variant))...when having any mutation has a tiny effect, the population variance in having disease goes to 0; which is to say everyone has identical risk for having disease.

2024-05-08

Austin - working on NeurIPS paper
Dennis - wrapping up installation guide
Cristina - PR'ing PRS today
Ilha - close to completing the singular value shrinkers; working on operator norm shrinker that is well suited for large n; genotyping data will use the non-negative covariance matrix estimator
Alex - Got the 300 sample data decompressed (required 7zip to avoid the "corruption" and refusal to decompress). In communication with Eric Dammers, who has explained what the files mean (same naming scheme as the olink/tmt/somascan paper)

2024-05-10 Weekly Meeting

Agenda

Create an item in the task list if the work being undertaken is over 1/2 days of work; this helps us track new and necessary work that comes up post-sprint planning.

Discussion

Singular value shrinkers are still WIP - working on a version that handles any sample size
PRS - on track
Proteomics - behind a few days but will come back on track
Infrastructure - CVXPY & scikit-allel in particular presented issues during install on Arm Mac, need to follow up and find a solution (according to https://github.com/cvxpy/cvxpy/issues/2075 this is now resolved)

bystrogenomics / bystro

Sprint 11 Task List #456
