PoonLab / gromstole

Quantifying SARS-CoV-2 VoCs from NGS data of wastewater samples
MIT License

Unit tests! #12

Open ArtPoon opened 3 years ago

ArtPoon commented 3 years ago

We need 'em

DBecker7 commented 2 years ago

Some unit tests/synthetic data started in 9eeaf51

ArtPoon commented 2 years ago

Eventually I think we'll need to simulate data for a comparison of pipelines against some ground truth

DBecker7 commented 2 years ago

Proposal:

  1. Select from sequences.fasta according to current estimates of relative frequencies (frequency and total count are known).
    • That is, sample names from the metadata, then use seqtk subseq to extract those records from sequences.fasta (this doesn't work on sequences.fasta.xz, so it must be decompressed first).
    • I've already created sequences_pangolin.fasta.xz, which only contains sequences with known pangolin lineages (data/get-sequences-with-pangolineage.sh; it takes over an hour to run on Rei because of the decompression and recompression).
  2. Extract amplicon regions from the sampled FASTA sequences and write them to a FASTQ file with simulated Phred scores (or just FASTA if the scores are not used).
    • Simulate coverage from our data.
    • At this step, we intentionally lose information about linkage across amplicons.
  3. Randomly sample from this file (without replacement) to simulate degradation / incomplete sampling.
    • The assumed level of degradation will probably be arbitrary, but it can help us demonstrate the effect of degradation on case count estimation.
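
A minimal sketch of the three steps above, using only the Python standard library. Everything named here is an assumption for illustration: the function names, the `metadata` mapping (sample name to lineage), the `freqs` mapping (lineage to relative frequency) and the half-open `(start, end)` amplicon coordinates are hypothetical and are not taken from the gromstole code. Coverage and Phred score simulation are left out.

```python
import random


def sample_names(metadata, freqs, n, rng):
    """Step 1: draw n sample names, choosing lineages in proportion to freqs.

    metadata: dict of sample name -> lineage (hypothetical layout).
    freqs: dict of lineage -> relative frequency; need not sum to 1.
    """
    by_lineage = {}
    for name, lineage in metadata.items():
        by_lineage.setdefault(lineage, []).append(name)
    # Only lineages that actually have sequences can be sampled.
    lineages = sorted(lin for lin in freqs if lin in by_lineage)
    weights = [freqs[lin] for lin in lineages]
    picks = rng.choices(lineages, weights=weights, k=n)
    return [rng.choice(by_lineage[lin]) for lin in picks]


def extract_amplicons(seq, amplicons):
    """Step 2: cut each (start, end) region out of one genome.

    Returning independent fragments is what discards linkage across amplicons.
    """
    return [seq[start:end] for start, end in amplicons]


def degrade(reads, keep_fraction, rng):
    """Step 3: keep a random subset of reads, sampled without replacement."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    k = round(len(reads) * keep_fraction)
    return rng.sample(reads, k)


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so a simulated data set is reproducible
    metadata = {"s1": "BA.1", "s2": "BA.1", "s3": "BA.2"}
    genomes = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAC", "s3": "ATGTACGTGC"}
    names = sample_names(metadata, {"BA.1": 0.7, "BA.2": 0.3}, 20, rng)
    reads = []
    for name in names:
        reads.extend(extract_amplicons(genomes[name], [(0, 4), (3, 8)]))
    print(len(degrade(reads, 0.5, rng)), "reads kept of", len(reads))
```

In practice the names from step 1 would presumably be written to a file and passed to seqtk subseq rather than looked up in an in-memory dictionary. Because the sampled names are the ground truth, a test can compare the lineage frequencies estimated by the pipeline against the frequencies of the names actually drawn.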

A computationally faster method might be to calculate all mutations from sequences_pangolin (encode_diffs) ahead of time, sample amplicon regions, simulate coverage, then reconstruct the sequence within the covered regions by applying those mutations to the reference.
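
A rough sketch of the reconstruction step in that faster approach, assuming the precomputed differences are stored as `(position, alt_base)` substitutions with 0-based positions. That representation and the function name `reconstruct_region` are assumptions made for illustration; the actual output format of encode_diffs may differ, and insertions and deletions are not handled here.

```python
def reconstruct_region(reference, diffs, start, end):
    """Rebuild the sequence of one covered region [start, end) from the reference.

    diffs: iterable of (position, alt_base) substitutions, 0-based positions
    relative to the full reference (hypothetical format).
    """
    region = list(reference[start:end])
    for pos, alt in diffs:
        # Substitutions outside the covered region are simply ignored.
        if start <= pos < end:
            region[pos - start] = alt
    return "".join(region)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    diffs = [(1, "T"), (8, "G")]
    for start, end in [(0, 4), (5, 10)]:
        print(start, end, reconstruct_region(reference, diffs, start, end))
```

Only the covered windows are ever materialised, so the full genome of each sampled sequence never needs to be decompressed or held in memory.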

GopiGugan commented 2 years ago

Unit test needed for #46

ArtPoon commented 1 year ago

Please focus on minimap2.py and estimate-freqs.R @SandeepThokala thanks

GopiGugan commented 1 year ago

@SandeepThokala to post coverage of unit tests