Unit tests!

ArtPoon commented 3 years ago

We need 'em

DBecker7 commented 2 years ago

Some unit tests/synthetic data started in 9eeaf51

ArtPoon commented 2 years ago

Eventually I think we'll need to simulate data for a comparison of pipelines against some ground truth

DBecker7 commented 2 years ago

Proposal:

1. Select sequences from sequences.fasta according to current estimates of relative frequencies (frequency and total count are known).
   - I.e., sample names from the metadata, then use seqtk subseq to extract them from sequences.fasta (this doesn't work with sequences.fasta.xz; the file must be decompressed first).
   - I've already created sequences_pangolin.fasta.xz, which only contains sequences with known pangolin lineages (data/get-sequences-with-pangolineage.sh; it takes over an hour to run on Rei because of the decompression/compression steps).
2. Extract amplicon regions from the sampled fastas and output them to a fastq with simulated Phred scores (or just a fasta if the scores are not used).
   - Simulate coverage from our data.
   - At this step, we intentionally lose information about linkage across amplicons.
3. Randomly sample from this file (without replacement) to simulate degradation / incomplete sampling.
   - The assumed level of degradation will probably be arbitrary, but it can help us demonstrate the effect of degradation on case count estimation.
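
A rough Python sketch of the two sampling steps above (choosing names by frequency, then subsampling without replacement). The function names, the name-to-lineage dict, and the `keep_fraction` parameter are hypothetical and not taken from the repo; it also assumes the frequencies are per lineage and that names are drawn with replacement, neither of which is settled above.

```python
import random


def sample_names(lineage_by_name, lineage_freqs, n, rng):
    """Draw n sequence names: pick a lineage by its frequency, then a name within it.

    lineage_by_name: dict of sequence name -> lineage (hypothetical metadata format)
    lineage_freqs: dict of lineage -> relative frequency (need not sum to 1)
    Names are drawn with replacement, so the same name can appear more than once.
    """
    by_lineage = {}
    for name, lineage in lineage_by_name.items():
        by_lineage.setdefault(lineage, []).append(name)
    # Only lineages that actually have sequences available can be sampled.
    lineages = [lin for lin in lineage_freqs if lin in by_lineage]
    weights = [lineage_freqs[lin] for lin in lineages]
    picked = rng.choices(lineages, weights=weights, k=n)
    return [rng.choice(by_lineage[lin]) for lin in picked]


def degrade(reads, keep_fraction, rng):
    """Subsample reads without replacement to mimic degradation / incomplete sampling."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0 and 1")
    k = round(len(reads) * keep_fraction)
    return rng.sample(reads, k)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a unit test can reproduce the draw
    metadata = {"seq1": "B.1.1.7", "seq2": "B.1.1.7", "seq3": "BA.1"}
    names = sample_names(metadata, {"B.1.1.7": 0.7, "BA.1": 0.3}, 10, rng)
    print("\n".join(names))  # one name per line
```

The printed names could be saved to a file and passed to `seqtk subseq sequences.fasta names.txt` to pull out the records. Passing in a seeded `random.Random` keeps the simulation reproducible, which matters if the output becomes ground truth for tests.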

A computationally faster method might be to calculate all mutations from sequences_pangolin (encode_diffs) ahead of time, sample amplicon regions, simulate coverage, then reconstruct the sequence within coverage regions from the reference.
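
To illustrate the reconstruction step in the faster method, here is a minimal sketch. It assumes mutations are stored as a dict of 0-based position to alternate base, which is a made-up format for illustration and not necessarily what encode_diffs produces, and it handles substitutions only (no indels). `reconstruct` and its arguments are hypothetical names.

```python
def reconstruct(reference, substitutions, covered, mask="N"):
    """Rebuild a sequence inside covered regions only.

    reference: reference genome as a string
    substitutions: dict of 0-based position -> alternate base (hypothetical format)
    covered: list of (start, end) half-open, 0-based intervals with coverage
    Positions outside every covered interval are masked.
    """
    seq = [mask] * len(reference)
    for start, end in covered:
        # Clamp intervals so out-of-range coordinates cannot raise an IndexError.
        for pos in range(max(start, 0), min(end, len(reference))):
            seq[pos] = substitutions.get(pos, reference[pos])
    return "".join(seq)


if __name__ == "__main__":
    ref = "ACGTACGT"
    subs = {1: "T", 6: "A"}
    print(reconstruct(ref, subs, [(0, 3), (5, 7)]))  # prints ATGNNCAN
```

Because only the reference and a small set of differences are needed per sample, this avoids reading full sequences out of sequences.fasta for every simulated sample.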

GopiGugan commented 2 years ago

Unit test needed for #46

ArtPoon commented 1 year ago

Please focus on minimap2.py and estimate-freqs.R @SandeepThokala thanks

GopiGugan commented 1 year ago

@SandeepThokala to post coverage of unit tests

PoonLab / gromstole

Unit tests! #12