test datasets

elizabethmcd commented 2 years ago

Generate test datasets for both short-read and long-read functionality, and host them in a separate GitHub repo under the Arcadia-Science organization so they don't bloat the workflow repo. Full-size datasets for full-size tests can stay on AWS, but the datasets for rapid testing need to be extremely pared down. Also document how the test datasets were generated, i.e. map reads to reference genomes, extract the mapped reads, subset them with seqtk, etc.

[x] Test dataset for short Illumina reads - from cheese Illumina metagenomes
[x] Test dataset for long Nanopore reads - from Dutton cheese Nanopore metagenomes

Consider a test dataset of PacBio HiFi metagenome reads, because in theory they shouldn't need the polishing steps that Nanopore reads might need (the newer Nanopore chemistries need less polishing, but the polishing steps should still be available in the workflow).

elizabethmcd commented 2 years ago

Steps for each:

Run the metagenomics pipeline through assembly/profiling with lineage classification
Manually bin with mmgenome2/anvi'o etc. for some downstream comparisons with binners (not part of this workflow)
Ideally have at least 1 bacterium, 1 fungus, and 1 phage from a sample
Map the reads from that sample to the concatenated genomes
Extract the mapped reads from the BAM file and subset them with seqtk
The result is the test dataset, hosted in the GitHub repo Arcadia-Science/test-datasets. It can serve as the test dataset for both the metagenomics and binning workflows, since the latter should be designed to take either raw reads or assemblies with accompanying info (coverage table, lineage classification?)
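
The map, extract, and subset steps above could look roughly like the sketch below. It only builds the shell commands rather than running them, and it makes several assumptions that are not stated in this issue: minimap2 as the mapper, single-end (Nanopore-style) input, and hypothetical file names, read count, and seed. For paired Illumina reads, the preset would be `sr` and both mate files would need to be subset with the same seqtk seed.

```python
import shlex


def build_test_dataset_commands(reference, reads, prefix,
                                preset="map-ont", n_reads=10000, seed=100):
    """Return shell commands for: map reads -> keep mapped reads -> subsample.

    Assumptions (not from the issue): minimap2 is the mapper, input is a
    single FASTQ file, and all file names are hypothetical.
    """
    q = shlex.quote
    bam = f"{prefix}.mapped.bam"
    mapped_fq = f"{prefix}.mapped.fastq"
    subset_fq = f"{prefix}.subset.fastq"
    return [
        # Map to the concatenated genomes, drop unmapped reads (-F 4), sort.
        f"minimap2 -ax {q(preset)} {q(reference)} {q(reads)}"
        f" | samtools view -b -F 4 -"
        f" | samtools sort -o {q(bam)} -",
        # Convert the mapped reads in the BAM file back to FASTQ.
        f"samtools fastq {q(bam)} > {q(mapped_fq)}",
        # Subsample a fixed number of reads with a fixed seed for reproducibility.
        f"seqtk sample -s{seed} {q(mapped_fq)} {n_reads} > {q(subset_fq)}",
    ]


if __name__ == "__main__":
    # Hypothetical inputs; print the commands so they can be reviewed or
    # pasted into the dataset documentation.
    for cmd in build_test_dataset_commands("genomes.fa", "reads.fastq", "cheese"):
        print(cmd)
```

Printing the commands instead of executing them keeps a record that can go straight into the documentation of how the test datasets were generated.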

elizabethmcd commented 1 year ago

Once the datasets are hosted in a public location on S3, they have to be listed in a samplesheet because of the issue described in #4
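
As a rough illustration of listing S3-hosted reads in a samplesheet, here is a minimal sketch that writes a CSV. The column names (`sample`, `fastq_1`, `fastq_2`), the sample names, and the `s3://example-bucket/...` paths are all hypothetical; the actual samplesheet format expected by the workflow is not described in this issue.

```python
import csv
import io

# Hypothetical column layout; the workflow's real samplesheet may differ.
FIELDNAMES = ["sample", "fastq_1", "fastq_2"]


def write_samplesheet(rows, handle):
    """Write one samplesheet row per test dataset to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        # Missing keys (e.g. no second mate for long reads) are left empty.
        writer.writerow(row)


if __name__ == "__main__":
    # Hypothetical S3 locations, for illustration only.
    rows = [
        {"sample": "illumina_test",
         "fastq_1": "s3://example-bucket/illumina_R1.fastq.gz",
         "fastq_2": "s3://example-bucket/illumina_R2.fastq.gz"},
        {"sample": "nanopore_test",
         "fastq_1": "s3://example-bucket/nanopore.fastq.gz"},
    ]
    buffer = io.StringIO()
    write_samplesheet(rows, buffer)
    print(buffer.getvalue(), end="")
```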

elizabethmcd commented 1 year ago

Addressed in https://github.com/Arcadia-Science/test-datasets/pull/9, where the Nanopore test dataset is validated and passes all checks; the same goes for the Illumina dataset.

Arcadia-Science / metagenomics

test datasets #13