Arcadia-Science / metagenomics

A Nextflow workflow for QC, evaluation, and profiling of metagenomic samples using short- and long-read technologies
MIT License

test datasets #13

Closed by elizabethmcd 1 year ago

elizabethmcd commented 2 years ago

Generate test datasets for both the short-read and long-read functionality, and host them in a separate GitHub repo under the Arcadia-Science organization so they don't bloat the workflow repo. Full-size datasets for full-size tests can stay on AWS, but the datasets for rapid testing need to be extremely pared down. Also document how the test datasets were generated, i.e. map reads to reference genomes, extract the mapped reads, subset them with seqtk, etc.

Consider a test dataset of PacBio HiFi reads for metagenomes, because in theory they shouldn't need the polishing steps that Nanopore reads might need (the newer chemistries need less polishing, but the processes should still be there).

elizabethmcd commented 2 years ago

Steps for each:

  1. Run the metagenomics pipeline through assembly/profiling with lineage classification
  2. Manually bin with mmgenome2/anvi'o etc. for some downstream comparisons with binners (not part of this workflow)
  3. Ideally have at least 1 bacterium, 1 fungus, and 1 phage from a sample
  4. Map the reads from that sample to the concatenated genomes
  5. Extract the mapped reads from the BAM file and subset them with seqtk
  6. The result is the test dataset; host it in the GitHub repo Arcadia-Science/test-datasets. It can serve as the test dataset for both the metagenomics and binning workflows, since the latter should be designed to take either raw reads or assemblies with accompanying info (coverage table, lineage classification?)
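
Steps 4 and 5 could be sketched roughly as below. This is only an illustration, not the commands actually used for the hosted datasets: the choice of minimap2 as the mapper, the file names, the read count, and the seed are all assumptions. The helper only builds the shell command lines (it does not run anything), and it covers the single-file case (e.g. Nanopore or HiFi reads); paired-end Illumina reads would need extra handling to keep mates together.

```python
import shlex
from typing import List


def build_test_dataset_commands(
    reference: str,
    reads: str,
    prefix: str,
    preset: str = "map-ont",  # assumed minimap2 preset; "map-hifi" or "sr" for other read types
    n_reads: int = 10000,     # hypothetical subset size
    seed: int = 100,          # fixed seed so the subset is reproducible
) -> List[str]:
    """Return shell command lines for mapping, extracting, and subsetting reads.

    Nothing is executed here; the caller decides how to run the commands.
    """
    q = shlex.quote
    bam = f"{prefix}.mapped.bam"
    mapped_fastq = f"{prefix}.mapped.fastq"
    subset_fastq = f"{prefix}.subset.fastq"
    return [
        # Step 4: map reads to the concatenated genomes, drop unmapped reads (-F 4), sort
        f"minimap2 -ax {q(preset)} {q(reference)} {q(reads)}"
        f" | samtools view -b -F 4 -"
        f" | samtools sort -o {q(bam)} -",
        # Step 5a: convert the mapped reads in the BAM file back to FASTQ
        f"samtools fastq {q(bam)} > {q(mapped_fastq)}",
        # Step 5b: randomly subsample a fixed number of reads with seqtk
        f"seqtk sample -s{seed} {q(mapped_fastq)} {n_reads} > {q(subset_fastq)}",
    ]


if __name__ == "__main__":
    # Hypothetical input names, for illustration only
    for command in build_test_dataset_commands(
        "concatenated_genomes.fasta", "sample_reads.fastq.gz", "nanopore_test"
    ):
        print(command)
```

Printing the commands rather than running them makes it easy to paste them into the documentation of how each test dataset was generated.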

elizabethmcd commented 1 year ago

Once the datasets are hosted in a public location on S3, they have to be listed in a samplesheet because of the issue described in #4.
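
A samplesheet listing the hosted files could be generated along these lines. This is a minimal sketch: the column names (`sample`, `reads`) and the S3 path are placeholders, not the workflow's actual samplesheet schema or bucket, so they would need to be swapped for whatever the workflow expects.

```python
import csv
import io
from typing import Dict, Iterable, Sequence


def write_samplesheet(
    rows: Iterable[Dict[str, str]],
    columns: Sequence[str] = ("sample", "reads"),  # hypothetical column names
) -> str:
    """Render rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(columns), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder sample name and S3 location, for illustration only
    print(
        write_samplesheet(
            [{"sample": "nanopore_test", "reads": "s3://example-bucket/nanopore_test.subset.fastq.gz"}]
        ),
        end="",
    )
```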

elizabethmcd commented 1 year ago

Addressed in https://github.com/Arcadia-Science/test-datasets/pull/9, where the Nanopore test dataset is validated and passes all checks; same for Illumina.