elizabethmcd closed this issue 1 year ago
Steps for each:
seqtk
Arcadia-Science/test-datasets
- This can serve as the test dataset for both the metagenomics and binning workflows, since the latter should be designed to take either raw reads or assemblies with accompanying info (coverage table, lineage classification?). Once hosted in a public place on S3, it has to be listed in a samplesheet because of the issue described in #4.
Addressed in https://github.com/Arcadia-Science/test-datasets/pull/9, where the Nanopore test dataset is validated and passes all checks; the same holds for the Illumina dataset.
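Since the hosted files have to be listed in a samplesheet, here is a rough sketch of how one could be generated. The column names (`sample`, `fastq_1`, `fastq_2`) and the `s3://example-bucket/...` paths are assumptions for illustration only; the workflow's actual samplesheet schema and the real hosting locations may differ.

```python
import csv
import io

# Hypothetical column names; the workflow's real samplesheet schema may differ.
COLUMNS = ["sample", "fastq_1", "fastq_2"]


def build_samplesheet(rows):
    """Return samplesheet CSV text for (sample, fastq_1, fastq_2) tuples.

    Use an empty string for fastq_2 when a sample is single-end or long-read.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    for sample, fastq_1, fastq_2 in rows:
        writer.writerow([sample, fastq_1, fastq_2])
    return buffer.getvalue()


# Placeholder locations, not the real hosted paths.
EXAMPLE_ROWS = [
    (
        "illumina_test",
        "s3://example-bucket/illumina_R1.fastq.gz",
        "s3://example-bucket/illumina_R2.fastq.gz",
    ),
    ("nanopore_test", "s3://example-bucket/nanopore.fastq.gz", ""),
]

if __name__ == "__main__":
    print(build_samplesheet(EXAMPLE_ROWS), end="")
```

Keeping the long-read row in the same sheet with an empty second column is just one possible layout; separate sheets per read type would work equally well.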
Generate test datasets for both the short-read and long-read functionalities, and host them in a separate GitHub repo under the Arcadia-Science organization so they don't bloat the workflow repo. Full-size datasets for full-size tests can stay on AWS, but for rapid testing the datasets need to be extremely pared down. Also document how the test datasets were generated, i.e. map to reference genomes, extract those reads, subset the reads with `seqtk`, etc.

Consider a test dataset of PacBio HiFi reads for metagenomes, because in theory these shouldn't need the polishing steps that Nanopore reads might need (although the newer chemistries need less polishing, the processes should still be there).
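To help document the subsetting step, a minimal sketch of wrapping `seqtk sample` might look like the following. The function names, read count, and file names are hypothetical; only the `seqtk sample -s<seed> <in.fq> <number>` invocation comes from seqtk itself, and the numbers actually used for the test datasets are not recorded here.

```python
import shutil
import subprocess


def seqtk_sample_command(fastq_path, n_reads, seed=100):
    """Build the argument list for `seqtk sample`, which writes reads to stdout."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return ["seqtk", "sample", f"-s{seed}", str(fastq_path), str(n_reads)]


def subsample(fastq_path, out_path, n_reads, seed=100):
    """Run seqtk and write the subsampled reads to out_path."""
    if shutil.which("seqtk") is None:
        raise RuntimeError("seqtk was not found on PATH")
    command = seqtk_sample_command(fastq_path, n_reads, seed)
    with open(out_path, "wb") as handle:
        subprocess.run(command, stdout=handle, check=True)


if __name__ == "__main__":
    # Hypothetical file name and read count, shown only to illustrate the call.
    print(" ".join(seqtk_sample_command("mapped_R1.fastq", 10000)))
```

For paired-end data, using the same seed for both files keeps the read pairs matched, which is why the seed is an explicit parameter.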