Arcadia-Science / metagenomics

A Nextflow workflow for QC, evaluation, and profiling of metagenomic samples using short- and long-read technologies
MIT License

Test dataset config #16

Closed · elizabethmcd closed 1 year ago

elizabethmcd commented 1 year ago

This pull request starts the process of adding a test dataset and configuration. The subsampled test files are located in the Arcadia-Science/test-datasets repo at https://github.com/Arcadia-Science/test-datasets/tree/main/cheese-illumina-metagenomes. The test config uses low CPU and memory limits to exercise the overall workflow.
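
For reference, a low-resource test profile in nf-core style looks roughly like the sketch below; the exact parameter names and limits here are illustrative, not the committed config:

```groovy
// conf/test.config -- minimal sketch of a low-resource test profile (values are assumptions)
params {
    config_profile_name        = 'Test profile'
    config_profile_description = 'Minimal subsampled dataset to check the workflow runs end to end'

    // Keep resource requests small so the test fits on a laptop or a CI runner
    max_cpus   = 2
    max_memory = '6.GB'
    max_time   = '1.h'
}
```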

Currently the workflow can be tested locally with `nextflow run main.nf -profile test,docker --input ../test-datasets/cheese-illumina-metagenomes/"*_{1,2}.fq.gz" --outdir test`, where the test fastqs are local because calling them from a GitHub repo requires them to be listed in a CSV, and I haven't added support for the samplesheet_csv module yet. With this test the workflow runs on 2 subsampled samples in ~7 minutes:

N E X T F L O W  ~  version 22.10.0
Launching `main.nf` [trusting_majorana] DSL2 - revision: 5fa88c2472
executor >  local (10)
[76/c13c37] process > METAGENOMICS:METAGENOMICS_SR:FASTP (vir_1_subsampled)              [100%] 2 of 2 ✔
[8c/81871f] process > METAGENOMICS:METAGENOMICS_SR:SPADES (vir_1_subsampled)             [100%] 2 of 2 ✔
[08/a077b4] process > METAGENOMICS:METAGENOMICS_SR:MAPPING_DEPTH:BOWTIE2_ASSEMBLY_BUI... [100%] 2 of 2 ✔
[d6/c87968] process > METAGENOMICS:METAGENOMICS_SR:MAPPING_DEPTH:BOWTIE2_ASSEMBLY_ALI... [100%] 2 of 2 ✔
[24/d8b178] process > METAGENOMICS:METAGENOMICS_SR:MAPPING_DEPTH:METABAT2_JGISUMMARIZ... [100%] 2 of 2 ✔
Completed at: 21-Oct-2022 10:28:56
Duration    : 7m 13s
CPU hours   : 0.3
Succeeded   : 10
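
For context on the glob-style --input used above: the exact channel logic in main.nf may differ, but a paired-end glob like that is typically consumed with Channel.fromFilePairs, roughly as sketched here:

```groovy
// Sketch only: how a paired-end glob like "*_{1,2}.fq.gz" can be read into
// (sample_id, [read1, read2]) tuples in Nextflow DSL2; the actual channel
// logic in main.nf may differ.
ch_reads = Channel.fromFilePairs(params.input, checkIfExists: true)
ch_reads.view()  // inspect the sample id / read-pair tuples
```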

To fix in later pull request(s):

taylorreiter commented 1 year ago

I think this is fine for now, but usually test data is packaged with the software... what was your thinking on keeping it separate? Also, I know you're working in a strict framework and maybe this is unnecessary, but in case it's useful, here's an example of a larger test dataset getting downloaded for CI:

https://github.com/spacegraphcats/spacegraphcats/blob/latest/Makefile
https://github.com/spacegraphcats/spacegraphcats/blob/latest/.github/workflows/test.yml#L52

I'll wait to approve until I have a better understanding of your holistic approach to testing here.

elizabethmcd commented 1 year ago

The nf-core pipelines suggest keeping the physical test datasets separate from the workflow so that they don't bloat the workflow repo. I don't know if doing that causes more or less confusion. Putting the test data directly in the workflow repo would definitely make this intermediate solution for the --input parameter easier, since the test profile could just refer to a test dataset directory within the workflow repo.

My current approach and motivation for testing is just to make sure the workflow runs and data gets through successfully. I don't think the stub profile has full functionality yet, so to do this I need a mini test dataset, as nf-core suggests. I could modify this pull request to put the test data directly in this repo, which would keep everything better contained. However, I think nf-core's reasoning for putting test data on GitHub/S3 is for when you run `nextflow run Arcadia-Science/metagenomics -profile test,docker`; I don't know whether the test profile can still refer to test data inside the workflow repo in that case.
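
If that remote-invocation case turns out to be the blocker, one option is to have the test profile point directly at files hosted in the test-datasets repo once samplesheet support exists. A minimal sketch; the samplesheet.csv filename and raw URL layout are assumptions, since the samplesheet_csv module isn't implemented yet:

```groovy
// Sketch: a test profile referencing remotely hosted test data via a samplesheet.
// The samplesheet.csv name and raw.githubusercontent.com URL layout are assumptions.
params {
    input = 'https://raw.githubusercontent.com/Arcadia-Science/test-datasets/main/cheese-illumina-metagenomes/samplesheet.csv'
}
```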