[Bug]: workflow skips steps when approx_size_sheet.txt is specified

desireetillo commented 2 years ago

What happened?

Just trying out the workflow on my local machine with the included test data. Works fine without the approx_size_sheet.txt, but when I do specify it, the workflow does not perform any of the data processing or assembly steps (see log output below).

Operating System

macOS

Workflow Execution

EPI2ME Labs desktop application

Workflow Execution - EPI2ME Labs Versions

3.1.5

Workflow Execution - Execution Profile

Docker

Workflow Version

0.2.5

Relevant log output

Checking epi2me-labs/wf-clone-validation ...

done - revision: 944a6f9bf6 [v0.2.5]

N E X T F L O W ~ version 22.04.0

NOTE: Your local project version looks outdated - a different revision is available in the remote repository [18ac15e5e0]

Launching `https://github.com/epi2me-labs/wf-clone-validation` [focused_almeida] DSL2 - revision: 944a6f9bf6 [v0.2.5]

Core Nextflow options

revision : v0.2.5

runName : focused_almeida

containerEngine : docker

launchDir : /Users/tillodc/epi2melabs-data/nextflow

workDir : /Users/tillodc/epi2melabs-data/nextflow/instances/2022-10-11-16-44_wf-clone-validation_Jc34UNvLZcQV9rKM5AknHY/work

projectDir : /Users/tillodc/.nextflow/assets/epi2me-labs/wf-clone-validation

userName : tillodc

profile : standard

configFiles : /Users/tillodc/.nextflow/assets/epi2me-labs/wf-clone-validation/nextflow.config

Input/output options

fastq : /Users/tillodc/Documents/SeqCore/Plasmid/Nanopore_Pilot/test_data/test

sample_sheet : /Users/tillodc/Documents/SeqCore/Plasmid/Nanopore_Pilot/test_data/sample_sheet.txt

out_dir : /Users/tillodc/epi2melabs-data/nextflow/instances/2022-10-11-16-44_wf-clone-validation_Jc34UNvLZcQV9rKM5AknHY/output

primers : /Users/tillodc/.nextflow/assets/epi2me-labs/wf-clone-validation/data/primers.tsv

Reference genome options

host_reference : /Users/tillodc/epi2melabs-data/nextflow/NO_HOST_REF

Advanced Options

approx_size : 0

approx_size_sheet: /Users/tillodc/Documents/SeqCore/Plasmid/Nanopore_Pilot/test_data/approx_size_sheet.txt

min_barcode : 0

max_barcode : 192

!! Only displaying parameters that differ from the pipeline defaults !!

------------------------------------------------------

If you use epi2me-labs/wf-clone-validation for your analysis please cite:

* The nf-core framework

https://doi.org/10.1038/s41587-020-0439-x

Checking fastq input.

Barcoded directories detected.

Checking sample sheet.

WARN: Excluding directories not containing .fastq(.gz) files:

WARN: - /Users/tillodc/Documents/SeqCore/Plasmid/Nanopore_Pilot/test_data/test/barcode04

[15/69d8a4] Submitted process > pipeline:getVersions

[22/71c1ec] Submitted process > pipeline:getParams

[74/0cfe8f] Submitted process > checkSampleSheet

WARN: The number of samplesheet entries (4) does not match the number of barcoded directories (3)

[e1/67eefd] Submitted process > pipeline:inserts

[75/6552cf] Submitted process > output (1)

sarahjeeeze commented 2 years ago

Hi, do the sample_id values in the approx_size_sheet match those in the sample_sheet?

desireetillo commented 2 years ago

Hi, yes they appear to. I'm pointing the workflow to the fastqs, the approx_size_sheet.txt, and sample_sheet.txt found in the test data set here: https://github.com/epi2me-labs/wf-clone-validation/tree/master/test_data

desireetillo commented 2 years ago

I don't know too much Nextflow, but I think the bug is in lines 406-410 of main.nf. When the approx size sheet is specified, the workflow creates a channel approx_size whose items have the structure [sample_id, approx_size]. It then attempts to create a new channel called final_samples by joining approx_size with another channel, named_samples, whose items have a completely different structure: [[type:test_sample, barcode:barcode, sample_id:sample_id], path_to_data]. The join can't match anything, since the first element of one is a sample_id string and the first element of the other is a whole metadata map, so the two never share a key.
Anyway, you probably already knew about this, but just in case you didn't! :)
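
To make the mismatch concrete, here is a minimal plain-Python model of a key-based join. It is only a sketch, not the workflow's code: it assumes the join matches items on their first element and drops anything without a partner, and the sample names, file names, and sizes are hypothetical.

```python
def join_on_first(left, right):
    """Pair items from two lists whose first elements are equal.

    Items without a partner are dropped (assumed join behaviour).
    """
    joined = []
    for l_key, *l_rest in left:
        for r_key, *r_rest in right:
            if l_key == r_key:
                joined.append((l_key, *l_rest, *r_rest))
    return joined


# Hypothetical items shaped like the two structures described above.
named_samples = [
    ({"type": "test_sample", "barcode": "barcode01", "sample_id": "sample1"},
     "barcode01.fastq"),
    ({"type": "test_sample", "barcode": "barcode02", "sample_id": "sample2"},
     "barcode02.fastq"),
]
approx_size = [("sample1", "7000"), ("sample2", "3000")]

# The first element on the left is a metadata map, on the right a string,
# so no pair ever compares equal and the joined result is empty.
unmatched = join_on_first(named_samples, approx_size)
print(len(unmatched))
```

An empty result would be consistent with the log above, where no per-sample processing or assembly step is ever submitted.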

bs-az commented 1 year ago

I tested @desireetillo's suggestion and can confirm that something like this fixes the issue:

--- main.nf     2023-04-05 12:57:48.921601733 +0000
+++ .nextflow/assets/epi2me-labs/wf-clone-validation/main.nf    2023-04-05 12:49:52.668982569 +0000
@@ -430,13 +430,17 @@
         // a single per-sample fastq file
         named_samples = samples.map { it -> return tuple(it[1],it[0])}
         if(params.approx_size_sheet != null) {
+            named_samples_w_key = samples.map { it -> return tuple(it[1]["sample_id"],it[1],it[0]) }
             approx_size = Channel.fromPath(params.approx_size_sheet) \
             | splitCsv(header:true) \
             | map { row-> tuple(row.sample_id, row.approx_size) }
-            final_samples = named_samples.join(approx_size)}
+            not_final_samples = named_samples_w_key.join(approx_size)
+            final_samples = not_final_samples.map { it -> return tuple(it[1],it[2],it[3]) }
+        }
         else {
             final_samples = samples.map  { it -> return tuple(it[1],it[0], params.approx_size) }
         }
         sample_fastqs = combineFastq(final_samples)
         // Optionally filter the data, removing reads mapping to
         // the host or background genome
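
The idea behind the patch (add sample_id as the leading key, join, then drop the key) can be sketched in plain Python. This is only a model of the data flow under stated assumptions, not Nextflow code: the join is assumed to match on the first element, and every value below is hypothetical. The variable names mirror those in the diff.

```python
def join_on_first(left, right):
    """Pair items from two lists whose first elements are equal."""
    joined = []
    for l_key, *l_rest in left:
        for r_key, *r_rest in right:
            if l_key == r_key:
                joined.append((l_key, *l_rest, *r_rest))
    return joined


# Hypothetical (fastq, metadata) pairs, in the order the diff indexes them
# (it[0] is the data path, it[1] is the metadata map).
samples = [
    ("barcode01.fastq",
     {"type": "test_sample", "barcode": "barcode01", "sample_id": "sample1"}),
    ("barcode02.fastq",
     {"type": "test_sample", "barcode": "barcode02", "sample_id": "sample2"}),
]
approx_size = [("sample1", "7000"), ("sample2", "3000")]

# Step 1: put sample_id in front so both sides share a key.
named_samples_w_key = [(meta["sample_id"], meta, fastq) for fastq, meta in samples]

# Step 2: join on that key, giving (sample_id, meta, fastq, approx_size).
not_final_samples = join_on_first(named_samples_w_key, approx_size)

# Step 3: drop the key, leaving (meta, fastq, approx_size).
final_samples = [(meta, fastq, size) for _key, meta, fastq, size in not_final_samples]

for meta, fastq, size in final_samples:
    print(meta["sample_id"], fastq, size)
```

The resulting tuples have the same three-element shape as the else branch in the diff, which passes params.approx_size as the third element.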

sarahjeeeze commented 1 year ago

Hi, thanks for the suggestion. We have updated this in the latest release, so it should be fixed now.

sarahjeeeze commented 1 year ago

Closing, as this should no longer be an issue: we no longer use the approx size sheet.
