Open lingyi-owl opened 3 years ago
Hi,
IMO, an empty file means a failed job - maybe not technically since seqtk apparently exits ok. But this should be somehow made clear, both to the user and in the pipeline logic.
I am trying a refactoring (yet another) to deal with this the snakemake way. I am using snakemake's modules.
My idea is that the whole workflow can be split into several different sub-pipelines.
Step 2,3,4 take as input assembled scaffolds. If you would run any of this independent of (1) you would - or should- know that your input is ok, i.e. it contains something.
This is what I am trying to achieve. The final output of (1) is a new samplesheet of samples with valid fasta files. This will serve as an input to 2,3,4 and a control point where if things go wrong , e.g. none of the samples produces long enough assemblies. If that is the case, then the pipeline will not continue to the rest of the steps and the user must check his input.
I believe this will also make things more modular and one can run any of modules 2,3,4 if they have a collection of fasta files. I am not there yet but this would be the idea.
I will ping you once I have something more concrete we can discuss.
Hi @lingyi-owl ,
I made a start to using module on branch 1-check-empty . For now preprocessing and characterize are included. This is a bit loaded so happy to take you through it once we get the chance.
I will not open a new PR if we don't agree on this new proposed structure.
https://github.com/MGXlab/virus_identification_tools_benchmarking/tree/1-check-empty
Btw, I am using the convention < Issue-number > - < some short descriptive name>
for naming new branches. I believe it is good to keep track of things and discussion relevant per issue/branch.
@papanikos I changed the strategy of running the programs of transeq and hmmsearch so each sample is run separately after length filtering step. Previously, samples were concatenated together before running these programs. The new strategy would make more sense when few samples are added into the workflow in the later phase.
Some samples have no contigs greater than 1500 bp and thus have an empty output file after length filtering step. Now the downstream program (transeq) would run for the rest of the samples and throw an error as below.
It would be better to show an error message like "no contig is greater than 1500bp in XXX sample, so the XXX program neglected this sample..."