ignore samples with no contigs greater than 1500bp in transeq, minimap and wtp

lingyi-owl commented 3 years ago

@papanikos I changed the strategy of running the programs of transeq and hmmsearch so each sample is run separately after length filtering step. Previously, samples were concatenated together before running these programs. The new strategy would make more sense when few samples are added into the workflow in the later phase.

Some samples have no contigs greater than 1500 bp and thus have an empty output file after length filtering step. Now the downstream program (transeq) would run for the rest of the samples and throw an error as below.

It would be better to show an error message like "no contig is greater than 1500bp in XXX sample, so the XXX program neglected this sample..."

Translate nucleic acid sequences
Translate nucleic acid sequences
Translate nucleic acid sequences
Error: Unable to read sequence 'results/microbial/SRR5665121/scaffolds/SRR5665121_scaffolds_gt1500.fasta'
Died: transeq terminated: Bad value for '-sequence' and no prompt
[Sat Aug 28 23:42:59 2021]
Error in rule transeq:
    jobid: 40
    output: results/microbial/SRR5665121/scaffolds/SRR5665121_transeq.fasta
    log: logs/microbial/SRR5665121/transeq/SRR5665121.transeq.log (check log file(s) for error message)
    conda-env: /net/phage/linuxhome/mgx/people/lingyi/fecal_snakemake_test/.snakemake/conda/688f8d31388dbe802b1aa9aa3fa56ab5
    shell:
        transeq -sequence results/microbial/SRR5665121/scaffolds/SRR5665121_scaffolds_gt1500.fasta -outseq results/microbial/SRR5665121/scaffolds/SRR5665121_transeq.fasta -frame 6 -clean
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /net/phage/linuxhome/mgx/people/lingyi/fecal_snakemake_test/.snakemake/log/2021-08-28T234007.151332.snakemake.log

papanikos commented 3 years ago

Hi,

IMO, an empty file means a failed job - maybe not technically since seqtk apparently exits ok. But this should be somehow made clear, both to the user and in the pipeline logic.

I am trying a refactoring (yet another) to deal with this the snakemake way. I am using snakemake's modules.

My idea is that the whole workflow can be split into several different sub-pipelines.

preprocessing + assembly
characterization
identification
clustering

Step 2,3,4 take as input assembled scaffolds. If you would run any of this independent of (1) you would - or should- know that your input is ok, i.e. it contains something.

This is what I am trying to achieve. The final output of (1) is a new samplesheet of samples with valid fasta files. This will serve as an input to 2,3,4 and a control point where if things go wrong , e.g. none of the samples produces long enough assemblies. If that is the case, then the pipeline will not continue to the rest of the steps and the user must check his input.

I believe this will also make things more modular and one can run any of modules 2,3,4 if they have a collection of fasta files. I am not there yet but this would be the idea.

I will ping you once I have something more concrete we can discuss.

papanikos commented 3 years ago

Hi @lingyi-owl ,

I made a start to using module on branch 1-check-empty . For now preprocessing and characterize are included. This is a bit loaded so happy to take you through it once we get the chance.

I will not open a new PR if we don't agree on this new proposed structure.

https://github.com/MGXlab/virus_identification_tools_benchmarking/tree/1-check-empty

Btw, I am using the convention < Issue-number > - < some short descriptive name> for naming new branches. I believe it is good to keep track of things and discussion relevant per issue/branch.

MGXlab / virus_identification_tools_benchmarking

ignore samples with no contigs greater than 1500bp in transeq, minimap and wtp #3