Hoohm / dropSeqPipe

A SingleCell RNASeq pre-processing snakemake workflow
Creative Commons Attribution Share Alike 4.0 International
147 stars 47 forks source link

Error in plot_yield, probably due to pigz bug in bbmap logs #76

Closed seb-mueller closed 5 years ago

seb-mueller commented 5 years ago

Running the new develop branch, I get get the following error:

localrule plot_yield:
...

Rscript /home/sm934/analysis/dropseq/data/project/.snakemake/scripts/tmpsdzskbpk.plot_yield.R
Activating conda environment: /home/sm934/.conda/myevns/5e8327d1
Conda environment defines Python version < 3.5. Using Python of the master process to execute script. Note that this cannot be avoided, because the script uses data structures from Snakemake which are Python >=3.5 only.
/applications/anaconda/anaconda3/bin/python /home/sm934/analysis/dropseq/data/project/.snakemake/scripts/tmp587a27g7.wrapper.py
Activating conda environment: /home/sm934/.conda/myevns/7a468653
[Mon Feb 25 00:44:34 2019]
Error in rule plot_yield:
    jobid: 14
    output: results/plots/yield.pdf
    conda-env: /home/sm934/.conda/myevns/5e8327d1

RuleException:
CalledProcessError in line 218 of /home/sm934/analysis/dropseq/software/dropSeqPipe/rules/map.smk:
Command 'source activate /home/sm934/.conda/myevns/5e8327d1; set -euo pipefail;  Rscript /home/sm934/analysis/dropseq/data/project/.snakemake/scripts/tmpsdzskbpk.plot_yield.R ' returned non-zero exit status 1.
  File "/home/sm934/analysis/dropseq/software/dropSeqPipe/rules/map.smk", line 218, in __rule_plot_yield
  File "/applications/anaconda/anaconda3/lib/python3.6/concurrent/futures/thread.py", line 55, in run

After a bit of debugging, I found this was caused by only 1 out of 3 samples. Turns out there are unexpected lines in the logs/bbmap/sample1_repair.txt which is read in by plot_yield.R:

java -ea -Xmx30g -cp /home/sm934/.conda/myevns/c77f6b18/opt/bbmap-38.22-0/current/ jgi.SplitPairsAndSingles rp -Xmx30g in=results/samples/sample1/trimmmed_R1.fastq.gz in2=results/samples/sample1/trimmmed_R2.fastq.gz out1=results/samples/sample1/trimmmed_repaired_R1.fastq.gz out2=results/samples/sample1/trimmmed_repaired_R2.fastq.gz repair=t threads=4
Executing jgi.SplitPairsAndSingles [rp, -Xmx30g, in=results/samples/sample1/trimmmed_R1.fastq.gz, in2=results/samples/sample1/trimmmed_R2.fastq.gz, out1=results/samples/sample1/trimmmed_repaired_R1.fastq.gz, out2=results/samples/sample1/trimmmed_repaired_R2.fastq.gz, repair=t, threads=4]

Set threads to 4
Set INTERLEAVED to false
Started output stream.
pigz: skipping: results/samples/sample1/trimmmed_R2.fastq.gz: corrupted -- invalid deflate data (invalid stored block lengths)
pigz: abort: internal threads error

Input:                      32282829 reads      1932313018 bases.
Result:                     32282829 reads (100.00%)    1932313018 bases (100.00%)
Pairs:                      31516718 reads (97.63%)     1899146948 bases (98.28%)
Singletons:                 766111 reads (2.37%)    33166070 bases (1.72%)

Whereas this is for sample2_repair.txt which seems fine.

java -ea -Xmx30g -cp /home/sm934/.conda/myevns/c77f6b18/opt/bbmap-38.22-0/current/ jgi.SplitPairsAndSingles rp -Xmx30g in=results/samples/sample2/trimmmed_R1.fastq.gz in2=results/samples/sample2/trimmmed_R2.fastq.gz out1=results/samples/sample2/trimmmed_repaired_R1.fastq.gz out2=results/samples/sample2/trimmmed_repaired_R2.fastq.gz repair=t threads=4
Executing jgi.SplitPairsAndSingles [rp, -Xmx30g, in=results/samples/sample2/trimmmed_R1.fastq.gz, in2=results/samples/sample2/trimmmed_R2.fastq.gz, out1=results/samples/sample2/trimmmed_repaired_R1.fastq.gz, out2=results/samples/sample2/trimmmed_repaired_R2.fastq.gz, repair=t, threads=4]

Set threads to 4
Set INTERLEAVED to false
Started output stream.

Input:                      139102905 reads         8696092621 bases.
Result:                     139102905 reads (100.00%)   8696092621 bases (100.00%)
Pairs:                      136437092 reads (98.08%)    8571461417 bases (98.57%)
Singletons:                 2665813 reads (1.92%)   124631204 bases (1.43%)

Seems like one of the files couldn't be decompressed properly with pigz. Not sure if this is severe enough to contact the pigz developers, but others have seen this error before: https://github.com/madler/pigz/issues/44

I'll try to debug some more, but I suppose to add some sort of error handling if that happens for other data in the future either in plot_adapter.R or ideally fixing the pigz to not give that error in the first place.

seb-mueller commented 5 years ago

Having done some more research, it seems both trimmmed_repaired_R1.fastq.gz and trimmmed_repaired_R2.fastq.gz were produced with the same number of lines despite the pigz: abort. This is in line with the above, with the stats being reported for both samples.

As a note, the culprit call invoking pigz seems repair.sh:

repair.sh        -Xmx30g        in=results/samples/sample21/trimmmed_R1.fastq.gz        in2=results/samples/sample21/trimmmed_R2.fastq.gz        out1=results/samples/sample21/trimmmed_repaired_R1.fastq.gz        out2=results/samples/sample21/trimmmed_repaired_R2.fastq.gz        repair=t        threads=4 2> results/logs/bbmap/sample21_repair.txt
Activating conda environment: /home/sm934/.conda/myevns/c77f6b18

I'm inclined to have plot_yield.R to deal with those warnings, any objections?

Hoohm commented 5 years ago

So there is a warning but it doesn't change the output?

If so, just discarding the "pigz" lines in plot_yield should do the trick yes.

seb-mueller commented 5 years ago

Yes, this should do the trick, I'll get on it next thing. Still, there might still be a underlying issue which might or might not have effects on the processing, that why I provided all this informations to be aware of a potential problem. Still not sure if it's sample specific or a bug in pigz. I'd therefore advice to inspect the logs/bbmap/ files in case something is odd in the future.

seb-mueller commented 5 years ago

The fix works for, so closing the issue.