EichlerLab / smrtsv2

Structural variant caller
MIT License
53 stars 6 forks source link

Error in rule detect_gaps_annotate_coverage #21

Closed rozaimirazali closed 5 years ago

rozaimirazali commented 5 years ago

Hi. When I run the following command in the terminal manually, it works fine:

bedtools intersect -a detect/gaps/gaps_no_cov_ins.bed -b detect/coverage/coverage.bed -sorted -wao | awk 'OFS="\t" { if ($13 == ".") { $13 = 0 } print }' | cut -f 1-6,8- | bedtools groupby -i stdin -g 1,2,3,4,5,6,7,8 -c 12 -o mean

Could you please explain why I am getting error here?

[Sun Mar 10 16:04:42 2019] rule detect_gaps_annotate_coverage: input: detect/gaps/gaps_no_cov_ins.bed, detect/coverage/coverage.bed output: detect/gaps/gaps_with_cov_ins.bed jobid: 25 wildcards: svtype=ins

[Sun Mar 10 16:04:42 2019] Error in rule detect_gaps_annotate_coverage: jobid: 25 output: detect/gaps/gaps_with_cov_ins.bed

RuleException: CalledProcessError in line 438 of /gpfs/software/genomics/smrtsv2/smrtsv2/rules/detect.snakefile: Command ' set -euo pipefail; bedtools intersect -a detect/gaps/gaps_no_cov_ins.bed -b detect/coverage/coverage.bed -sorted -wao | awk 'OFS="\t" { if ($13 == ".") { $13 = 0 } print }' | cut -f 1-6,8- | bedtools groupby -i stdin -g 1,2,3,4,5,6,7,8 -c 12 -o mean >detect/gaps/gaps_with_cov_ins.bed ' returned non-zero exit status 1. File "/gpfs/software/genomics/smrtsv2/smrtsv2/rules/detect.snakefile", line 438, in __rule_detect_gaps_annotate_coverage File "/gpfs/software/genomics/smrtsv2/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in run Removing output files of failed job detect_gaps_annotate_coverage since they might be corrupted: detect/gaps/gaps_with_cov_ins.bed Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /gpfs/projects/SDR_gref_ymokrab/sv/smrtsv2/analysis/.snakemake/log/2019-03-10T111104.617892.snakemake.log

paudano commented 5 years ago

I see the output from Snakemake, but not from the command itself. Snakemake reports that an error occurred, but the details are not there. If you are running it distrubuted, was that redirected into a cluster log file for this job?

rozaimirazali commented 5 years ago

...redirected into a cluster log file for this job?

Sorry, do you mean the stdout and stderr of LSF?

The stdout file shows:

Detecting variant signatures Running snakemake command:

  • snakemake
  • --snakefile
  • /gpfs/software/genomics/smrtsv2/smrtsv2/rules/detect.snakefile
  • --rerun-incomplete
  • -j
  • 1
  • detect_group_merge_regions
  • --config
  • assembly_window_size=60000
  • assembly_window_slide=20000
  • mapping_quality=30
  • min_length=50
  • min_support=5
  • max_support=100
  • min_coverage=5
  • max_coverage=100
  • min_hardstop_support=11
  • max_candidate_length=60000
  • ld_path=/gpfs/software/genomics/smrtsv2/smrtsv2/dep/lib:/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/lib:/gpfs/software/tools/java/1.8.121/lib:/opt/ibm/lsfsuite/ext/ppm/10.2/linux2.6-glibc2.3-x86_64/lib
  • path=/gpfs/software/genomics/smrtsv2/smrtsv2/dep/bin:/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:/gpfs/software/genomics/RepeatMasker/4.0.7/bin:/gpfs/software/genomics/smrtsv2/smrtsv2/:/gpfs/software/genomics/smrtsv2/smrtsv2/dep/bin:/gpfs/software/tools/java/1.8.121/bin:/gpfs/software/genomics/pb-falcon/src/bin:/gpfs/home/rmohamadrazali/software/canu-1.8/Linux-amd64/bin:/opt/ibm/lsfsuite/lsf/10.1/linux2.6-glibc2.3-x86_64/etc:/opt/ibm/lsfsuite/ext/ppm/10.2/bin:/opt/ibm/lsfsuite/ext/ppm/10.2/linux2.6-glibc2.3-x86_64/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/gpfs/home/rmohamadrazali/.dotnet/tools:/opt/ibutils/bin:/bin:/usr/bin:/usr/lpp/mmfs/bin:/usr/lpp/mmfs/bin:/gpfs/home/rmohamadrazali/.local/bin:/gpfs/home/rmohamadrazali/bin
  • tempdir=tmp

The stderr file does show Failed to identify candidate regions in the last line, which is not there in the snakemake log file :

[Sun Mar 10 16:03:53 2019] rule detect_gaps_merge_batches: input: detect/gaps/batch/0_ins.bed, detect/gaps/batch/1_ins.bed, detect/gaps/batch/2_ins.bed, detect/gaps/batch/3_ins.bed, detect/gaps/batch/4_ins.bed, detect/gaps/batch/5_ins.bed, detect/gaps/batch/6_ins.bed, detect/gaps/batch/7_ins.bed, detect/gaps/batch/8_ins.bed, detect/gaps/batch/9_ins.bed output: detect/gaps/gaps_no_cov_ins.bed jobid: 29 wildcards: svtype=ins

Removing temporary output file detect/gaps/batch/2_ins.bed. Removing temporary output file detect/gaps/batch/9_ins.bed. Removing temporary output file detect/gaps/batch/5_ins.bed. Removing temporary output file detect/gaps/batch/0_ins.bed. Removing temporary output file detect/gaps/batch/7_ins.bed. Removing temporary output file detect/gaps/batch/1_ins.bed. Removing temporary output file detect/gaps/batch/6_ins.bed. Removing temporary output file detect/gaps/batch/4_ins.bed. Removing temporary output file detect/gaps/batch/8_ins.bed. Removing temporary output file detect/gaps/batch/3_ins.bed. [Sun Mar 10 16:04:42 2019] Finished job 29. 38 of 56 steps (68%) done

[Sun Mar 10 16:04:42 2019] rule detect_gaps_annotate_coverage: input: detect/gaps/gaps_no_cov_ins.bed, detect/coverage/coverage.bed output: detect/gaps/gaps_with_cov_ins.bed jobid: 25 wildcards: svtype=ins

Error: Sorted input specified, but the file detect/coverage/coverage.bed has the following out of order record 10 50 100 0 [Sun Mar 10 16:04:42 2019] Error in rule detect_gaps_annotate_coverage: jobid: 25 output: detect/gaps/gaps_with_cov_ins.bed

RuleException: CalledProcessError in line 438 of /gpfs/software/genomics/smrtsv2/smrtsv2/rules/detect.snakefile: Command ' set -euo pipefail; bedtools intersect -a detect/gaps/gaps_no_cov_ins.bed -b detect/coverage/coverage.bed -sorted -wao | awk 'OFS="\t" { if ($13 == ".") { $13 = 0 } print }' | cut -f 1-6,8- | bedtools groupby -i stdin -g 1,2,3,4,5,6,7,8 -c 12 -o mean >detect/gaps/gaps_with_cov_ins.bed ' returned non-zero exit status 1. File "/gpfs/software/genomics/smrtsv2/smrtsv2/rules/detect.snakefile", line 438, in __rule_detect_gaps_annotate_coverage File "/gpfs/software/genomics/smrtsv2/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in run Removing output files of failed job detect_gaps_annotate_coverage since they might be corrupted: detect/gaps/gaps_with_cov_ins.bed Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /gpfs/projects/SDR_gref_ymokrab/sv/smrtsv2/analysis/.snakemake/log/2019-03-10T111104.617892.snakemake.log Failed to identify candidate regions

paudano commented 5 years ago

SMRT-SV is definitely sorting these input files correctly, so I am not sure why bedtools isn't happy about it. Something is different about using GRC contig names (1, 2, 3...) vs the UCSC names (chr1, chr2, ...). I still can't replicate it with a simple test.

All I can do is try a whole new SMRT-SV with the GRC contig names. It might take me some time to get to this point.

paudano commented 5 years ago

We are going to try replicating this problem with the GRC reference. It's going to take a few days, but if we find a bug, we will push a fix.

wharvey31 commented 5 years ago

Hi @rozaimirazali. I have taken over this ticket, and have been unable to replicate this error with the GRC contig named reference. Would you mind sending over the bed file under detect/coverage/coverage.bed? I want to run some specific bug discovery on it.

The only other theory is that it could be something originating from your bams. Are those publicly available? If so, I would love to take a look. Thanks!

rozaimirazali commented 5 years ago

Hi @rozaimirazali. I have taken over this ticket, and have been unable to replicate this error with the GRC contig named reference. Would you mind sending over the bed file under detect/coverage/coverage.bed? I want to run some specific bug discovery on it.

The only other theory is that it could be something originating from your bams. Are those publicly available? If so, I would love to take a look. Thanks!

Hi. Sorry for the long delay because I was away on holiday. anyway, please find the bed file here - https://we.tl/t-sgrINKxZ7R

paudano commented 5 years ago

This file is clearly not sorted. If I run cut -f1 coverage.bed | uniq -c, then it shows unsorted output.

     10 1
     11 15
      1 1
2050617 15
      1 1
     11 16
      1 1
1807085 16
      1 1
     11 17
      1 1

In this, the left column is the number of consecutive lines, and the right column is the chromosome.

When I sort using the same syntax SMRT-SV uses (sort -k 1,1 -k 2,2n coverage.bed > cov_sorted.bed), it is properly sorted.

4985013 1
2710695 10
2700131 11
2677038 12
2303398 13
2146991 14

If you run the sort command above on a cluster session, do you get sorted or unsorted output (use the uniq command to see if chromosomes are consecutive and in ASCII sort order)?

I usually run sort without a space after -k (e.g. sort -k1,1 -k2,2n). You can try that and see if it makes a difference, but I don't know why it would.

I don't think it would matter, but I added export LC_COLLATE=C to SMRT-SV. If it doesn't sort, try running this export and seeing if it changes the behavior.

I have no idea what else to suggest here. SMRT-SV runs sort, and you are not getting sorted output from it.

paudano commented 5 years ago

Were you able to look at this at all? If you re-run the sort command on the file, do you get sorted output?

rozaimirazali commented 5 years ago

Were you able to look at this at all? If you re-run the sort command on the file, do you get sorted output?

Initially, I sorted the coverage.bed file using

sort -k 1,1 -k 2,2n detect/coverage/coverage.bed> sorted.coverage.bed cut -f1 coverage.sorted.bed | uniq -c > sort.txt

and it does not look like it is sorted. See this file

sort.txt

Then, I sorted with the following:

sort -k1,1V -k2,2n detect/coverage/coverage.bed > sorted2.coverage.bed cut -f1 coverage.sorted.V.bed | uniq -c > v.txt

and I finally got it sorted

v.txt

From man sort:

   -V, --version-sort
          natural sort of (version) numbers within text

Question:

I have replaced the 'old unsorted' coverage.bed with the 'newly sorted' coverage.bed in detect/coverage/ . If I re-run smrtsv2, will it able to identify and use this new sorted bed file?

paudano commented 5 years ago

I think the bigger question is why didn't is sort to begin with? The pipeline uses sort in several places, so you would have to hunt down every call to sort and change it. I also can't guarantee that version sort is going to do the right thing in all these cases.

I would strongly suggest that you go to your IT team and figure out why sort doesn't do a textual sort on that first column. This is a pretty standard Linux tool, and the fact that it's not behaving as expected is concerning.

To answer your question, yes, you could replace the file and SMRT-SV will detect that it changed because of the timestamp difference. Then it will re-run from that step. I think if you do this, I can guarantee it's going to fail downstream unless the problem with sort on this system is worked out.

I am going to close this, but if you need to revisit this problem, you can re-open it.

rozaimirazali commented 5 years ago

I ran a fresh job in a new folder and I got the same error message.

[Wed May 1 19:10:24 2019] rule detect_gaps_annotate_coverage: input: detect/gaps/gaps_no_cov_del.bed, detect/coverage/coverage.bed output: detect/gaps/gaps_with_cov_del.bed jobid: 25 wildcards: svtype=del

Error: Sorted input specified, but the file detect/coverage/coverage.bed has the following out of order record 10 50 100 0 [Wed May 1 19:10:24 2019] Error in rule detect_gaps_annotate_coverage: jobid: 25 output: detect/gaps/gaps_with_cov_del.bed

RuleException: CalledProcessError in line 438 of /gpfs/software/genomics/smrtsv2/smrtsv2/rules/detect.snakefile: Command ' set -euo pipefail; bedtools intersect -a detect/gaps/gaps_no_cov_del.bed -b detect/coverage/coverage.bed -sorted -wao | awk 'OFS="\t" { if ($13 == ".") { $13 = 0 } print }' | cut -f 1-6,8- | bedtools groupby -i stdin -g 1,2,3,4,5,6,7,8 -c 12 -o mean >detect/gaps/gaps_with_cov_del.bed ' returned non-zero exit status 1. File "/gpfs/software/genomics/smrtsv2/smrtsv2/rules/detect.snakefile", line 438, in __rule_detect_gaps_annotate_coverage File "/gpfs/software/genomics/smrtsv2/smrtsv2/dep/conda/build/envs/python3/lib/python3.6/concurrent/futures/thread.py", line 55, in run Removing output files of failed job detect_gaps_annotate_coverage since they might be corrupted: detect/gaps/gaps_with_cov_del.bed Shutting down, this might take some time. Exiting because a job execution failed. Look above for error message Complete log: /gpfs/projects/SDR_gref_ymokrab/sv/smrtsv2/analysis2/.snakemake/log/2019-04-30T083543.793097.snakemake.log Failed to identify candidate regions

I have discussed with the IT team and they are getting the same unsorted output with the following commands:

sort -k 1,1 -k 2,2n detect/coverage/coverage.bed > cov_sorted.bed
cut -f1 cov_sorted.bed | uniq -c > sort.txt

We actually tested with three different OSes - Ubuntu, Redhat and MacOS. All of them requires -V to sort the chromosome properly.

Do you mind to tell me what version of sort and OS you used?

paudano commented 5 years ago

CentOS 6, sort 8.4. I was able to sort your file on an Arch machine with sort 8.3.1.

I wonder if it has something to do with LANG, or "LC" environment variables. Try export LC_ALL=C in a terminal session and then run sort in that session. If that fixes it, then make sure that is exported before running the pipeline.

I can't explain why it wouldn't at least sort the same records together even if they were out of order. Could it be interpreting it as variable-byte encoding?