brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0
222 stars 21 forks source link

"Missing pair end parameters:mean stdev" and histo files are empty. #195

Closed ggoodstudydaydayup closed 2 years ago

ggoodstudydaydayup commented 2 years ago

Hello @brentp,

I ran smoove like this --- smoove call --outdir results-smoove/ --name $sample --fasta $reference -p 1 --genotype ../${sample}.bam

For each BAM file, I have a "disc.bam", "disc.bam.bai", "disc.bam.orig.bam" and a "histo" file created. However, the histo files are empty. I check the bam file and there was nothing wrong. Meanwhile, when I ran like this --- samtools view -r ${sample} ${sample}.bam \ | tail -n+100000 \ | pairend_distro.py \ -r 150 \ -X 4 \ -N 10000 \ -o sample.lib1.histo The "histo" file was created and the mean value was calculated.

Could you please give me some advice to solve this problem?

Best wishes, Liu

brentp commented 2 years ago

can you show the full stderr and stdout when you run smoove? make sure to use a clean --outdir.

ggoodstudydaydayup commented 2 years ago

Here is the full stderr and stdout: "[smoove] 2022/04/06 09:46:18 starting with version 0.2.7 [smoove] 2022/04/06 09:46:18 calculating bam stats for 1 bams [smoove] 2022/04/06 09:46:25 done calculating bam stats [smoove]: ([E]lumpy-filter) 2022/04/06 09:52:17 [lumpy_filter] extracted splits and discordants from 292002619 total aligned reads [smoove]:2022/04/06 09:52:25 finished process: lumpy-filter (set -eu; lumpy_filter -f /data/liuyu/Script/SV/Fei_lab/SV/SV_calling/Msie_Msyl/ref/Msyl_Rfor_compare) in user-time:11m29.628743s system-time:49.598735s [smoove] 2022/04/06 10:10:12 removed 110332034 alignments out of 290199195 (38.02%) with low mapq, depth > 1000, or from excluded chroms from GD.disc.bam in 1068 seconds [smoove] 2022/04/06 10:10:12 removed 9147921 alignments out of 290199195 (3.15%) that were bad interchromosomals or flanked-splitters from GD.disc.bam [smoove] 2022/04/06 10:41:00 kept 841340 putative orphans [smoove] 2022/04/06 10:41:00 removed 598349 discordant orphans in 935 seconds [smoove] 2022/04/06 11:00:55 removed 20509144 singletons and isolated interchromosomals of 170719240 reads (12.01%) from GD.disc.bam in 3043 seconds [smoove] 2022/04/06 11:00:55 150210096 reads (51.76%) of the original 290199195 remain from GD.disc.bam [smoove] 2022/04/06 11:02:45 removed 6909921 alignments out of 11301778 (61.14%) with low mapq, depth > 1000, or from excluded chroms from GD.split.bam in 76 seconds [smoove] 2022/04/06 11:02:45 removed 1248549 alignments out of 11301778 (11.05%) that were bad interchromosomals or flanked-splitters from GD.split.bam [smoove] 2022/04/06 11:04:41 kept 457383 putative orphans [smoove] 2022/04/06 11:04:41 removed 34566 split orphans in 78 seconds [smoove] 2022/04/06 11:04:58 removed 1099985 singletons of 3143308 reads (34.99%) from GD.split.bam in 133 seconds [smoove] 2022/04/06 11:04:58 2043323 reads (18.08%) of the original 11301778 remain from GD.split.bam [smoove] 2022/04/06 11:04:59 starting lumpy [smoove] 2022/04/06 11:04:59 wrote lumpy command to results-smoove//GD-lumpy-cmd.sh [smoove] 2022/04/06 11:04:59 writing sorted, indexed file to results-smoove/GD-smoove.genotyped.vcf.gz [smoove] 2022/04/06 11:04:59 excluding variants with all unknown or homozygous reference genotypes [smoove] 2022/04/06 11:04:59 missing pair end parameters:mean stdev Program: ** (v 0.2.13) Author: Ryan Layer (rl6sf@virginia.edu) Summary: Find structural variations in various signals.

Usage: ** [OPTIONS]

Options: -g Genome file (defines chromosome order) [smoove] 2022/04/06 11:04:59 -e Show evidence for each call -w File read windows size (default 1000000) -mw minimum weight for a call -msw minimum per-sample weight for a call -tt trim threshold -x exclude file bed file -t temp file prefix, must be to a writeable directory -P output probability curve for each variant -b output BEDPE instead of VCF -sr bam_file:, id:, back_distance:, min_mapping_threshold:, weight:, [smoove] 2022/04/06 11:04:59 min_clip:, read_group:

    -pe     bam_file:<file name>,
            id:<sample name>,
            histo_file:<file name>,
            mean:<value>,
            stdev:<value>,
            read_length:<length>,
            min_non_overlap:<length>,
            discordant_z:<z value>,
            back_distance:<distance>,
            min_mapping_threshold:<mapping quality>,
            weight:<sample weight>,
            read_group:<string>

    -bedpe  bedpe_file:<bedpe file>,
            id:<sample name>,
            weight:<sample weight>

[smoove] 2022/04/06 11:05:01 > gsort version 0.1.4 [smoove] 2022/04/06 11:05:07 panic: runtime error: invalid memory address or nil pointer dereference [signal SIGSEGV: segmentation violation [smoove] 2022/04/06 11:05:07 code=0x1 addr=0x8 pc=0x546d43] " And the lumpy-cmd.sh like this : "set -euo pipefail set -euo pipefail; lumpy -msw 4 -mw 4 -t $(mktemp) -tt 0 -P -pe id:GD,bam_file:results-smoove//GD.disc.bam,histo_file:results-smoove//GD.histo,mean:0.00,stdev:0.00,read_length:150,min_non_overlap:150,discordant_z:2.75,back_distance:30,weight:1,min_mapping_threshold:20 -sr id:GD,bam_file:results-smoove//GD.split.bam,back_distance:10,weight:1,min_mapping_threshold:20"

Thanks in advance.

brentp commented 2 years ago

are the reads pairs? can you share the output of samtools flagstat?

ggoodstudydaydayup commented 2 years ago

Thanks for your reply. We used pair-end reads. Here is the result of samtools flagstat: 300886963 + 0 in total (QC-passed reads + QC-failed reads) 6062867 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 292002619 + 0 mapped (97.05% : N/A) 294824096 + 0 paired in sequencing 147412048 + 0 read1 147412048 + 0 read2 0 + 0 properly paired (0.00% : N/A) 284284226 + 0 with itself and mate mapped 1655526 + 0 singletons (0.56% : N/A) 81845926 + 0 with mate mapped to a different chr 23920134 + 0 with mate mapped to a different chr (mapQ>=5)

brentp commented 2 years ago

somehow you have 0 properly paired reads. that's the problem. smoove and lumpy use those to determine the expected range of insert sizes.

ggoodstudydaydayup commented 2 years ago

I'm not sure why the command line of lumpy could be used to create "histo" file. samtools view -r ${sample} ${sample}.bam | tail -n+100000 | pairend_distro.py -r 150 -X 4 -N 10000 -o sample.lib1.histo Is that right?

brentp commented 2 years ago

I don't know about that script. If you want to use smoove, you'll need a bam file that has some reads with a flag indicating proper pair. There's likely something wrong with your data, so I wouldn't try to circumvent this.

ggoodstudydaydayup commented 2 years ago

Thanks!