brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0
222 stars 21 forks source link

"Missing pair end parameters:mean stdev" - different read lengths creating problems? #148

Open JSegueni opened 3 years ago

JSegueni commented 3 years ago

Hi,

I'm trying to run smoove using the docker container. I have 12 BAM files with different read length (50 and 86). 0 reads were removed during the filtering phase.

As you can see in the log, I have a similar output as this issue: https://github.com/brentp/smoove/issues/70

However, when doing samtools flagstat, I have 0 issue with my BAM files, everything passes the QC.

In the command file created by smoove, I can read some stuff about read length, back distance (?) and discordant z (?) that I don't understand. Does the difference in read length keep me from getting the results?

Any help would be appreciated :)

Best, Mas.


[smoove] 2021/04/22 15:35:56 starting lumpy [smoove] 2021/04/22 15:35:56 wrote lumpy command to output/1st_SV-lumpy-cmd.sh [smoove] 2021/04/22 15:35:56 writing sorted, indexed file to output/1st_SV-smoove.genotyped.vcf.gz [smoove] 2021/04/22 15:35:56 excluding variants with all unknown or homozygous reference genotypes [smoove] 2021/04/22 15:35:56 missing pair end parameters:mean stdev [smoove] 2021/04/22 15:35:56

Program: ** (v 0.2.13) Author: Ryan Layer (rl6sf@virginia.edu) Summary: Find structural variations in various signals. [smoove] 2021/04/22 15:35:56 Usage: ** [OPTIONS]

Options: -g Genome file (defines chromosome order) [smoove] 2021/04/22 15:35:56 -e Show evidence for each call [smoove] 2021/04/22 15:35:56 -w File read windows size (default 1000000) -mw minimum weight for a call [smoove] 2021/04/22 15:35:56 -msw minimum per-sample weight for a call -tt trim threshold [smoove] 2021/04/22 15:35:56 -x exclude file bed file [smoove] 2021/04/22 15:35:56 -t temp file prefix, must be to a writeable directory -P output probability curve for each variant [smoove] 2021/04/22 15:35:56 -b output BEDPE instead of VCF -sr bam_file:, id:, [smoove] 2021/04/22 15:35:56 back_distance:, min_mapping_threshold:, [smoove] 2021/04/22 15:35:56 weight:, min_clip:, [smoove] 2021/04/22 15:35:56 read_group: [smoove] 2021/04/22 15:35:56 -pe bam_file:, [smoove] 2021/04/22 15:35:56 id:, histo_file:, mean:, stdev:, [smoove] 2021/04/22 15:35:56 read_length:, min_non_overlap:, [smoove] 2021/04/22 15:35:56 discordant_z:, [smoove] 2021/04/22 15:35:56 back_distance:, [smoove] 2021/04/22 15:35:56 min_mapping_threshold:, weight:, [smoove] 2021/04/22 15:35:56 read_group: [smoove] 2021/04/22 15:35:56 -bedpe bedpe_file:, [smoove] 2021/04/22 15:35:56 id:, weight: [smoove] 2021/04/22 15:35:56

[smoove] 2021/04/22 15:35:56 > gsort version 0.0.6 [smoove] 2021/04/22 15:35:59 Traceback (most recent call last): File "/opt/conda/envs/smoove-env/bin/svtyper", line 11, in [smoove] 2021/04/22 15:35:59 load_entry_point('svtyper==0.7.0', 'console_scripts', 'svtyper')() File "/opt/conda/envs/smoove-env/lib/python2.7/site-packages/svtyper/classic.py", line 575, in cli [smoove] 2021/04/22 15:35:59 sys.exit(main()) File "/opt/conda/envs/smoove-env/lib/python2.7/site-packages/svtyper/classic.py", line 568, in main [smoove] 2021/04/22 15:35:59 args.max_ci_dist) File "/opt/conda/envs/smoove-env/lib/python2.7/site-packages/svtyper/classic.py", line 150, in sv_genotype [smoove] 2021/04/22 15:35:59 sample = Sample.from_bam(bam_list[i], num_samp, min_lib_prevalence) [smoove] 2021/04/22 15:35:59 File "/opt/conda/envs/smoove-env/lib/python2.7/site-packages/svtyper/parsers.py", line 665, in from_bam [smoove] 2021/04/22 15:35:59 name = bam.header['RG'][0]['SM'] File "pysam/libcalignmentfile.pyx", line 548, in pysam.libcalignmentfile.AlignmentHeader.getitem [smoove] 2021/04/22 15:35:59 KeyError: 'RG' panic: exit status 1

goroutine 1 [running]: github.com/brentp/smoove/svtyper.check(...) /home/brentp/go/go/src/github.com/brentp/smoove/svtyper/svtyper.go:33 github.com/brentp/smoove/svtyper.Svtyper(0xbd6ec0, 0xc001aa4440, 0x7ffe7303ffdc, 0xd, 0xc0001ac300, 0xc, 0x10, 0x7ffe730402bd, 0x6, 0x7ffe7303ff92, ...) /home/brentp/go/go/src/github.com/brentp/smoove/svtyper/svtyper.go:159 +0x1818 github.com/brentp/smoove/lumpy.Main() /home/brentp/go/go/src/github.com/brentp/smoove/lumpy/lumpy.go:347 +0x44f main.main() /home/brentp/go/go/src/github.com/brentp/smoove/cmd/smoove/smoove.go:121 +0x1ce

brentp commented 3 years ago

are the reads paired? this is the error of interest: missing pair end parameters:mean stdev I'd start over removing the results dir. Re-run and paste the full stdout + stderr here.

JSegueni commented 3 years ago

Thanks for your comment. Some of the reads were paired but I'm only using the R1 for everyone in this run. Should I use the R1 and R2 reads when the reads are paired? Should I only use SE reads? I'm kind of lost as to what is the problem here.

stdout is empty.

I'm uploading stderr and the created sh file. For each BAM file, I have a "disc.bam", "disc.bam.bai", "disc.bam.orig.bam" and a "histo" file created. The histo files are empty and the rest of them are each 3-9Kb.

brentp commented 3 years ago

you'll need paired-end reads to run lumpy.

JSegueni commented 3 years ago

ok thank you

JSegueni commented 3 years ago

Hi, I just read here https://github.com/arq5x/lumpy-sv/issues/110 that I can use lumpy with single-end reads. Is there a way to make that work also with smoove that uses lumpy?

Best, Mas.

brentp commented 3 years ago

Hi, you can run lumpy with just split reads, but I won't support that in smoove. If you want to make a PR to support it (when .disc.bam is empty), then I'd consider it.

JohnUrban commented 2 years ago

I ran into this same error.

It wasn't totally surprising because I was using a data type not exactly expected by smoove -- paired-end HiC data. I have a specific application and know what I'm looking for in the results. Using HiC data this way may generate garbage for others, so I am not necessarily recommending it. I'm sure there are Hi-C-specific SV callers out there (btw can someone recommend one?).

Anyway, with the HiC dataset, >2/3 of the reads were put into the discordant BAM (at least by judging the BAM file size). This is b/c the reads are liable to be in any orientation, and any distance away.

Still - I guess the Smoove pipeline failed to find a mean and stdev on the "concordant reads". And the script it made failed.

The workaround in this case was to edit the script located at:

smoove_out_dir/${NAME}-cmd.sh

Specifically find the mean and stdev parameters, and manually edit them: Old:

mean:0.00,stdev:0.00

New:

mean:300.00,stdev:150.00

Entire context:

pe id:reads,bam_file:smoove/reads-sorted.disc.bam,histo_file:smoove/reads-sorted.histo,mean:300.00,stdev:150.00,read_length:101,min_non_overlap:101,discordant_z:2.75,back_distance:30,weight:1,min_mapping_threshold:20

Where did I get the mean and stdev from? I randomly sampled a million or so reads where both mates mapped to the same sequence, filtered for insert sizes <10kb and/or <1kb, and looked at the median and MAD insert size. The median and MAD would be a baseline, but since uninteresting stuff can still be longer in the HiC data, you can bump those up. Play around. The mean and stdev are a bit wonky and huge b/c of the geometrically (exponentially) distributed insert sizes in HiC data, but that also means using them would be more conservative in finding longish inserts "surprising".

I don't have all the answers. Def can't guarantee quality results. But it is a way to play with the data and push it past the error.

Best,

John

joehagmann commented 2 years ago

Dear John,

thanks for that hint. I had the same issue since I have a lot soft-masked reads. Did you also find a way to resume smoove after editing and running the lumpy command?

Best, Jörg

JohnUrban commented 2 years ago

Apologies Jorg -- I suppose I left my 'solution' open-ended in that regard. Also apologies for the delayed response -- I took last week off.

After editing the script called smoove_out_dir/${NAME}-cmd.sh as I mentioned above, just run the script:

bash smoove_out_dir/${NAME}-cmd.sh

This was all I needed to do. However, I may not be anticipating all possible scenarios. I was calling SVs on a single sample. Heck, it was only a couple weeks ago, but I am already forgetting the specifics of the analysis I did. I could provide you more specifics tomorrow if you think it would be helpful, but for now just try running the edited bash script as above. Perhaps someone else can comment on whether or not this will work on multiple samples, or whether or not it finishes out the pipeline.