Closed matthdsm closed 4 years ago
Edit,
Apparently, Peddy works just fine when using the vcf's from the joint
dir, even for single samples.
So I suppose, when running Peddy, we'd need to check the config if joint calling is used and select the correct input dir accordingly.
M
On another note, What are your thoughts on dumping peddy in favor of somalier? Same dev, same function, better performance and more functionality.
Cheers M
Hi Matthias,
somalier is super nice, but it would be a little bit of work to switch over in bcbio. We have peddy incorporated into the multiqc reports, and would need to do that with somalier. So this is a nice to have, IMO. If you're feeling up to it, we definitely accept pull requests though.
Great,
any chance on a fix for the current peddy approach while I look into working on somalier? I'll open a WIP PR in the future for you guys to comment on.
Cheers M
I know I'll probably sound like "that guy", but somalier is neither open source nor free software (in the freedom sense: yes, the source is available, but that doesn't mean anything) and has a non-commercial licensing clause. Therefore IMO it should be handled in a similar way as the old GATK 3 stuff.
Hi Luca,
Definitely not 'that guy', I just assumed the licensing was MIT, but I guess not. That stinks. peddy it is!
Okay, so somalier is no option. How about a fix for my original issue? As it is right now, Peddy doesn't work at all :(
Cheers M
No meaning to be a nag, but can the peddy issue be fixed for runs using joint calling?
Thanks M
Hi Matthias,
Sorry for not responding, yup let me see if we can get this to be fixed.
Great! Thanks! M
Hi Matt,
I can't reproduce this behavior, could you pass along the configuration in ../config/foo.yaml
so I can take a look?
Hi Rory,
I'm using the following config:
#bcbio v1.1.5
---
fc_name:
upload:
dir: ../final
resources:
tmp:
dir: /tmp/bcbio
details:
- analysis: variant2
genome_build: hg38
description:
metadata:
batch:
ped:
algorithm:
aligner: bwa
save_diskspace: true
coverage_interval: regional
mark_duplicates: true
recalibrate: false
realign: false
variantcaller: gatk-haplotype
variant_regions: analysis_regions.bed
jointcaller: gatk-haplotype-joint
effects: vep
effects_transcripts: all
vcfanno: [dbscsnv,dbnsfp]
tools_on:
- vep_splicesite_annotations
- gemini
- coverage_perbase
tools_off:
- gatk4
archive: cram-lossless
# add the path to your files here
files:
The command logs specify the following for peddy:
[2019-12-06T11:23Z] pbsnode-1: /Shared/Software/bin/peddy -p 5 --sites hg38 --plot --prefix /tmp/bcbio/tmpxg9oe44y/Proband_94_1652 /home/projects/bcbio_annotation/exomes/NVQExomes/NVQ_RUN_021/samples_NVQ021-merged/work/gatk-haplotype/D1907329.vcf.gz /home/projects/bcbio_annotation/exomes/NVQExomes/NVQ_RUN_021/samples_NVQ021-merged/work/gatk-haplotype/D1907329.ped 2> /tmp/bcbio/tmpxg9oe44y/run-stderr.log
As you can see, bcbio looks for the sample vcf in the gatk-haplotype
dir. When using joint calling, there are only gvcf files available that dir, which, as far as I know, are not compatible with peddy.
I suppose there should be some kind of trigger that tries looking for the files in the joint
dir when jointcaller
is specified.
Thank Matthias
FYI, I specify a jointcaller batch
and add a custom ped file for each trio. As I said when I opened the issue, I think it's a bit weird bcbio then only looks at the single sample vcf's.
I also already tested the same usecase without using the jointcaller
option, and everything seems to work as it should then:
[2019-12-06T03:43Z] server.pbs.service.consul: /Shared/Software/bin/peddy -p 5 --sites hg38 --plot --prefix /tmp/bcbio/tmpe2kfaxs5/Proband_19_03031 /home/projects/ECS/runs/NVQ_033/samples_NVQ033-merged/work/gatk-haplotype/Proband_19_03031-vepeffects-annotated-nomissingalt-filterSNP-filterINDEL.vcf.gz /home/projects/ECS/runs/NVQ_033/samples_NVQ033-merged/work/gatk-haplotype/Proband_19_03031-vepeffects-annotated-nomissingalt-filterSNP-filterINDEL.ped 2> /tmp/bcbio/tmpe2kfaxs5/run-stderr.log
bcbio uses the correct ped file and analyzes the vcf containing multiple samples.
M
Hi Rory,
Any updates regarding this issue? Thanks M
Thanks, sorry for the delay, the holidays put me behind. I'll have this fixed up next early next week.
Great to hear! Thanks a lot!
M
bump.
is there a way I can help with this? If you point me in the right direction, I could give it a shot.
Thanks again. M
Sorry, I suck so bad. I am working on it right now. Thanks for pinging me.
No problem! We're looking into updating our production version and this is something we REALLY need. Sorry for the nagging :/
Cheers M
Ok, I think this should be all set now. If joint calling is turned on, peddy will use the joint called file rather than the individual sample file. Thanks so much for being patient and sorry this took forever.
Awesome! Thanks a lot!
M
Hi,
We're looking into leaning a bit more on Peddy for QC of our trio analysis. However, I noticed bcbio does not take the defined trio's into account and uses a single sample vcf instead of the joint called vcf.
Point in case: (bcbio-nextgen-commands.log)
Where
Proband_19_06349
(the prefix) is the trio name. Since peddy is kind of useless for single samples, I don't really think this makes much sense. However, I do think this is just a matter of selecting the right files for the job. The vcf's are sourced from thegatk-haplotype
dir, so the files in thejoint
dir are completely neglected.Since bcbio already has systems in place to gather all the resulting vcf's in the
gemini
dir for annotation (regardless of variant calling strategy), I'd suggest sourcing the vcf's for peddy there. The same applies for the ped files, since those are already generated forvcf2db
.TL;DR Peddy should get its vcf inputs from the
gemini
dir instead of thegatk-haplotype
dir, since the execution breaks when using joint calling.Cheers M