bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
988 stars 354 forks source link

Peddy not running for runs using 'jointcaller' #3024

Closed matthdsm closed 4 years ago

matthdsm commented 4 years ago

Hi,

We're looking into leaning a bit more on Peddy for QC of our trio analysis. However, I noticed bcbio does not take the defined trio's into account and uses a single sample vcf instead of the joint called vcf.

Point in case: (bcbio-nextgen-commands.log)

[2019-11-20T11:34Z] /home/galaxy/bcbio/tools/bin/peddy -p 5 --sites hg38 --plot --prefix /tmp/bcbio/tmp0bup6pya/Proband_19_06349 /home/projects/bcbio_annotation/exomes/MISC/issues_1119/samples_issues1119-merged/work/gatk-haplotype/D1906123.vcf.gz /home/projects/bcbio_annotation/exomes/MISC/issues_1119/samples_issues1119-merged/work/gatk-haplotype/D1906123.ped 2> /tmp/bcbio/tmp0bup6pya/run-stderr.log

Where Proband_19_06349 (the prefix) is the trio name. Since peddy is kind of useless for single samples, I don't really think this makes much sense. However, I do think this is just a matter of selecting the right files for the job. The vcf's are sourced from the gatk-haplotype dir, so the files in the joint dir are completely neglected.

Since bcbio already has systems in place to gather all the resulting vcf's in the gemini dir for annotation (regardless of variant calling strategy), I'd suggest sourcing the vcf's for peddy there. The same applies for the ped files, since those are already generated for vcf2db.

TL;DR Peddy should get its vcf inputs from the gemini dir instead of the gatk-haplotype dir, since the execution breaks when using joint calling.

Cheers M

matthdsm commented 4 years ago

Edit, Apparently, Peddy works just fine when using the vcf's from the joint dir, even for single samples. So I suppose, when running Peddy, we'd need to check the config if joint calling is used and select the correct input dir accordingly.

M

matthdsm commented 4 years ago

On another note, What are your thoughts on dumping peddy in favor of somalier? Same dev, same function, better performance and more functionality.

Cheers M

roryk commented 4 years ago

Hi Matthias,

somalier is super nice, but it would be a little bit of work to switch over in bcbio. We have peddy incorporated into the multiqc reports, and would need to do that with somalier. So this is a nice to have, IMO. If you're feeling up to it, we definitely accept pull requests though.

matthdsm commented 4 years ago

Great,

any chance on a fix for the current peddy approach while I look into working on somalier? I'll open a WIP PR in the future for you guys to comment on.

Cheers M

lbeltrame commented 4 years ago

I know I'll probably sound like "that guy", but somalier is neither open source nor free software (in the freedom sense: yes, the source is available, but that doesn't mean anything) and has a non-commercial licensing clause. Therefore IMO it should be handled in a similar way as the old GATK 3 stuff.

roryk commented 4 years ago

Hi Luca,

Definitely not 'that guy', I just assumed the licensing was MIT, but I guess not. That stinks. peddy it is!

matthdsm commented 4 years ago

Okay, so somalier is no option. How about a fix for my original issue? As it is right now, Peddy doesn't work at all :(

Cheers M

matthdsm commented 4 years ago

No meaning to be a nag, but can the peddy issue be fixed for runs using joint calling?

Thanks M

roryk commented 4 years ago

Hi Matthias,

Sorry for not responding, yup let me see if we can get this to be fixed.

matthdsm commented 4 years ago

Great! Thanks! M

roryk commented 4 years ago

Hi Matt,

I can't reproduce this behavior, could you pass along the configuration in ../config/foo.yaml so I can take a look?

matthdsm commented 4 years ago

Hi Rory,

I'm using the following config:

#bcbio v1.1.5
---
fc_name:
upload:
  dir: ../final
resources:
  tmp:
    dir: /tmp/bcbio
details:
  - analysis: variant2
    genome_build: hg38
    description:
    metadata:
      batch:
      ped:
    algorithm:
      aligner: bwa
      save_diskspace: true
      coverage_interval: regional
      mark_duplicates: true
      recalibrate: false
      realign: false
      variantcaller: gatk-haplotype
      variant_regions: analysis_regions.bed
      jointcaller: gatk-haplotype-joint
      effects: vep
      effects_transcripts: all
      vcfanno: [dbscsnv,dbnsfp]
      tools_on:
        - vep_splicesite_annotations
        - gemini
        - coverage_perbase
      tools_off:
        - gatk4
      archive: cram-lossless
    # add the path to your files here
    files:

The command logs specify the following for peddy:

[2019-12-06T11:23Z] pbsnode-1: /Shared/Software/bin/peddy -p 5 --sites hg38 --plot --prefix /tmp/bcbio/tmpxg9oe44y/Proband_94_1652 /home/projects/bcbio_annotation/exomes/NVQExomes/NVQ_RUN_021/samples_NVQ021-merged/work/gatk-haplotype/D1907329.vcf.gz /home/projects/bcbio_annotation/exomes/NVQExomes/NVQ_RUN_021/samples_NVQ021-merged/work/gatk-haplotype/D1907329.ped 2> /tmp/bcbio/tmpxg9oe44y/run-stderr.log

As you can see, bcbio looks for the sample vcf in the gatk-haplotype dir. When using joint calling, there are only gvcf files available that dir, which, as far as I know, are not compatible with peddy. I suppose there should be some kind of trigger that tries looking for the files in the joint dir when jointcaller is specified.

Thank Matthias

matthdsm commented 4 years ago

FYI, I specify a jointcaller batch and add a custom ped file for each trio. As I said when I opened the issue, I think it's a bit weird bcbio then only looks at the single sample vcf's.

I also already tested the same usecase without using the jointcaller option, and everything seems to work as it should then:

[2019-12-06T03:43Z] server.pbs.service.consul: /Shared/Software/bin/peddy -p 5 --sites hg38 --plot --prefix /tmp/bcbio/tmpe2kfaxs5/Proband_19_03031 /home/projects/ECS/runs/NVQ_033/samples_NVQ033-merged/work/gatk-haplotype/Proband_19_03031-vepeffects-annotated-nomissingalt-filterSNP-filterINDEL.vcf.gz /home/projects/ECS/runs/NVQ_033/samples_NVQ033-merged/work/gatk-haplotype/Proband_19_03031-vepeffects-annotated-nomissingalt-filterSNP-filterINDEL.ped 2> /tmp/bcbio/tmpe2kfaxs5/run-stderr.log

bcbio uses the correct ped file and analyzes the vcf containing multiple samples.

M

matthdsm commented 4 years ago

Hi Rory,

Any updates regarding this issue? Thanks M

roryk commented 4 years ago

Thanks, sorry for the delay, the holidays put me behind. I'll have this fixed up next early next week.

matthdsm commented 4 years ago

Great to hear! Thanks a lot!

M

matthdsm commented 4 years ago

bump.

is there a way I can help with this? If you point me in the right direction, I could give it a shot.

Thanks again. M

roryk commented 4 years ago

Sorry, I suck so bad. I am working on it right now. Thanks for pinging me.

matthdsm commented 4 years ago

No problem! We're looking into updating our production version and this is something we REALLY need. Sorry for the nagging :/

Cheers M

roryk commented 4 years ago

Ok, I think this should be all set now. If joint calling is turned on, peddy will use the joint called file rather than the individual sample file. Thanks so much for being patient and sorry this took forever.

matthdsm commented 4 years ago

Awesome! Thanks a lot!

M