bcbio / bcbio.variation.recall

Parallel merging, squaring off and ensemble calling for genomic variants

ensemble caller #14

Open ramaniak opened 7 years ago

ramaniak commented 7 years ago

Hello Brad, I have a question/request about the output of the ensemble variant caller, specifically the format field. At the moment the ensemble vcf file reports the format field from the first of the 'n' input files in which the variant appears. For example, if I pass vcf files from mutect2, strelka, vardict, and muse as input and the variant in question appears in mutect2 and strelka, the format field is taken from mutect2. So the format field of a given record could come from mutect2, strelka, or vardict (but never muse, since I require the variant to be in at least 2 callers). This means the format field is no longer uniform across records. Is there any way to fix this so that a specific set of format fields is reported irrespective of the input vcf files? This is probably not the easiest thing to do, but I thought I'd ask you anyway.

thanks Arun

chapmanb commented 7 years ago

Arun; Thanks for starting this discussion. Unfortunately it is quite difficult to normalize these to a single set of input fields, hence the current approach which is the best we can reasonably manage. Most of the format field values are calculated internally in the callers so this would require recalling or otherwise interfacing directly with a variant caller. The ensemble method here is meant to be more lightweight than that so takes the imperfect simplified approach instead. Sorry to not have a good solution but hope this helps explain the current implementation.
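To illustrate the trade-off, here is a minimal sketch of the lightweight part of such a normalization: projecting each record onto a fixed list of FORMAT keys and writing `.` where a caller did not emit a key. The function name `project_format` and the default key list are hypothetical and not part of bcbio.variation.recall. The sketch does not recompute anything, so keys a caller never reports simply stay missing, which is the limitation described above.

```python
def project_format(format_col, sample_cols, keep=("GT", "AD", "DP")):
    """Restrict one record's FORMAT and sample columns to the keys in `keep`.

    format_col  -- the FORMAT column, e.g. "DP:FDP:AU"
    sample_cols -- list of per-sample strings, e.g. ["30:0:28,29", "25:1:0,0"]
    Keys absent from the record are filled with "." (the VCF missing value).
    """
    keys = format_col.split(":")
    new_samples = []
    for sample in sample_cols:
        # Pair each key with its value for this sample.
        values = dict(zip(keys, sample.split(":")))
        new_samples.append(":".join(values.get(k, ".") for k in keep))
    return ":".join(keep), new_samples


# Illustrative record that carries DP but neither GT nor AD.
fmt, samples = project_format("DP:FDP:AU", ["30:0:28,29", "25:1:0,0"])
print(fmt, samples)
```

Only DP survives in this example; GT and AD come out as `.` for both samples, so the columns are uniform in shape but not in content.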

ramaniak commented 7 years ago

Thanks, I completely understand. Before writing up a script to do this, I searched the web to not re-invent the wheel and came across this: https://github.com/tjparnell/HCI-Scripts/blob/master/SomaticVariants/update_somaticVCF_attributes.pl

A good start or so it seems

Arun

chapmanb commented 7 years ago

Arun; Thanks for the pointer. We'd definitely have interest in pointing at normalization scripts if you build something based on that starting point. The tricky part is handling all the callers and special cases, which it does make a good start on. Thanks again.

ramaniak commented 7 years ago

I agree. Will keep you posted on any updates.

thanks

ramaniak commented 7 years ago

Sorry for closing and re-opening. I just realized there was another issue, which might not be relevant to the ensemble calling per se.

Currently, I am using the ensemble caller for somatic calls based on mutect2, muse, vardict, strelka and caveman. I am asking the caller to report any calls in at least 2 variant callers.

The issue I noticed concerns the order in which each of the callers reports the tumour and normal sample columns. Here is the header line from each of these callers.

**CAVEMAN**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOUR

**Mutect2**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

**Vardict**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR

**muse** 
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT TUMOR NORMAL

**strelka**
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NORMAL TUMOR 

As you can see from the headers, mutect2 and muse report TUMOR before NORMAL, whereas Caveman, Vardict and Strelka report NORMAL before TUMOR (Caveman also spells it TUMOUR).

The ensemble caller's header also lists TUMOR before NORMAL, but each record keeps the sample order of the variant caller it came from. Therefore when a variant is seen in, say, caveman and vardict, the normal and tumour format fields get switched relative to the header.

I am not quite sure how to deal with this just yet. I am checking whether I can change the default order of TUMOUR and NORMAL in the variant callers.

Thanks Arun

chapmanb commented 7 years ago

Arun; You will have to ensure samples have consistent sample ordering prior to feeding into ensemble calling. Thank you for highlighting this requirement. Somatic callers do have different behavior in terms of sample ordering and naming, so this requires some work upstream to normalize. This is done automatically in bcbio (https://github.com/chapmanb/bcbio-nextgen) pipelines so is not a part of the more standalone ensemble calling here. Hope this helps.
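For standalone use, a minimal sketch of that upstream normalization might look like the following. It assumes uncompressed, tab-delimited VCF text whose sample columns are named NORMAL and TUMOR (or TUMOUR); the function name `reorder_samples` and the `ALIASES` table are illustrative and not taken from bcbio or bcbio.variation.recall.

```python
import io

# Hypothetical alias table: caller-specific sample names -> canonical names.
ALIASES = {"TUMOUR": "TUMOR"}


def reorder_samples(lines, wanted=("NORMAL", "TUMOR")):
    """Yield VCF lines with sample columns renamed and emitted in `wanted` order."""
    order = None  # indices of the sample columns to emit, set at the #CHROM line
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the tenth column (index 9).
            names = [ALIASES.get(n, n) for n in fields[9:]]
            missing = [w for w in wanted if w not in names]
            if missing:
                raise ValueError(f"samples not found in header: {missing}")
            order = [9 + names.index(w) for w in wanted]
            yield "\t".join(fields[:9] + list(wanted))
        else:
            if order is None:
                raise ValueError("record seen before the #CHROM header line")
            yield "\t".join(fields[:9] + [fields[i] for i in order])


# Usage with a mutect2-style column order (TUMOR before NORMAL).
demo = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tTUMOR\tNORMAL\n"
    "1\t100\t.\tA\tT\t.\tPASS\t.\tGT\t0/1\t0/0\n"
)
for out_line in reorder_samples(demo):
    print(out_line)
```

Running each caller's VCF through the same `wanted` order before ensemble calling keeps the header and the per-record columns in agreement.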

ramaniak commented 7 years ago

Hello Brad, Yes, that's what I did: I modified the vcfs for sample ordering before feeding them into the bcbio ensemble calling.

Good to know it is automatically done with the pipeline!

thanks Arun

