High FDR on dream synthetic dataset 3

gnarzisi commented 8 years ago

I am seeing a 26% false discovery rate (FDR) for indels only on DREAM dataset 3. I ran the tool with default parameters using the "paired variant calling" command from the documentation. For evaluation I used only calls in the output VCF marked "PASS" and labelled "Somatic". Figure 6 of the VarDict paper also shows roughly 16% FDR for SNVs and indels combined. This rate seems too high.

Is this FDR expected on DREAM dataset 3 when VarDict is run with the default parameters?
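
For anyone reproducing this evaluation, the call selection step (PASS, labelled somatic, indels only) could be sketched roughly as below. This is only an illustration: it assumes the paired VCF stores the somatic label in a `STATUS` INFO key with values such as `StrongSomatic` or `LikelySomatic`, which should be checked against your own output. The function name and the sample records are hypothetical.

```python
def select_somatic_indels(vcf_lines):
    """Yield (chrom, pos, ref, alt) for PASS, somatic-labelled indel records."""
    for line in vcf_lines:
        # Skip header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, _qual, flt, info = fields[:8]
        if flt != "PASS":
            continue
        # Parse INFO into a dict; flag-style keys get an empty value
        info_map = dict(
            item.split("=", 1) if "=" in item else (item, "")
            for item in info.split(";")
        )
        # Assumption: the somatic label lives in the STATUS INFO key
        if "Somatic" not in info_map.get("STATUS", ""):
            continue
        # Treat a record as an indel when REF and ALT differ in length
        # (assumes a single ALT allele per record)
        if len(ref) == len(alt):
            continue
        yield chrom, int(pos), ref, alt


# Hypothetical records, for illustration only
records = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tAT\t50\tPASS\tSTATUS=StrongSomatic",
    "1\t200\t.\tG\tC\t50\tPASS\tSTATUS=StrongSomatic",
    "1\t300\t.\tCTT\tC\t50\tPASS\tSTATUS=Germline",
    "1\t400\t.\tGA\tG\t50\tp8\tSTATUS=LikelySomatic",
]
selected = list(select_somatic_indels(records))
print(selected)
```

Only the first record survives here: the second is an SNV, the third is not labelled somatic, and the fourth does not have a PASS filter.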

chapmanb commented 8 years ago

Giuseppe; Thanks for looking into this -- it's always great to have more people working on validations. Here are my validations from running VarDict in bcbio:

DREAM synthetic 4: http://imgur.com/a/gqzwm
DREAM synthetic 3: http://imgur.com/a/qba5k

I'm not running with fully default parameters: I loosen some of them to improve sensitivity, then apply a post-filter (https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/vardict.py#L267) to reduce the false positives this introduces. I could run a side-by-side comparison with and without these changes to get a sense of baseline results without any tweaks, if that would help the work you're doing. Thanks again for the discussion and testing.

gnarzisi commented 8 years ago

Thank you Brad. The 26% indel FDR is actually for DREAM set 4 (not 3 as mentioned in my previous post). The sensitivity I get is 62%. In any case, based on your benchmarking, adjusting the p-value threshold (the -P parameter) should remove most of the false positives. I'll give it a try.
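
For reference, the two metrics quoted in this thread can be computed from a call set and a truth set as in the sketch below. It assumes variants are compared by exact (chrom, pos, ref, alt) match, which is a simplification: indels in particular may need normalization before comparison. The function name and the toy data are hypothetical and are not the DREAM results.

```python
def evaluate(calls, truth):
    """Return (fdr, sensitivity) for a set of calls against a truth set."""
    calls, truth = set(calls), set(truth)
    tp = len(calls & truth)   # called and in the truth set
    fp = len(calls - truth)   # called but not in the truth set
    fn = len(truth - calls)   # in the truth set but not called
    fdr = fp / (tp + fp) if calls else 0.0          # FDR = FP / (TP + FP)
    sensitivity = tp / (tp + fn) if truth else 0.0  # TP / (TP + FN)
    return fdr, sensitivity


# Toy example: 4 true indels, 3 calls of which 2 are correct
truth = [("1", 100, "A", "AT"), ("1", 250, "CTT", "C"),
         ("2", 500, "G", "GAA"), ("3", 900, "TA", "T")]
calls = [("1", 100, "A", "AT"), ("2", 500, "G", "GAA"),
         ("4", 42, "C", "CG")]
fdr, sensitivity = evaluate(calls, truth)
print(f"FDR={fdr:.2f} sensitivity={sensitivity:.2f}")
```

Tightening a filter typically trades the two against each other, so it helps to report both numbers for each parameter setting.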

AstraZeneca-NGS / VarDict, issue #21