dantaki / SV2

Support Vector Structural Variation Genotyper

stringent de novo filter explanation #21

Closed justme66 closed 6 years ago

justme66 commented 6 years ago

Hi, I chose three tutorial samples as a parents-child trio to test the de novo filter. I set up trio.ped like this:

    HG00096 HG00096 HG01051 HG00268 1
    HG00096 HG01051 0 0 1
    HG00096 HG00268 0 0 2

I copied the HG00096, HG01051, and HG00268 features to the directory sv2_features/ and generated a genotype matrix with the following command:

    sv2 -feats sv2_features/ -v 1kgp_phase3_tutorial.vcf -p trio.ped -o trio_sv2

I have some questions about the output:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00268 HG01051

1 16151682 YL_CN_FIN_64 . 29 PASS END=16155632;SVTYPE=DEL;SVLEN=3951;DENOVO_FILTER=PASS;REF_GTL=17;AF=0.167;CYTOBAND=1p36.21;REPEATMASKER=NA,0;1000G_ID=YL_CN_FIN_64;1000G_OVERLAP=1.00;DESCRIPTION=1:16151682-16155632_deletion;GENES=intergenic;ABPARTS=0;CENTROMERE=0;GAP=0;SEGDUP=0;STR=0.178;UNMAPPABLE=0.006 GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL 0/0:2.28:0.00:0.00:0.72:7:nan:0:19:0.99,0.01,0.00 0/1:1.02:0.06:0.02:0.26:7:nan:0:29:0.00,0.99,0.00 0/0:1.63:0.00:0.00:0.72:42:0.62:37:15:0.97,0.03,0.00

1) QUAL is the median Phred-adjusted ALT genotype likelihood score. In the record above, is QUAL=29 derived from the median of 19, 29, and 15 and then adjusted?

2) Does FILTER=PASS mean that 29 > 8, where 8 is the ALT cutoff in supplementary Table S4?

3) When I want to find de novo SVs in a trio, should I put only the parents and child into one genotyped VCF, as in the output above? Does INFO/DENOVO_FILTER=PASS mean that both REF_GTL and QUAL exceed the REF and ALT cutoffs in Table S4? If so, the child genotype here is 0/0 and the parents are 0/1 and 0/0, which should not be a de novo SV. My key question is: which parameters should I use to extract the de novo SVs?

4) REF_GTL is the median REF genotype likelihood score. Is this score also adjusted, and is it computed from the samples listed in one genotyped VCF? If I combine different samples in one VCF, QUAL and REF_GTL will change, right?

Thanks for this brilliant tool.

justme66 commented 6 years ago
  5. When the de novo filter is applied, how is the .ped input file used? Does that mean I can put many samples from different families into one genotyped VCF, as long as the .ped file clearly describes the pedigree of each family or trio?
dantaki commented 6 years ago
  1. QUAL is the median Phred-adjusted likelihood score of the ALT genotypes. Only one sample carries an ALT allele (HG00268, genotyped 0/1), so the score is 29.

  2. DENOVO_FILTER is a more stringent filter than FILTER, designed for detecting de novo SVs. If it's PASS, the variant is confidently genotyped. This comes at the cost of decreased sensitivity.

Technically, if a child is 0/0 and the parents are 0/1 and 0/0, the mutation is de novo, because reversions to the ancestral allele are possible. De novo mutations are technically Mendelian errors. However, if you are interested in mutations that are not reversions and/or are medically relevant, you pretty much want to parse out all calls where the child is 0/1 and both parents are 0/0.
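To make point 1 concrete, here is a minimal Python sketch that recomputes the two medians from the per-sample SQ values of the record in the question. It assumes that QUAL and REF_GTL are plain medians of SQ over the ALT and REF genotypes respectively; that reading matches the numbers in this record, but it is not taken from the SV2 source. The helper name `median_sq` is made up for this illustration.

```python
from statistics import median

# FORMAT string and per-sample fields copied from the record in the question.
FORMAT = "GT:CN:PE:SR:SC:NS:HA:NH:SQ:GL"
SAMPLES = {
    "HG00096": "0/0:2.28:0.00:0.00:0.72:7:nan:0:19:0.99,0.01,0.00",
    "HG00268": "0/1:1.02:0.06:0.02:0.26:7:nan:0:29:0.00,0.99,0.00",
    "HG01051": "0/0:1.63:0.00:0.00:0.72:42:0.62:37:15:0.97,0.03,0.00",
}


def median_sq(samples, fmt, want_alt):
    """Median SQ over samples that carry (want_alt=True) or lack an ALT allele.

    Hypothetical helper: assumes the summary score is a plain median of SQ.
    """
    keys = fmt.split(":")
    scores = []
    for fields in samples.values():
        rec = dict(zip(keys, fields.split(":")))
        has_alt = "1" in rec["GT"].replace("|", "/").split("/")
        if has_alt == want_alt:
            scores.append(float(rec["SQ"]))
    return median(scores) if scores else None


alt_median = median_sq(SAMPLES, FORMAT, want_alt=True)   # only HG00268 is 0/1
ref_median = median_sq(SAMPLES, FORMAT, want_alt=False)  # HG00096 and HG01051
print(alt_median)  # 29.0, matching QUAL=29
print(ref_median)  # 17.0, matching REF_GTL=17
```

Because both values are medians over whichever samples are in the VCF, adding or removing samples can shift them.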

  4. Yes. It's Phred-adjusted and will be more accurate the more samples you provide.

  5. The ped file is mainly used to match IDs in BAMs and VCFs and to determine the gender. SV2 does not flag putative de novos, but provides a filter for them. You'll have to write a script to parse out the de novo calls or use plink to get the Mendelian errors.
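As a rough illustration of the kind of script meant here, the Python sketch below keeps records with DENOVO_FILTER=PASS in INFO where the child carries an ALT allele and both parents are 0/0. It is not part of SV2. The function names are made up, the ped layout (family, individual, father, mother, sex) follows the trio.ped in the question, and the VCF lines are trimmed, partly hypothetical examples (the second and third records are invented).

```python
def parse_ped(lines):
    """Map child ID -> (father ID, mother ID) for rows that name both parents."""
    trios = {}
    for line in lines:
        cols = line.split()
        if len(cols) < 4 or line.startswith("#"):
            continue
        _family, child, father, mother = cols[:4]
        if father != "0" and mother != "0":
            trios[child] = (father, mother)
    return trios


def alleles(sample_field):
    """Return the GT alleles of a sample column (GT is the first FORMAT key)."""
    return sample_field.split(":")[0].replace("|", "/").split("/")


def denovo_candidates(vcf_lines, trios):
    """Yield (chrom, pos, id, child) where the child is ALT and both parents are 0/0."""
    header = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        cols = line.split("\t")
        if line.startswith("#"):
            header = cols  # the #CHROM line carries the sample names
            continue
        if header is None:
            raise ValueError("VCF record found before the #CHROM header line")
        if "DENOVO_FILTER=PASS" not in cols[7].split(";"):
            continue
        gts = {name: alleles(cols[i]) for i, name in enumerate(header) if i >= 9}
        for child, (father, mother) in trios.items():
            if not all(s in gts for s in (child, father, mother)):
                continue
            child_alt = any(a not in ("0", ".") for a in gts[child])
            parents_ref = all(a == "0" for a in gts[father] + gts[mother])
            if child_alt and parents_ref:
                yield cols[0], int(cols[1]), cols[2], child


PED = [
    "HG00096 HG00096 HG01051 HG00268 1",
    "HG00096 HG01051 0 0 1",
    "HG00096 HG00268 0 0 2",
]
VCF = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00268\tHG01051",
    # Genotypes from the question: child 0/0, one parent 0/1 -> not reported.
    "1\t16151682\tYL_CN_FIN_64\t.\t<DEL>\t29\tPASS\tEND=16155632;DENOVO_FILTER=PASS\tGT\t0/0\t0/1\t0/0",
    # Hypothetical record: child 0/1, both parents 0/0 -> reported.
    "1\t20000000\texample_denovo\t.\t<DEL>\t30\tPASS\tEND=20005000;DENOVO_FILTER=PASS\tGT\t0/1\t0/0\t0/0",
    # Hypothetical record that fails the stringent filter -> skipped.
    "1\t30000000\texample_fail\t.\t<DEL>\t12\tPASS\tEND=30002000;DENOVO_FILTER=FAIL\tGT\t0/1\t0/0\t0/0",
]

calls = list(denovo_candidates(VCF, parse_ped(PED)))
print(calls)  # [('1', 20000000, 'example_denovo', 'HG00096')]
```

Because trios are looked up per child from the ped rows, the same loop handles a VCF that holds several families, provided every trio member appears as a sample column.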