fritzsedlazeck / SURVIVOR

Toolset for SV simulation, comparison and filtering
MIT License

SURVIVOR filter based on AF has no effect #96

Open oleraj opened 5 years ago

oleraj commented 5 years ago

Hi,

I have Manta VCFs for >100 individuals that I've merged using SURVIVOR merge with this command:

SURVIVOR merge sample_files.txt 1000 1 1 1 0 0 Manta_merged.vcf 

Then I tried to filter based on AF using this command:

SURVIVOR filter Manta_merged.vcf NA -1 -1 0.10 10 Manta_merged.filt.AF10.vcf

OR

SURVIVOR filter Manta_merged.vcf NA -1 -1 1.00 10 Manta_merged.filt.AF100.vcf

However, it seems there is no difference in the output. I think the filter is not working. Both have the same number of variants:

wc -l Manta_merged.filt.AF10.vcf Manta_merged.filt.AF100.vcf 
     64111 Manta_merged.filt.AF10.vcf
     64111 Manta_merged.filt.AF100.vcf
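Note that `wc -l` also counts the `#` header lines, so identical totals only show that header plus records add up to the same number. A minimal sketch for counting data records alone is below; the function name `count_records` and the example records are made up for illustration and are not from the files in this issue.

```python
import io

def count_records(lines):
    """Count VCF data records, skipping blank lines and '#' header lines."""
    return sum(1 for line in lines if line.strip() and not line.startswith("#"))

# Tiny in-memory example (hypothetical records, not the real data).
example = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SUPP=3\n"
    "chr1\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SUPP=1\n"
)
print(count_records(example))
```

With real files, the same function can be applied to an open file handle, e.g. `count_records(open("Manta_merged.filt.AF10.vcf"))`.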

I also noticed that SURVIVOR merge doesn't add an AF field to the VCF. Is it supposed to add this?

Header:

##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=BND,Description="Translocation">
##ALT=<ID=INS,Description="Insertion">
##INFO=<ID=CIEND,Number=2,Type=String,Description="PE confidence interval around END">
##INFO=<ID=CIPOS,Number=2,Type=String,Description="PE confidence interval around POS">
##INFO=<ID=CHR2,Number=1,Type=String,Description="Chromosome for END coordinate in case of a translocation">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the structural variant">
##INFO=<ID=MAPQ,Number=1,Type=Integer,Description="Median mapping quality of paired-ends">
##INFO=<ID=RE,Number=1,Type=Integer,Description="read support">
##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation">
##INFO=<ID=PRECISE,Number=0,Type=Flag,Description="Precise structural variation">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Length of the SV">
##INFO=<ID=SVMETHOD,Number=1,Type=String,Description="Method for generating this merged VCF file.">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of the SV.">
##INFO=<ID=SUPP_VEC,Number=1,Type=String,Description="Vector of supporting samples.">
##INFO=<ID=SUPP,Number=1,Type=String,Description="Number of samples supporting the variant">
##INFO=<ID=STRANDS,Number=1,Type=String,Description="Indicating the direction of the reads with respect to the type and breakpoint.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PSV,Number=1,Type=String,Description="Previous support vector">
##FORMAT=<ID=LN,Number=1,Type=Integer,Description="predicted length">
##FORMAT=<ID=DR,Number=2,Type=Integer,Description="# supporting reference,variant reads in that order">
##FORMAT=<ID=ST,Number=1,Type=String,Description="Strand of SVs">
##FORMAT=<ID=QV,Number=1,Type=String,Description="Quality values: if not defined a . otherwise the reported value.">
##FORMAT=<ID=TY,Number=1,Type=String,Description="Types">
##FORMAT=<ID=ID,Number=1,Type=String,Description="Variant ID from input.">
##FORMAT=<ID=RAL,Number=1,Type=String,Description="Reference allele sequence reported from input.">
##FORMAT=<ID=AAL,Number=1,Type=String,Description="Alternative allele sequence reported from input.">
##FORMAT=<ID=CO,Number=1,Type=String,Description="Coordinates">

Any other suggestions for filtering by AF?

Thanks!

Andrew

fritzsedlazeck commented 5 years ago

Hi Andrew, the easier one first: SURVIVOR filter looks for the AF tag, so it won't work if that tag is not there.

SURVIVOR merge currently does not add an AF tag to the VCF. It's a nice idea to include it, and the code is there. I just don't take the genotype into account, so it would be the frequency of samples supporting the variant. Would that be OK?

Thanks, Fritz
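To make the "frequency of samples" idea concrete, here is a minimal sketch of how such a value could be derived outside SURVIVOR. It assumes that `SUPP_VEC` is a string of `0`/`1` characters with one position per merged input file; the helper names (`parse_info`, `sample_frequency`, `add_af`) are hypothetical and this is not SURVIVOR's own code. A matching `##INFO=<ID=AF,...>` header line would also have to be added, and whether SURVIVOR filter would then accept the tag is not confirmed here.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def sample_frequency(info):
    """Fraction of samples supporting the variant, read from SUPP_VEC.

    Assumption: SUPP_VEC holds one '0' or '1' per merged input file.
    Genotypes are ignored, so this is a sample frequency, not a true
    allele frequency.
    """
    vec = parse_info(info)["SUPP_VEC"]
    return vec.count("1") / len(vec)

def add_af(record_line):
    """Return one record line with AF appended to INFO (no trailing newline)."""
    cols = record_line.rstrip("\n").split("\t")
    af = sample_frequency(cols[7])  # INFO is the 8th VCF column
    cols[7] = f"{cols[7]};AF={af:.4f}"
    return "\t".join(cols)

# Hypothetical record: supported by 2 of 4 merged samples.
record = "chr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SUPP=2;SUPP_VEC=0110"
print(add_af(record))
```

Because genotypes are not used, a homozygous and a heterozygous carrier contribute equally to this value.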

oleraj commented 5 years ago

Hi Fritz,

I'm not sure why you're not able to take genotype into account -- is it because the genotype calls from different SV callers are not consistent or trustworthy?

As an alternative, I'm thinking I could filter on the SUPP tag using bcftools. However, its data type is not declared correctly in the header: it says String but should be Integer, so bcftools can't use it for filtering:

##INFO=<ID=SUPP,Number=1,Type=String,Description="Number of samples supporting the variant">

Could you update the data type for SUPP and other tags in the header (e.g., CIEND, CIPOS) to Integer?

Thanks,

Andrew
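Until the header is changed upstream, one possible workaround is to rewrite the `SUPP` declaration and filter on the value directly. The sketch below is an illustration under stated assumptions, not part of SURVIVOR or bcftools: the function names and the threshold are hypothetical, and only the `SUPP` header line is touched. With the type declared as Integer, a bcftools expression along the lines of `bcftools view -i 'INFO/SUPP>=10'` should then also be usable.

```python
import re

def fix_supp_type(header_line):
    """Declare INFO/SUPP as Integer; every other header line is unchanged."""
    if header_line.startswith("##INFO=<ID=SUPP,"):
        return header_line.replace("Type=String", "Type=Integer", 1)
    return header_line

def supp_value(info):
    """Return SUPP as an int, or None if the tag is absent (SUPP_VEC is ignored)."""
    match = re.search(r"(?:^|;)SUPP=(\d+)(?:;|$)", info)
    return int(match.group(1)) if match else None

def filter_by_supp(lines, min_supp):
    """Yield header lines (with SUPP retyped) and records with SUPP >= min_supp."""
    for line in lines:
        if line.startswith("#"):
            yield fix_supp_type(line)
            continue
        supp = supp_value(line.rstrip("\n").split("\t")[7])
        if supp is not None and supp >= min_supp:
            yield line

# Hypothetical input: one header line and two records.
vcf_lines = [
    '##INFO=<ID=SUPP,Number=1,Type=String,Description="Number of samples supporting the variant">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\tsv1\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;SUPP=12;SUPP_VEC=1\n",
    "chr1\t900\tsv2\tN\t<INS>\t.\tPASS\tSVTYPE=INS;SUPP=3;SUPP_VEC=1\n",
]
for out_line in filter_by_supp(vcf_lines, min_supp=10):
    print(out_line, end="")
```

Dividing the threshold by the number of merged samples gives the equivalent sample-frequency cutoff, for example 10 of 100 samples corresponds to 0.10.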

fritzsedlazeck commented 5 years ago

I could, but some callers do not report it and some are not very robust. Maybe I just should... sorry for thinking out loud.

oleraj commented 5 years ago

No problem, that makes sense. For now I think the highest priority would be to update the type for these tags to Integer as I mentioned.