bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License

SEQC2: Some high confidence SNVs and INDELs in VCF are outside of regions defined by High-Confidence_Regions_v1.2.bed #116

Closed luederm closed 1 year ago

luederm commented 1 year ago

I downloaded high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz and high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz from the FTP site but noticed that some of the variants are not in the regions defined by High-Confidence_Regions_v1.2.bed. This led to issues when I compared my results (after filtering using the supplied BED file) to the HC reference call set.

bcftools view -T ^High-Confidence_Regions_v1.2.bed high-confidence_sSNV_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs:    113
Number of INDELs:  0
Number of MNPs:    0
Number of others:  0
Number of sites:   113

bcftools view --targets-overlap 0 -T ^High-Confidence_Regions_v1.2.bed high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs:    0
Number of INDELs:  320
Number of MNPs:    0
Number of others:  0
Number of sites:   320

bcftools view --targets-overlap 1 -T ^High-Confidence_Regions_v1.2.bed high-confidence_sINDEL_in_HC_regions_v1.2.vcf.gz | bcftools +counts
Number of samples: 0
Number of SNPs:    0
Number of INDELs:  297
Number of MNPs:    0
Number of others:  0
Number of sites:   297
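
For anyone reproducing this check without bcftools, here is a minimal Python sketch of the coordinate logic involved. It assumes BED intervals are 0-based and half-open while VCF POS is 1-based, and it models only two overlap modes: testing the POS base alone, or testing the whole REF span. The function names (`load_bed`, `in_regions`) and the toy coordinates are illustrative and not part of bcftools or SomaticSeq. Treating the two modes as analogues of `--targets-overlap 0` and `1` is an assumption based on the bcftools documentation, not something confirmed in this thread.

```python
def load_bed(lines):
    """Parse BED lines into {chrom: [(start, end), ...]} (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and UCSC header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def in_regions(regions, chrom, pos, ref, mode="pos"):
    """Return True if a VCF record falls inside any BED region.

    pos  : 1-based VCF POS
    ref  : REF allele (its length gives the span of the record)
    mode : "pos"    -> only the POS base is tested (assumed analogue of overlap 0)
           "record" -> the whole REF span is tested (assumed analogue of overlap 1)
    """
    start0 = pos - 1  # convert 1-based POS to a 0-based start
    end0 = start0 + (1 if mode == "pos" else len(ref))
    # Linear scan is fine for a sketch; use an interval tree for real BED files.
    return any(s < end0 and start0 < e for s, e in regions.get(chrom, []))


if __name__ == "__main__":
    # Toy region: covers 1-based positions 101..200 on chr1 (hypothetical data).
    bed = load_bed(["chr1\t100\t200"])

    # SNV on the last base of the region is inside; one base later is outside.
    print(in_regions(bed, "chr1", 200, "A"))  # True
    print(in_regions(bed, "chr1", 201, "A"))  # False

    # Deletion starting before the region whose REF span reaches into it.
    print(in_regions(bed, "chr1", 98, "ACGT", mode="pos"))     # False
    print(in_regions(bed, "chr1", 98, "ACGT", mode="record"))  # True
```

Under these assumptions, an indel whose POS lies just before a region but whose REF allele extends into it counts as outside in the POS-only mode and inside in the record mode. That is one plausible reason the number of excluded-region INDELs drops from 320 to 297 between the two commands above.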

litaifang commented 1 year ago

Thanks for pointing that out. Some calls outside the high-confidence regions were left in those files. I'll make a note of that in the README and release corrected versions of the high-confidence sSNV and sINDEL files.

litaifang commented 1 year ago

SEQC2 has released an update. The README.md there explains the differences and the reasons for them: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/seqc/Somatic_Mutation_WG/release/latest/