
Sample Contamination Estimate from VCF (SCE-VCF)

SCE-VCF is a simple tool to evaluate sample contamination from sequencing experiments. It uses WGS/WES VCF files to compute a few metrics useful for detecting potential sample contamination: CHARR, heterozygosity rate, and the fraction of heterozygous calls with inconsistent allele balance (see the output columns below).

The ideal scenario is a cohort analysis, so that one can compare the value distributions across samples to identify outliers. Even a small cohort of 50 samples worked well in our tests.

Example distributions for contamination values based on 1000G 30X WGS samples and variants identified using either GATK or DeepVariant are provided in the example folder.

Installation

On most Linux based systems, you can just download the pre-compiled executable from the latest release and it should work out of the box.

Compile from source

Compiling the executable with nim requires the hts-nim and argparse packages.

You can usually install them with nimble install hts argparse.

Then you can clone the repository and compile the executable with nim c -d:release src/sceVCF.nim.
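
For example, a complete from-source build might look like this (the repository URL is assumed from the project name):

```bash
# Install the Nim dependencies, clone the repository, and build in release mode
nimble install hts argparse
git clone https://github.com/HTGenomeAnalysisUnit/SCE-VCF.git
cd SCE-VCF
nim c -d:release src/sceVCF.nim
```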

Generate a statically linked executable

If you have singularity installed, you can generate a statically linked executable with the following command:

bash nim_compile.sh

See [here](https://github.com/brentp/hts-nim#static-builds) for more details on how this works.

Usage

Usage:
  sceVCF [options] [vcf ...]

Arguments:
  [vcf ...]        input VCF/BCF file(s). Glob pattern allowed

Options:
  -h, --help
  -o, --out=OUT              Output file (TSV). If not provided output to stdout
  -f, --ad_field=AD_FIELD    FORMAT field containing the AD values (default: AD)
  -a, --af_field=AF_FIELD    INFO field containing the allele frequency (default: AF)
  -i, --is_refAF             AF in the af_field is the REF AF
  -t, --refaf_limit=REFAF_LIMIT
                             Limits of REF AF. Comma-separated lower and upper limit (default: 0.1,0.9)
  -l, --het_ab_limit=HET_AB_LIMIT
                             Comma separated min and max allele balance accepted for het calls (default: 0.25,0.75)
  -q, --min_GQ=MIN_GQ        Min GQ for hom var to be included in charr computation (default: 20)
  -d, --dp_limit=DP_LIMIT    Limits of genotype DP. Comma-separated lower and upper limit (default: 20,100)
  -s, --samples=SAMPLES      Restrict analysis to the given samples. Comma separated list or file with one sample per line
  -r, --region=REGION        Specify genomic region for processing. Format is chr[:start-end]. Comma-separated list of regions or file with 1 region per line.
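
For example, a typical invocation (file and region names are placeholders) could be:

```bash
# Estimate contamination for all samples in a cohort VCF, restricting the
# analysis to chromosome 20 and writing the per-sample report to a TSV file
sceVCF -r chr20 -o contamination.tsv cohort.vcf.gz
```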

How it works

  1. Multi-allelic variants and variants with REF AF outside the configured limits (refaf_limit) are skipped.
  2. Any genotype with DP outside the configured limits (dp_limit) is skipped.
  3. Genotypes with GQ below min_GQ are used only to compute the inconsistent het rate and are discarded when computing het rate and CHARR.
  4. CHARR is computed as ADref / (AFref * ADtot), considering only high-quality hom alt genotypes (a toy example follows the list).
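
As a toy illustration of the CHARR term for one genotype, with made-up numbers:

```bash
# Toy CHARR contribution of a single hom-alt genotype (illustrative numbers):
# AD=2,38 -> ADref=2, ADtot=40; site REF allele frequency AFref=0.6
echo "2 40 0.6" | awk '{ printf "%.3f\n", $1 / ($3 * $2) }'   # prints 0.083
```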

Suggested workflow

If you are processing a cohort VCF that contains enough samples to reliably estimate the cohort allele frequency, and this value is annotated in the INFO column (usually AF), then you can use your file directly to estimate contamination in each sample.

NB. Decomposing the file with bcftools norm may lead to unexpected results, since variants derived from multi-allelic splitting would be included in the computation. We suggest instead selecting only biallelic SNPs (for example with bcftools view -f PASS,. -m2 -M2 -v snps) and then computing the contamination metrics on this subset, as sketched below.
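
A minimal sketch of this cohort workflow (file names are placeholders):

```bash
# Keep only PASS biallelic SNPs, then estimate contamination using the
# cohort AF already present in the INFO field
bcftools view -f PASS,. -m2 -M2 -v snps cohort.vcf.gz -Ob -o cohort.snps.bcf
bcftools index cohort.snps.bcf
sceVCF -o cohort_contamination.tsv cohort.snps.bcf
```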

If you are working with a single-sample VCF or a small cohort, we suggest proceeding as follows (a combined sketch follows the list):

  1. Select only PASS bi-allelic SNPs using bcftools view -f PASS,. -m2 -M2 -v snps
  2. Annotate the resulting file with AF from a large reference population. For example, you can rapidly annotate global AF from gnomAD using echtvar (see the official repo)
  3. Perform contamination estimation
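
A minimal end-to-end sketch of the three steps above; the echtvar invocation, the annotation archive name, and the resulting AF field name (gnomad_af) are assumptions, so check the echtvar documentation for the exact syntax:

```bash
# 1. Keep only PASS biallelic SNPs
bcftools view -f PASS,. -m2 -M2 -v snps sample.vcf.gz -Ob -o sample.snps.bcf

# 2. Annotate gnomAD allele frequencies with echtvar
#    (archive and field names depend on how the echtvar file was built)
echtvar anno -e gnomad.v3.echtvar.zip sample.snps.bcf sample.snps.gnomad.bcf

# 3. Estimate contamination, pointing sceVCF at the annotated AF field
sceVCF -a gnomad_af -o sample_contamination.tsv sample.snps.gnomad.bcf
```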

Output columns

| Column | Description |
|---|---|
| SAMPLE | Sample ID as read from the input VCF file |
| HQ_HOM | Number of high-quality homozygous sites used in the CHARR computation |
| HQ_HOM_RATE | Fraction of HQ hom sites across all hom sites with enough coverage |
| HQ_HET | Number of high-quality heterozygous sites used to compute het rate |
| HQ_HET_RATE | Fraction of HQ het sites across all het sites with enough coverage |
| CHARR | CHARR value computed as explained above using only HQ hom alt sites |
| MEAN_REF_AB_HOM_ALT | Mean REF allele balance observed in HQ hom alt sites |
| HETEROZYGOSITY_RATE | Heterozygosity rate (Nhet / (Nhet + Nhom)) computed using only HQ genotypes |
| INCONSISTENT_AB_HET_RATE | Fraction of het sites with AB outside the het_ab_limit thresholds, across all het sites with enough coverage (this includes all het sites with DP above threshold, disregarding GQ) |

Interpretation of results

If you have analyzed a large enough cohort, or you have a cohort processed with a similar variant-calling pipeline, we suggest identifying contaminated samples by considering the distribution of CHARR, HETEROZYGOSITY_RATE, and INCONSISTENT_AB_HET_RATE values across all individuals. For example, one can mark as possibly contaminated the outliers with values larger than the mean + 2 or 3 SD.
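
As a rough sketch of this outlier flagging, assuming the TSV column order shown in the output table (CHARR in column 6):

```bash
# Print samples whose CHARR exceeds mean + 3 SD across the cohort
awk 'NR > 1 { sample[NR-1] = $1; charr[NR-1] = $6; sum += $6 }
     END {
       n = NR - 1; mean = sum / n
       for (i = 1; i <= n; i++) ss += (charr[i] - mean) ^ 2
       sd = sqrt(ss / n)
       for (i = 1; i <= n; i++)
         if (charr[i] > mean + 3 * sd) print sample[i], charr[i]
     }' cohort_contamination.tsv
```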

When evaluating a single sample, a CHARR value of 0.02-0.03 may indicate subtle contamination (1-5%), and a value larger than 0.03 is strongly suggestive of contamination (> 5%). However, we suggest evaluating all three metrics jointly when deciding whether a sample is contaminated. When a sample is heavily contaminated (> 20%), the CHARR metric may look normal, but you can observe high values for HETEROZYGOSITY_RATE (>= 0.7) and INCONSISTENT_AB_HET_RATE (>= 0.1).

In the end, we suggest the following thresholds for flagging contaminated samples: CHARR > 0.03, HETEROZYGOSITY_RATE >= 0.7, or INCONSISTENT_AB_HET_RATE >= 0.1.

This approach classifies samples similarly to the traditional freemix approach, but it is more robust to contamination of different magnitudes and more sensitive to subtle contamination.

Simulation results

To simulate contamination, we generated mixed samples using two samples from the GIAB reference datasets. Briefly, we generated a 30X BAM file for the GIAB sample NA24631 and then mixed it with various proportions of reads from NA24385. See the results below and the plots in the example folder.
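
One possible way to build such a mixture with samtools (the subsampling fractions and file names are illustrative, and this is not necessarily the exact procedure used for these results):

```bash
# Build a ~95% / 5% read mixture of two 30X GIAB BAMs
samtools view -b -s 0.95 NA24631.30x.bam -o NA24631.sub.bam
samtools view -b -s 0.05 NA24385.30x.bam -o NA24385.sub.bam
samtools merge mixed_0.95_0.05.bam NA24631.sub.bam NA24385.sub.bam
```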

| SAMPLE | HQ_HOM | HQ_HOM_RATE | HQ_HET | HQ_HET_RATE | CHARR | MEAN_REF_AB_HOM_ALT | HETEROZYGOSITY_RATE | INCONSISTENT_AB_HET_RATE |
|---|---|---|---|---|---|---|---|---|
| original_sample | 1101063 | 1.00000 | 1516921 | 0.96191 | 0.01144 | 0.00371 | 0.57942 | 0.00999 |
| cont_0.99_0.01 | 1092683 | 1.00000 | 1515890 | 0.96113 | 0.01669 | 0.00598 | 0.58112 | 0.01068 |
| cont_0.98_0.02 | 1079498 | 1.00000 | 1514989 | 0.95938 | 0.02134 | 0.00794 | 0.58393 | 0.01249 |
| cont_0.97_0.03 | 1062034 | 1.00000 | 1514340 | 0.95648 | 0.02543 | 0.00963 | 0.58778 | 0.01580 |
| cont_0.96_0.04 | 1040781 | 1.00000 | 1514274 | 0.95240 | 0.02899 | 0.01107 | 0.59266 | 0.02079 |
| cont_0.95_0.05 | 1027437 | 1.00000 | 1512849 | 0.94977 | 0.03044 | 0.01164 | 0.59554 | 0.02412 |
| cont_0.85_0.15 | 741203 | 1.00000 | 1578589 | 0.86162 | 0.04422 | 0.01578 | 0.68049 | 0.15782 |
| cont_0.9_0.1 | 874280 | 1.00000 | 1532181 | 0.90654 | 0.04179 | 0.01559 | 0.63669 | 0.08385 |
| cont_0.8_0.2 | 635342 | 1.00000 | 1652251 | 0.82522 | 0.04192 | 0.01431 | 0.72227 | 0.22597 |
| cont_0.7_0.3 | 505642 | 1.00000 | 1832859 | 0.78729 | 0.03092 | 0.00972 | 0.78378 | 0.29380 |
| cont_0.5_0.5 | 409987 | 1.00000 | 2103835 | 0.78209 | 0.01383 | 0.00371 | 0.83691 | 0.29484 |

Running time

Running time on a single-sample WGS VCF containing about 4M variants is ~25 seconds. Analysis of the full 1000G cohort VCF, containing 2,504 samples and about 120M variants, takes ~20h.

Notes