Zilong-Li / vcfpp

a C++ API of htslib to be easily integrated and safely used. More importantly, it can be callled seamlessly in R/Python/Julia etc.
https://doi.org/10.1093/bioinformatics/btae049
MIT License
23 stars 3 forks source link
bioinformatics genomics htslib population-genetics variant-analysis vcf

+TITLE: vcfpp: a single C++ file for manipulating VCF/BCF

[[https://github.com/Zilong-Li/vcfpp/actions/workflows/linux.yml/badge.svg]] [[https://github.com/Zilong-Li/vcfpp/actions/workflows/mac.yml/badge.svg]] [[https://zilongli.org/proj/vcfpp/index.html][https://img.shields.io/badge/Documentation-latest-blue.svg]] [[https://github.com/Zilong-Li/vcfpp/releases/latest][https://img.shields.io/github/v/release/Zilong-Li/vcfpp.svg]] [[https://img.shields.io/github/license/Zilong-Li/vcfpp?style=plastic.svg]]

This project introduces vcfpp (vcf plus plus), a single C++ file as interface to the basic =htslib=, which can be easily included in a C++ program for scripting high-performance genomic analyses. Check out [[https://github.com/Zilong-Li/vcfppR][vcfppR]] as an example of how vcfpp.h can facilitate rapidly writing high-performance R package.

Features:

  1. install [[https://github.com/samtools/htslib][htslib]] on your system
  2. download the released [[https://github.com/Zilong-Li/vcfpp/releases/latest][vcfpp.h]]
  3. put =vcfpp.h= in the same folder as your cpp source file or a folder say =/my/writable/path/= or the system path

The documentation of API is [[https://zilongli.org/proj/vcfpp/index.html][here.]]

** Reading VCF

In this example, we count the number of heterozygous genotypes for each sample in all records. You can paste the example code into a =example.cpp= file and compile it by =g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts=. You can replace =-I.= with =-I/my/writable/path/= if you put =vcfpp.h= there.

+begin_src C++

include "vcfpp.h"

using namespace std; using namespace vcfpp; int main(int argc, char argv[]) { // read data from 1000 genomes server BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz"); BcfRecord var(vcf.header); // construct a variant record vector gt; // genotype can be bool, char or int type vector hetsum(vcf.nsamples, 0); while (vcf.getNextVariant(var)) { var.getGenotypes(gt); if (!var.isSNP() || !var.isNoneMissing()) continue; assert(var.ploidy()==2); // make sure it is diploidy for(int i = 0; i < gt.size() / 2; i++) hetsum[i] += abs(gt[2 i + 0] - gt[2 * i + 1]); } for (auto i : hetsum) { cout << i << endl; } return 0; }

+end_src

** Genotypes Coding

There are 3 types used for genotypes, ie. =vector=, =vector= and =vector=. One can use =vector= and =vector= for memory-effcient goal. The downside is that it only stores 0 and 1. And =vector= can store missing values and multialleles.

*** Code genotypes with missing allele as heterozygous.

If you use =vector= and =vector= to store the genotypes, then there is no way to represent missing values. Hence the returned genotypes always have 0s and 1s. And genotypes with missing allele (eg. =0/.=, =./0=, =1/.=, =./1=, =./.=) are codes as =1/0=. It's recommended to use =var.isNoneMissing()= to check if there is missing value.

*** Code missing allele as -9

If this default behavior for =vector= and =vector= is not what you want, you should use =vector= to store the genotypes, then any missing allele will be coded as =-9=. Note you should take the missing value =-9= into account for downstream analysis.

** Writing VCF

There are many ways for writing the VCF/BCF file.

*** Use an empty template

Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.

+begin_src C++

BcfWriter bw("out.bcf.gz", "VCF4.3"); bw.header.addFORMAT("GT", "1", "String", "Genotype"); bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)"); bw.header.addContig("chr20"); // add chromosome for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s); // add 3 samples bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");

+end_src

*** Use an existing VCF as template

In this example, we first read VCF file =test/test-vcf-read.vcf.gz=. Second, we construct a variant record and update the record with the input VCF. Third, we construct a BcfWriter object using the meta data in the header of the input VCF, writing out the header and the modified variant record.

+begin_src C++

BcfReader br("test/test.vcf.gz"); BcfWriter bw("out.vcf.gz", br.header); BcfRecord var(br.header); br.getNextVariant(var); bw.writeHeader(); var.setPOS(100001); // update the POS of the variant bw.writeRecord(var);

+end_src

In some cases where we want to modify the samples of a template VCF, we can associate the =BcfWriter= and =BcfRecord= with another modifiable header instead of the non-modifiable header as the above.

+begin_src C++

BcfReader br("test/test.vcf.gz"); BcfWriter bw("out.vcf.gz"); // copy header of template vcf and restrict to target sample bw.copyHeader("test/test.vcf.gz", "HG00097"); BcfRecord var(bw.header); br.getNextVariant(var); var.setPOS(100001); bw.writeRecord(var); // output only sample HG00097

+end_src

** Variants Operation

All variants related API can be found [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_record][BcfRecord]]. The commonly used are listed below.

+begin_src C++

BcfReader vcf("bcf.gz"); // construct a vcf reader BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header vcf.getNextVariant(var) // get next variant vector gt; // genotype can be bool, char or int type var.getGenotypes(gt), var.setGenotypes(gt); // get or set genotypes for current variant var.isNoneMissing(); // check if there is missing value after getting genotypes vector gq; // genotype quality usually is of int type var.getFORMAT("GQ",gq), var.setFORMAT("GQ",gq); // get or set a vector of genotypes quality vector pl; // Phred-scaled genotype likelihoods usually is of int type var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods float af; var.getINFO("AF", af), var.setINFO("AF", af); // get or set AF (allele frequency) value in INFO int mq; var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO vector dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases var.getINFO("DP4", dp4), var.setINFO("DP4", dp4); // get or set a vector of dp4 value from INFO var.isSNP(); // check if variant is SNP var.isSV(); // check if variant is SV var.isIndel(); // check if variant is indel var.isMultiAllelic(); // check if variant is MultiAllelic var.POS(), var.setPOS(); // get POS or modify POS

+end_src

** Header Operation

All variants related API can be found in [[https://zilongli.org/proj/vcfpp/classvcfpp_1_1_bcf_header][BcfHeader]].

Examples of vcfpp working with R are in folder [[Rcpp]] and https://github.com/Zilong-Li/vcfppR.

Examples of vcfpp working with Python are in folder [[Pybind11]].

Find more useful command line tools in folder [[tools]].