fasterius / VarClust

A Python package for clustering of single nucleotide variants from high-throughput sequencing data.

VarClust

License: MIT

VarClust is a Python package that performs clustering of high-throughput sequencing (HTS) data using single nucleotide variants (SNVs). VarClust analyses variants stored in VCF files, which are the output of variant callers such as the Genome Analysis Toolkit (GATK). While VarClust was developed for the analysis of single-cell RNA sequencing (scRNA-seq) data, any variant data stored in VCF files may be analysed, regardless of whether it originates from DNA- or RNA-based methods.

Installation

VarClust can be installed from GitHub using pip:

pip install git+https://github.com/fasterius/VarClust

Usage

While VarClust is a Python package and may thus be utilised as such (i.e. by importing it and using each included function as desired), its main interface is through the command line. It has five modules, each performing a separate function: creation of SNV profiles, calculation of genetic distance matrices, aggregation of specified profiles into "pseudo-profiles" and, lastly, clustering using either hierarchical agglomerative clustering (HAC) or t-distributed stochastic neighbour embedding (tSNE).

A brief guide on how to use each of VarClust's modules is provided here, but additional details can be accessed by passing the -h or --help flag after the command. A number of parameters may be changed according to your specific needs, such as using only a specific subset of variants for the distance calculation (e.g. excluding variants present in the dbSNP database or those that do not pass some quality threshold).

Given a directory of single-sample VCF files, the first step is to create an SNV profile for each file. This can be done using the following command, but it requires that each filename (minus the .vcf or .vcf.gz suffix) is identical to the sample name in the VCF. For example, a file named sample_1.vcf must contain the sample sample_1.

varclust_create_profiles <VCF directory> <output profile directory>

(Keep in mind that the command-line version of VarClust can only create profiles for single-sample VCF files, whether that sample is a single cell or bulk sequencing, following the naming scheme mentioned above. The lower-level Python module has functions for dealing with multi-sample VCFs, though, so you are free to use those if they are more suitable for your needs.)
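Before creating profiles, it can be useful to check that a directory follows the naming scheme. The following is a minimal sketch and not part of VarClust: the function names are hypothetical, and it relies only on the standard VCF layout, in which sample names follow the FORMAT column on the #CHROM header line.

```python
import gzip
from pathlib import Path


def sample_name_from_filename(path):
    """Strip the .vcf or .vcf.gz suffix to get the expected sample name."""
    name = Path(path).name
    for suffix in (".vcf.gz", ".vcf"):
        if name.endswith(suffix):
            return name[: -len(suffix)]
    raise ValueError(f"Not a VCF file: {name}")


def samples_in_vcf(path):
    """Return the sample names listed on the #CHROM header line of a VCF."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # Eight fixed columns, then FORMAT; sample columns follow.
                return line.rstrip("\n").split("\t")[9:]
    return []


def check_directory(vcf_dir):
    """Map each VCF filename to True if it holds exactly the expected sample."""
    results = {}
    for path in sorted(Path(vcf_dir).iterdir()):
        if path.name.endswith((".vcf", ".vcf.gz")):
            expected = sample_name_from_filename(path)
            results[path.name] = samples_in_vcf(path) == [expected]
    return results
```

Any file that maps to False is either a multi-sample VCF or has a sample name that differs from its filename.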

The next step is to create a pairwise distance matrix describing the genetic similarity between every pair of samples:

varclust_distance_matrix <profile directory> <output distance matrix path>
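To illustrate what a pairwise distance matrix over SNV profiles looks like, here is a minimal sketch. It uses the Jaccard distance as a stand-in: the metric that VarClust itself computes is not described in this guide, so neither the metric nor the function names below should be read as VarClust's implementation. Each profile is assumed to be a set of (chromosome, position, reference, alternative) tuples.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(profiles):
    """Build a symmetric sample-by-sample distance matrix as nested dicts."""
    # Start with the diagonal: every sample is at distance 0 from itself.
    matrix = {name: {name: 0.0} for name in profiles}
    for first, second in combinations(profiles, 2):
        distance = jaccard_distance(profiles[first], profiles[second])
        matrix[first][second] = distance
        matrix[second][first] = distance
    return matrix
```

Samples that share all of their variants get a distance of 0, and samples that share none get a distance of 1.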

Clustering using either HAC or tSNE may then be performed using the resulting distance matrix and, where required, a metadata file:

varclust_heatmap <distance matrix> <output figure path>
varclust_tsne <distance matrix> <metadata file> <output figure path>
              -M <metadata ID col> -c <colour col> -s <shape col>

The metadata file must contain at least (1) an ID column corresponding to the sample IDs used to create the distance matrix, and (2) a grouping column that will be used for e.g. clustering or colouring the groups.
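These two requirements can be checked before plotting. The sketch below is not part of VarClust: the function name is hypothetical, and it assumes the metadata is delimited text with a header row, which this guide does not specify.

```python
import csv
import io


def check_metadata(metadata_text, sample_ids, id_col, group_col, delimiter=","):
    """Return a list of problems; an empty list means the metadata looks usable."""
    reader = csv.DictReader(io.StringIO(metadata_text), delimiter=delimiter)
    columns = reader.fieldnames or []
    problems = [
        f"missing column: {col}" for col in (id_col, group_col) if col not in columns
    ]
    if problems:
        return problems
    # Every sample in the distance matrix needs a metadata row.
    known_ids = {row[id_col] for row in reader}
    for sample in sample_ids:
        if sample not in known_ids:
            problems.append(f"sample not in metadata: {sample}")
    return problems
```

Pass the file contents as a string, the sample IDs from the distance matrix, and the names of the ID and grouping columns you intend to give to the plotting command.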

The variants in the profiles may also be aggregated into pseudo-profiles, which enumerate the variants together with the number of times they occur in the included profiles:

varclust_pseudo <profile directory> <output pseudo-profile path>
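The idea behind this aggregation can be sketched in a few lines. This is an illustration under one reading of the description above, namely that a pseudo-profile records the number of profiles in which each variant occurs. It is not VarClust's implementation, and both the function name and the representation of variants as tuples are assumptions.

```python
from collections import Counter


def build_pseudo_profile(profiles):
    """Count, for each variant, the number of profiles in which it occurs."""
    counts = Counter()
    for variants in profiles:
        # set() guards against a variant being listed twice in one profile.
        counts.update(set(variants))
    return counts
```

A variant found in every included profile then has a count equal to the number of profiles, while a variant private to one sample has a count of 1.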

Citation

If you are using VarClust to analyse your data, please cite the following article:

Single-cell RNA-seq variant analysis for exploration of genetic heterogeneity in cancer
Fasterius E., Uhlén M. and Al-Khalili Szigyarto C.
Scientific Reports (2019), 9(1), 1–11
https://doi.org/10.1038/s41598-019-45934-1

Licence

VarClust is released under the MIT licence. VarClust is free software: you may redistribute it and/or modify it under the terms of the MIT licence. For more information, please see the LICENCE file that comes with the package.