arogozhnikov / demuxalot

Reliable, scalable, efficient demultiplexing for single-cell RNA sequencing
MIT License



Demuxalot

Reliable and efficient identification of genotypes for individual cells in RNA sequencing. Demuxalot refines its knowledge about genotypes directly from the data.

Demuxalot is fast and optimized to work with lots of genotypes, and it lets you efficiently reuse the information inferred from the data.

A preprint is available on bioRxiv.

Background

During single-cell RNA sequencing (scRNA-seq) we pool cells from different donors and process them together.

The drawback of pooling is that we no longer know which donor each cell came from. Demuxalot solves this: it infers the genotype of each cell by matching the reads coming from that cell against the donors' genotypes. This is called demultiplexing.
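
To make the idea of matching a cell's reads against genotypes concrete, here is a minimal toy sketch. It is not demuxalot's actual algorithm: the donor table, the reads, the error rate and the uniform prior over donors are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical toy data: probability that a read shows the ALT allele at each
# of three SNPs, per donor (0.0 = hom. ref, 0.5 = het, 1.0 = hom. alt).
donor_alt_prob = {
    "Donor1": [0.0, 0.5, 1.0],
    "Donor2": [1.0, 0.5, 0.0],
    "Donor3": [0.5, 0.0, 0.5],
}

# Reads observed for one cell barcode: (snp_index, read_shows_alt_allele)
cell_reads = [(0, False), (0, False), (2, True), (1, True), (2, True)]

ERROR_RATE = 0.01  # assumed sequencing error rate, keeps probabilities off 0 and 1


def log_likelihood(alt_probs, reads, error_rate=ERROR_RATE):
    """Log-probability of the reads if the cell belonged to this donor."""
    total = 0.0
    for snp, is_alt in reads:
        # Mix the genotype's ALT probability with the chance of a read error.
        p_alt = alt_probs[snp] * (1 - error_rate) + (1 - alt_probs[snp]) * error_rate
        total += math.log(p_alt if is_alt else 1 - p_alt)
    return total


def posteriors(donor_probs, reads):
    """Posterior over donors under a uniform prior (an assumption of this sketch)."""
    logliks = {d: log_likelihood(p, reads) for d, p in donor_probs.items()}
    top = max(logliks.values())  # subtract the maximum for numerical stability
    weights = {d: math.exp(v - top) for d, v in logliks.items()}
    norm = sum(weights.values())
    return {d: w / norm for d, w in weights.items()}


cell_posterior = posteriors(donor_alt_prob, cell_reads)
best_donor = max(cell_posterior, key=cell_posterior.get)
print(best_donor, cell_posterior)
```

With real data the same comparison runs over many thousands of SNPs and barcodes, which is why speed matters.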

Comparisons

Demuxalot shows high reliability, data efficiency and speed. Below is a benchmark on PBMC data with 32 donors from the preprint.

[Figure: benchmark screenshot from the preprint]

Known genotypes and refined genotypes: the tale of two scenarios

Typical approaches to get genotype-specific mutations are

Why is it worth refining genotypes?

An SNP array provides up to ~650k positions in the genome. Around 20-30% of them would be specific for a genotype (i.e. deviate from the majority).

Each genotype has around 10 times more SNVs (single nucleotide variations) that are not captured by the array. Some of these missing SNVs are very valuable for demultiplexing.
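
As a rough back-of-the-envelope check of these figures, the sketch below works out the implied counts. It assumes one particular reading of "10 times more", namely relative to the genotype-specific positions on the array; the text does not state the baseline explicitly.

```python
# Figures quoted in the text above.
array_positions = 650_000          # upper bound of positions on an SNP array
specific_percent = (20, 30)        # share of positions specific to a genotype

# Genotype-specific positions available from the array (integer arithmetic).
array_specific = tuple(array_positions * p // 100 for p in specific_percent)

# ASSUMPTION: "10 times more" is read as relative to the array-specific count.
uncaptured_factor = 10
uncaptured_snvs = tuple(n * uncaptured_factor for n in array_specific)

print(f"array-specific positions: {array_specific[0]:,} to {array_specific[1]:,}")
print(f"SNVs missed by the array: {uncaptured_snvs[0]:,} to {uncaptured_snvs[1]:,}")
```

Under that reading, most of the genotype-specific signal lies outside the array, which is the motivation for refining genotypes from the sequencing data itself.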

What's the special power of demuxalot?

Installation

Plain and simple:

pip install demuxalot # Requires python >= 3.8

Here are some common scenarios and how they are implemented in demuxalot. Also see the examples/ folder.

Running (simple scenario)

Only using provided genotypes

from demuxalot import Demultiplexer, BarcodeHandler, ProbabilisticGenotypes, count_snps

# Loading genotypes
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')

# Loading barcodes
barcode_handler = BarcodeHandler.from_file('path/to/barcodes.csv')

snps = count_snps(
    bamfile_location='path/to/sorted_alignments.bam',
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=barcode_handler, 
)

# returns two dataframes with likelihoods and posterior probabilities 
likelihoods, posterior_probabilities = Demultiplexer.predict_posteriors(
    snps,
    genotypes=genotypes,
    barcode_handler=barcode_handler,
)
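
A common next step is to turn the posterior probabilities into one donor call per barcode. The sketch below uses a small hand-made DataFrame as a stand-in for `posterior_probabilities`. The layout (one row per barcode, one column per genotype), the barcode strings and the `MIN_CONFIDENCE` threshold are assumptions for illustration, so check the actual output before relying on it.

```python
import pandas as pd

# Hypothetical stand-in for `posterior_probabilities`:
# rows are cell barcodes, columns are genotype names (assumed layout).
posterior_probabilities = pd.DataFrame(
    {
        "Donor1": [0.98, 0.10, 0.40],
        "Donor2": [0.01, 0.85, 0.35],
        "Donor3": [0.01, 0.05, 0.25],
    },
    index=["AAACCTG-1", "AAACGGG-1", "AAAGATG-1"],
)

MIN_CONFIDENCE = 0.9  # hypothetical threshold, tune for your data

# Most probable genotype per barcode and the probability attached to it.
best_donor = posterior_probabilities.idxmax(axis=1)
best_probability = posterior_probabilities.max(axis=1)

# Keep only confident calls; mark everything else as unassigned.
assignment = best_donor.where(best_probability >= MIN_CONFIDENCE, other="unassigned")
print(assignment)
```

Thresholding on the top posterior is a simple design choice: it trades the number of assigned cells against the purity of the assignments.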

Running (complex scenario)

Refinement of known genotypes is shown in a notebook, see examples/

Saving/loading genotypes

# You can always export learnt genotypes to be used later
refined_genotypes.save_betas('learnt_genotypes.parquet')

# Later: list the genotypes you want to load, then read the saved betas
refined_genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
refined_genotypes.add_prior_betas('learnt_genotypes.parquet')

Re-saving VCF genotypes with betas (recommended)

Loading the internal parquet-based format is much faster than parsing and validating VCF. It makes sense to export VCF to the internal format in two cases:

  1. when you plan to load it many times.
  2. when you want to 'accumulate' inferred information about genotypes from multiple scRNA-seq runs

genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_vcf('path/to/genotypes.vcf')
genotypes.save_betas('learnt_genotypes.parquet')

# later you can use it. 
genotypes = ProbabilisticGenotypes(genotype_names=['Donor1', 'Donor2', 'Donor3'])
genotypes.add_prior_betas('learnt_genotypes.parquet')
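
The method names `save_betas` and `add_prior_betas` suggest that genotypes are stored as Beta-distribution parameters. Assuming that reading, the toy sketch below shows how evidence could 'accumulate' across runs through a standard conjugate Beta update. This is a conceptual illustration only, not demuxalot's implementation, and the helper names and read counts are hypothetical.

```python
def update_beta(prior, alt_reads, ref_reads):
    """Conjugate update: add observed allele counts to Beta(alpha, beta) pseudo-counts."""
    alpha, beta = prior
    return (alpha + alt_reads, beta + ref_reads)


def mean_alt_frequency(params):
    """Expected ALT-allele frequency under Beta(alpha, beta)."""
    alpha, beta = params
    return alpha / (alpha + beta)


# One SNP for one donor, starting from a flat Beta(1, 1) prior (assumed).
prior = (1.0, 1.0)

# Hypothetical read counts assigned to this donor in two separate runs.
after_run1 = update_beta(prior, alt_reads=9, ref_reads=1)
after_run2 = update_beta(after_run1, alt_reads=18, ref_reads=2)

print(after_run1, round(mean_alt_frequency(after_run1), 3))
print(after_run2, round(mean_alt_frequency(after_run2), 3))
```

Because the posterior of one run serves as the prior of the next, saving the parameters after each run is enough to carry the accumulated evidence forward.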