This project is for CSE 185. It implements a subset of VarScan's mpileup2snp command
and finds SNPs within a given aligned genome supplied as an mpileup file. See VarScan for more details.
REQUIREMENTS | INSTALLATION | BASIC USAGE | OPTIONAL | File formats
These packages can be installed with pip and brew:
brew install mpich
pip install scipy
Note: If you do not have root access, you can run the pip command above with the --user option to install locally:
pip install --user scipy
pip install mVarScan
Note: If you run into an externally-managed-environment error while pip installing mVarScan, you can create a Python virtual environment, then install and use mVarScan there with the following commands:
python -m venv ~/myenv # create a new python env
source ~/myenv/bin/activate # activate myenv
pip install mVarScan
python -m mVarScan -h # test mVarScan installation
and when you're done using it:
deactivate
rm -rf ~/myenv # delete myenv
The basic usage of mVarScan is:
python mVarScan [options] [mpileup]
-o --out FILENAME (file to output contents to)
-t --tab (1 for yes) (output using TAB formatting, default: 0)
-m --min-var-frequency FREQUENCY (minimum frequency to call a non-reference mutation, default: 0.2)
-h --min-freq-for-hom FREQUENCY (minimum frequency to call a non-reference homozygous mutation, default: 0.8)
-p --pvalue FLOAT (p-value threshold to output SNP, default: 0.99)
-r2 --min-reads2 INT (minimum supporting reads at a position to call variants, default: 2)
-c --min-coverage INT (minimum read depth at a position to make a call, default: 8)
-q --min-avg-qual INT (minimum average base quality at a position to count a read, default: 15)
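To illustrate how the frequency, read-count, and coverage thresholds above could interact, here is a minimal sketch. It is not mVarScan's actual implementation: the function name `call_genotype` is hypothetical, the choice of inclusive comparisons is an assumption, and the p-value and base-quality filters are left out.

```python
def call_genotype(var_reads, coverage,
                  min_var_freq=0.2, min_freq_for_hom=0.8,
                  min_reads2=2, min_coverage=8):
    """Return '1/1', '0/1', or None (no call) for a single position.

    Defaults mirror the documented option defaults; whether the real tool
    uses >= or > at each threshold is an assumption of this sketch.
    """
    # Not enough depth or too few supporting reads: no call.
    if coverage < min_coverage or var_reads < min_reads2:
        return None
    freq = var_reads / coverage
    # Variant allele frequency is too low to call a mutation.
    if freq < min_var_freq:
        return None
    # High-frequency variants are treated as homozygous non-reference.
    return "1/1" if freq >= min_freq_for_hom else "0/1"
```

With the defaults, 44 variant reads out of 44 would be classified as homozygous, while 5 out of 20 would be classified as heterozygous.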
mpileup
An mpileup file is a tab-delimited text file with no header, traditionally generated by samtools mpileup. It contains 6 columns:
chromosome position reference_base coverage read_bases read_qualities [optional extra columns]
The chromosome column has the form chr[name], where [name] is the chromosome number.
Example:
chr6 128405804 T 22 ...................... DE:EFFImEJIJJIJ>JJIJHF
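As a rough illustration of the six-column layout, the sketch below splits one mpileup line into named fields and computes an average base quality. The names `PileupRecord`, `parse_mpileup_line`, and `average_phred` are hypothetical, and the Phred+33 quality encoding is an assumption; none of this is taken from mVarScan's source.

```python
from typing import NamedTuple


class PileupRecord(NamedTuple):
    chrom: str
    pos: int
    ref: str
    coverage: int
    bases: str
    quals: str


def parse_mpileup_line(line):
    """Split one tab-delimited mpileup line into its first six columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 columns, got {len(fields)}")
    # Any optional extra columns after the sixth are ignored here.
    chrom, pos, ref, coverage, bases, quals = fields[:6]
    return PileupRecord(chrom, int(pos), ref, int(coverage), bases, quals)


def average_phred(quals, offset=33):
    """Average quality score, assuming Phred+33 ASCII encoding."""
    if not quals:
        return 0.0
    return sum(ord(ch) - offset for ch in quals) / len(quals)


# The example line above, with explicit tab separators.
example = "\t".join(
    ["chr6", "128405804", "T", "22", "." * 22, "DE:EFFImEJIJJIJ>JJIJHF"]
)
record = parse_mpileup_line(example)
print(record.chrom, record.pos, record.coverage, average_phred(record.quals))
```

A position would then be kept or skipped by comparing the average against the --min-avg-qual threshold.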
tab
A tab file is a tab-delimited text file that is a modified VCF format. It includes similar columns, but differs in what it displays. Below is the header used:
#CHROM POS REF ALT SAMPLE [other samples]
The #CHROM column has the form chr[name], where [name] is the chromosome number.
Example:
chr6 128414945 c T 1/1:44,44:38.63636363636363:1.0:7.619481455868034e-26
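The SAMPLE column packs several values into one colon-separated field. The sketch below unpacks it, assuming the order is genotype, supporting reads and coverage, average base quality, frequency, and p-value. That order is inferred from comparing the example above with the regular output and is not confirmed by the tool's documentation; `parse_sample_field` is a hypothetical helper.

```python
def parse_sample_field(field):
    """Unpack one SAMPLE value from the tab output into a dict.

    Assumed layout: genotype:reads,coverage:avg_qual:frequency:p_value
    """
    genotype, counts, avg_qual, freq, p_value = field.split(":")
    reads, coverage = (int(x) for x in counts.split(","))
    return {
        "genotype": genotype,
        "reads": reads,
        "coverage": coverage,
        "avg_qual": float(avg_qual),
        "frequency": float(freq),
        "p_value": float(p_value),
    }


sample = parse_sample_field(
    "1/1:44,44:38.63636363636363:1.0:7.619481455868034e-26"
)
print(sample["genotype"], sample["reads"], sample["coverage"])
```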
regular output
Without the --tab option, information about each SNP is printed to the terminal in clearly labeled fields, as shown below:
Chromosome:position | Sample # | homozygous_status | ref_base -> variant_base | frequency | p-value | reads, coverage | average base quality |
Example:
chr6:128414945 | Sample 1 | 1/1 | c -> T | frequency 1.00 | p-value 7.619481455868034e-26 | reads 44,44 | avg base quality 38.63636363636363|
The --min-avg-qual option sets the minimum average Phred quality score for bases to be considered in variant calling. Phred quality scores are a common metric in sequencing data quality control, indicating the probability of a base call being incorrect: a score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 means a 1% chance that the base call is wrong. For more information about Phred scores, refer to this link.
This repository was generated by Andrew Bigelow, Aditya Parmar, and Numaan Formoli.