Nanopore sequencing is based on the principal of isolating a nanopore in a membrane separating buffered salt solutions, then applying a voltage across the membrane and monitoring the ionic current through the nanopore. The Oxford Nanopore Technologies (ONT) MinION sequences DNA by recording the ionic current as DNA strands are enzymatically guided through the nanopore. SignalAlign will align the ionic current from the MinION to a reference sequence using a trainable hidden Markov model (HMM). The emissions model for the HMM can either be the table of parametric normal distributions provided by ONT or a hierarchical Dirichlet process (HDP) mixture of normal distributions. The HDP models enable mapping of methylated bases to your reference sequence.
Given the installation is often long, tedious and somewhat fragile. So, we recommend using Docker to run signalAlign.
ucscbailey/signalalign:latest
We do not store any test data within the docker image so, in order to test the docker image, clone the repo and run the following command from within the signalAlign home directory.
docker run -it -v "$(pwd)":/data ucscbailey/signalalign:latest run --config tests/docker_runSignalAlign-config.json
There is an installation script which works and has been tested on a clean ubuntu18.08 server from aws.
install embed_fast5
git clone --recursive https://github.com/adbailey4/embed_fast5.git
cd embed_fast5
python3.7 -m pip install .
python3.7 -m pytest
Install signalAlign
git clone --recursive https://github.com/UCSC-nanopore-cgl/signalAlign.git
cd signalAlign && mkdir -p build && cd build
cmake .. -DCMAKE_INSTALL_PREFIX=. -DCMAKE_VERBOSE_MAKEFILE=ON -DBUILD_SHARED_LIBS=ON -DCMAKE_BUILD_TYPE=RELEASE
make -j 4
cd ..
python3.7 -m pip install .
python3.7 -m pytest
runSignalAlign.py run --config tests/runSignalAlign-config.json
{
"signal_alignment_args": {
"target_regions": null,
deprecated
"track_memory_usage": false,
deprecated
"threshold": 0.01,
probablity threshold for outputing an alignment between kmer and event
"event_table": null,
specify location of event table in fast5 (deprecated)
"embed": false,
embed signalalign data into fast5 (not recommened)
"delete_tmp": true,
delete temorary folders (highly recommened)
"output_format": "full"
specify output format "full", "assignments", "variantCaller" (see Output Formats)
},
"samples": [
{
"positions_file": "path/to/file.positions",
locations to call mods (see Positions File Specification)
"fast5_dirs": ["/path/to/fast5/"],
path to fast5 files
"bwa_reference": "/path/to/reference.fa",
path to reference
"fofns": [],
deprecated
"readdb": "/path/to/index.readdb",
path to readdb/index file
"fw_reference": null,
deprecated
"bw_reference": null,
deprecated
"kmers_from_reference": false,
deprecated
"motifs": null,
deprecated
"name": "wild_type",
name of sample
"alignment_file": "/path/to/alignment.bam",
path to alignment file
"recursive": false,
deprecated
},
],
"path_to_bin": "/path/to/signalalign/bin",
path to signalAlign bin where executables are stored
"complement_hdp_model": null,
path to complement_hdp_model
"template_hdp_model": null,
path to template_hdp_model
"complement_hmm_model": null,
path to complement_hmm_model
"template_hmm_model": "path/to/model.model",
path to template_hmm_model
"job_count": 96,
number of threads / jobs to use
"debug": false,
if things are failing set this to true
"two_d": false,
if 2d reads set to true
"output_dir": "/path/to/output",
path to output directory
"perform_kmer_event_alignment": true,
force pre alignment between kmers and events
"overwrite": true,
overwrite previous runs
"rna": true,
set to true if rna
"ambig_model": path/to/ambig.model",
ambiguous model (see Ambiguous Model Specification
"built_alignments": null,
deprecated
}
There are three output formats. full
, variantCaller
, and assignments
. Each read that aligns to the target regions (if you specified that option) will come as a separate file.
full
has the following tab-separated-format:
Contig | Reference Index | Reference k-mer | Read File | Strand | Event Index | Event Mean | Event Noise | Event Duration | Aligned k-mer | Scaled Mean Current | Scaled Noise | Posterior Probability | Descaled Event Mean | Model (ONT) Mean | Path k-mer |
---|
variantCaller
has the following tab-separated format:
Event Index | Reference Position | Base | Posterior Probability | Strand | Forward Mapped | Read File |
---|
finally, assignments
has the following tab-separated format:
k-mer | Read File | Descaled Event Mean | Posterior Probability |
---|
Tsv file with the columns seen below. Used to change the bwa_reference
characters to ambiguous characters defined
in both the ambig model and hmm/hdp model. see Example
contig (str) | 0_based_ref_index (int) | strand (str) | ref_char (str) | replacement_char (str) |
---|
Tsv file with the columns seen below. If a character in the find
column is found in the reference, the model
will create branch points for each of the replace
characters. This model allows you to specify a set of arbitrary characters
to represent a set of arbitrary modifications or variants. Note: Must be consistent with a HMM model.
find (str) | replace (str) |
---|
Required
tests/trainModels-config.json
{
"signal_alignment_args": {
SignalAlign arguments to be used for each sample
"target_regions": null,
If you want to specify regions of the genome to process
"track_memory_usage": false,
Option to print memory usage statistics
"threshold": 0.01,
Minimum output threshold for SignalAlign
"event_table": false,
Select specific path to Basecalled Event table
"embed": false,
If set, will embed output into fast5 and run the maximum expected accuracy algorithm
"delete_tmp": false
Will delete temporary folders after each run. (False is recommended so the temp files do not need to be generated each round of training)
},
"samples": [
List of samples to process. If you have a control and an experiment you can create 2 separate samples
{
"positions_file": null,
Way of selecting find and replace characters for reference genome
"fast5_dirs": ["./tests/minion_test_reads/one_R9_canonical_ecoli"],
List of directories to search for Fast5s
"bwa_reference": "./tests/test_sequences/E.coli_K12.fasta",
Path to unedited genome
"readdb": null,
read_id and fast5 path mapping
"fw_reference": null,
Forward reference with all characters replaced correctly
"bw_reference": null,
Backward reference with all characters replaced correctly
"kmers_from_reference": false,
Use only kmers from reference (Recommended when not using Human Genome)
"motifs": null,
Motif find and replace must be in specific list within a list format. eg: [['CCAGG', 'CEAGG'], ['CCTGG', 'CETGG']]
"name": "canonical",
Name of your sample
"probability_threshold": 0.8,
Minimum threshold for including event-kmer mapping in training data for HDP
"number_of_kmer_assignments": 10,
Number of assignments to create per kmer
"alignment_file": null,
BAM file of aligned reads
"recursive": false,
Search fast5_dirs recursively
"assignments_dir": null
Directory where assignments have been created. (Used for traingin HDP)
}
],
"hdp_args": {
"grid_start": 30.0,
Minimum pA coverage by HDP distribution
"grid_end": 120.0,
Maximum pA coverage by HDP distribution
"grid_length": 1200,
Number of points that cover the distribution
"base_alpha": 1.0,
HDP priors
"base_beta": 1.0,
"base_gamma": 1.0,
"middle_alpha": 1.0,
"middle_beta": 1.0,
"middle_gamma": 1.0,
"leaf_alpha": 1.0,
"leaf_beta": 1.0,
"leaf_gamma": 1.0,
"thinning": 100,
"gibbs_samples": 1000,
Number of draws to sample to create distribution
"burnin_multiplier": 32,
Helps calculate number of burnin samples to make
"hdp_type": "singleLevelFixedCanonical",
"threshold": 0.0,
"number_of_assignments": 100,
Number of assignments for each kmer
"built_alignments": null
If Alignments were already built
},
"transitions_args": {
"training_bases": 10000,
Number of bases to use for getting transition expectations
"iterations": 2,
number of iterations
"test": false
If set to true, will check if total probablity increases with each iteration
},
"training": {
"transitions": false,
"normal_emissions": true,
"hdp_emissions": true,
"expectation_maximization": false,
"em_iterations": 3
},
"path_to_bin": "./bin",
"complement_hdp_model": null,
"template_hdp_model": null,
"complement_hmm_model": "./models/testModelR9_5mer_acgt_complement.model",
"template_hmm_model": "./models/testModelR9_5mer_acgt_template.model",
"job_count": 2,
number of processes to run
"debug": false,
Will fail if any reads fail
"two_d": false,
"output_dir": "./",
"constraint_trim": null,
Number of anchor pairs to trim on either side of an anchor block
"diagonal_expansion": null,
Size of diagonal to calculate posterior probabilities
"traceBackDiagonals": 100,
Number of diagonals to calculate before using an estimate for total probability
"filter_reads": 7
Minimum average phred quality score for a read
}
("singleLevelFixed", 0),
("singleLevelPrior", 1),
("multisetFixed", 2),
("multisetPrior", 3),
("compFixed", 4),
("compPrior", 5),
("middleNtsFixed", 6),
("middleNtsPrior", 7),
("groupMultisetFixed", 8),
("groupMultisetPrior", 9),
("singleLevelPrior2", 10),
("multisetPrior2", 11),
("singleLevelFixedCanonical", 14),
("singleLevelFixedM6A", 15),
("singleLevelPrior2", 10),
("multisetPrior2", 11),
("singleLevelFixedCanonical", 14)
("multisetPriorEcoli", 12),
("singleLevelPriorEcoli", 13),
F = m6A
("singleLevelFixedM6A", 15),
template_trained.hmm
template HMM with trained parameterscomplement_trained.hmm
complement HMM with trained parameters if 2D
If HDPs were used, a copy of the input HDP will also be here, they have .nhdp
suffix.