bayesomicslab / ONT-nonb-GoFAE-DND

Deep statistical model for predicting non-B DNA structures from ONT sequencing
3 stars 0 forks source link

ONT-nonb-GoFAE-DND

Deep statistical model for predicting non-B DNA structures from ONT sequencing

What is non-B DNA?

Non-canonical (or non-B) DNA are genomic regions whose three-dimensional conformation deviates from the canonical double helix. Non-B DNA play an important role in basic cellular processes and are associated with genomic instability, gene regulation, and oncogenesis. Experimental methods are low-throughput and can detect only a limited set of non-B DNA structures, while computational methods rely on non-B DNA base motifs, which are necessary but not sufficient indicators of non-B structures. The DNA conformations that we study here are the following:

Given the dramatic increase in genome-scale data produced using ONT platforms, and in particular ultra-long sequencing data that supports telomere-to-telomere level genome assembly, we sought to develop a parallel strategy for identifying non-B DNA structure by their effects on sequencing speeds (translocation times) in ONT devices (Ni et al., 2019, Liu et al., 2019, McIntyre et al., 2019, Stoiber et al., 2017).

In SMRT technology, the sequencing speed is determined by polymerization (Liu et al., 2018). However, ONT devices record a measurement of current at a predefined sampling rate and then aggregate the measurements into strides, which are the smallest length of measurement accepted by the basecaller and represent a single base translocation (An introduction to the concept of events and strides).

Cloning the Project and Creating the Environment

These instructions assume that you have miniconda or anaconda installed. To begin, clone the GoFAE-DND project. In a command line with git in your path:

$ git clone https://github.com/bayesomicslab/ONT-nonb-GoFAE-DND.git

Then, change directory into the ONT-nonb-GoFAE-DND/ folder and create and activate the conda environment.

$ conda env create -f GoFAE-DND/gofaednd/gofaednd_env.yml
$ conda activate gofaednd_env

Running GoFAE-DND

To run the GoFAE-DND, change directory and run the Main.py file.

$ cd GoFAE-DND/gofaednd
$ python Main.py --help

Though, to understand how to run the GoFAE-DND, we recommend executing the following code which generates simulated translocation times and then runs the GoFAE-DND method along with classifiers and novelty detectors.

Simulations

1. Make simulated windows

This is an example of command that simulate 100,000 B-DNA and 1,000 non-B DNA windows for G-quadruples and Short Tandem Repeat. You must have matplotlib, numpy, and pandas installed. In a command line with python in your path, change directory into simulator/ and then run the simulator, which specifies the amount of non-B and B DNA samples. This may take a couple minutes:

 $ cd simulator
 $ python simulator.py -nb 10000 -b 1000000 -s ../simulated_data 

2. Run GoFAE-DND on simulated data

All python dependencies for the GoFAE-DND are in gofaednd_env.yml

$ cd ../GoFAE-DND/gofaednd
$ python Main.py --data_type=simulated --sim_data_path=../../simulated_data/ --config=1 --nonb_type=G_Quadruplex_Motif --nonb_ratio=0.1 --n_z=64 --num_projections=64 --epochs=25 --fdr_level=0.2 --discriminative_weight=35 --lambda_alpha=0.5 
$ python Main.py --data_type=simulated --sim_data_path=../../simulated_data/ --config=2 --nonb_type=Short_tandem_repeat --nonb_ratio=0.1 --n_z=64 --num_projections=64 --epochs=25 --fdr_level=0.2 --discriminative_weight=35 --lambda_alpha=0.5 

Note: to get a list of all parameters, run the command:

$ python Main.py --help

The results including spreadsheets of metrics and diagnostic images are included in the root-level results/ directory.

3. Run novelty detection methods

To run the other methods, we will need to install seaborn and h5py first.

$ conda install seaborn
$ conda install h5py

Afterwards, we can run isolation forests, local outlier factors, and one class SVMs.

  $ cd ../../novelty_detectors
  $ python isolation_forest.py -W ignore -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000 
  $ python local_outlier_factor.py -W ignore -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000 
  $ python svm_one_class.py -W ignore -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000   

4. Run classifiers

$ cd ../classifiers
$ python -W ignore svc.py -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000
$ python -W ignore random_forest.py -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000
$ python -W ignore nearest_neighbors.py -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000
$ python -W ignore logistic_regression.py -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000
$ python -W ignore gaussian_process.py -d sim -f ../simulated_data/ -r ../results/ -nb 20000 -b 200000

The results for both novelty detectors and classifiers can be found in the root-level results directory.

//: # ()

//: # ()

Experimental data

Instructions for downloading the data and GitHub repo

1) Create or change directory where you want to reproduce the results.

2) Download and extract the data.

   First, we download the Nanopore fast5 file from: 
   https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR15058166&display=metadata
   ### Note: this may be slow as the FAST5 files total over 4TB 
   For convenience, we provide a download.sh script in the script folder that will 
   download, unpack, and clean up the directory.

    ```
    $ sh download.sh
    ```

Processing Reads

Sequence bases are called from the raw ONT current using Albacore, which generates an event table that describes the DNA context in the nanopore (Loman et al., 2015).

Albacore Basecalling:

 $ read_fast5_basecaller.py -f FLO-PRO002 -k SQK-LSK109 --input $path/na12878/fast5/single/ --save_path $path/na12878/fast5/albacore_single/ --output_format fastq,fast5 -t 48 --recursive --config r941_450bps_linear_prom.cfg

Subsequently, we re-squiggle the FAST5 output of Albacore using Tombo, a statistical method that detects base modifications in nanopore current signal (Stoiber et al., 2017). Briefly, the re-squiggling algorithm segments the raw current signal into events and calls nucleotide bases using the current and a reference genome for correcting spurious variation (Figures, top and middle).

The Tombo segmentation provides current measurements at the base-level, unlike Albacore, which assumes the block stride attribute remains fixed, which enables the computation of translocation times (Figures, bottom and bottom). For each position on the Tombo-mapped reads, we compute the time duration in seconds as the ratio of the number of current measurements to the ONT sampling rate.

Tombo re-squiggle:

$ tombo resquiggle $path/workspace/pass/ hg38.fa --dna --overwrite --basecall-group Basecall_1D_001 --include-event-stdev --failed-reads-filename $path/workspace/pass/tombo_failed_reads.txt --processes 48

Prepare the windows

Step 1:

Extract motifs postions from non-B DNA DB.

Step 2:

Fix windows of length x around motifs.

Step 3:

Extend the positions on the opposite strand.

Step 4:

Find high quality reads that fall on the windows.

Step 5:

Find motif free regions.

Step 6:

Compute translocation signal on the non-overlapping windows.