OrensteinLab / G4mismatch

G4mismatch

We present G4mismatch, a convolutional neural network for the prediction of DNA G-quadruplex (G4) mismatch scores. We couple G4mismatch with a scanner capable of detecting potential G4-forming sequences in any given input sequence.

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.

Prerequisites

The supplied models were implemented using TensorFlow 2.3.

Usage

To run G4mismatch from the command line, first change into its directory. G4mismatch requires several input arguments:

python G4mismatch.py \
       -tp <path to training set coordinate file> \
       -op <path for output directory> \
       -gp <path to relevant reference genome>

Use python G4mismatch.py --help to view the complete list of input arguments. The input to G4mismatch is a tab-delimited file with 5 columns (no headers): chromosome, start, end, mismatch score and strand (- for forward strand, + for reverse strand, according to the G4-seq convention). An example of how the G4-seq human data is prepared for training is given in prep_data.sh.
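As a rough illustration of the input format described above, the following Python sketch parses and sanity-checks a five-column coordinate file. The function name read_mismatch_rows and the example rows are hypothetical and are not part of G4mismatch; the loading code in G4mismatch.py may well differ.

```python
import csv
import io


def read_mismatch_rows(handle):
    """Yield (chrom, start, end, score, strand) tuples from a headerless TSV.

    Hypothetical helper for illustration only; not part of G4mismatch.
    """
    reader = csv.reader(handle, delimiter="\t")
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(
                f"line {line_no}: expected 5 columns, got {len(row)}"
            )
        chrom, start, end, score, strand = row
        if strand not in {"+", "-"}:
            raise ValueError(f"line {line_no}: unexpected strand {strand!r}")
        yield chrom, int(start), int(end), float(score), strand


# Made-up example rows, in the column order described above.
example = "chr1\t100\t250\t12.5\t+\nchr2\t300\t450\t3.0\t-\n"
rows = list(read_mismatch_rows(io.StringIO(example)))
```

To check a real file, the same function can be given an open file handle instead of the in-memory example.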

To use a trained G4mismatch model for the detection of potential G4-forming sequences, run:

python G4mismatch_scan.py \
       -dp <path to fasta file> \
       -mp <path to pre-trained G4mismatch model> \
       -of <path to directory for output files> \
       -m 1 \
       -fb 14.4 \
       -bs 128

To get the mismatch score for a full sequence, drop the -fb argument.

Datasets

The G4-seq data used for G4mismatch training is available at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110582

For training G4mismatch, human chromosome 2 was used for validation, chromosome 1 was a held-out test set, and the remaining chromosomes were used for training.
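The chromosome-based split described above can be sketched as follows. This is a minimal illustration and not the authors' code: the function name split_by_chromosome, the chr1/chr2 naming style, and the example rows are all assumptions.

```python
def split_by_chromosome(rows, val_chrom="chr2", test_chrom="chr1"):
    """Partition rows into train/val/test by the chromosome in column 0.

    Hypothetical helper; the chromosome labels are assumed to look like
    "chr1", "chr2", and so on. Adjust them to match the actual data.
    """
    splits = {"train": [], "val": [], "test": []}
    for row in rows:
        chrom = row[0]
        if chrom == test_chrom:
            splits["test"].append(row)
        elif chrom == val_chrom:
            splits["val"].append(row)
        else:
            splits["train"].append(row)
    return splits


# Made-up rows: (chromosome, start, end, mismatch score, strand).
example_rows = [
    ("chr1", 100, 250, 12.5, "+"),
    ("chr2", 300, 450, 3.0, "-"),
    ("chr3", 10, 160, 7.1, "+"),
    ("chrX", 5, 155, 0.0, "-"),
]
splits = split_by_chromosome(example_rows)
```

Splitting by whole chromosome, rather than sampling rows at random, keeps every validation and test sequence on a chromosome the model never saw during training.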