We present G4mismatch, a convolutional neural network for the prediction of DNA G-quadruplex (G4) mismatch scores. We couple G4mismatch with a scanner capable of detecting potential G4-forming sequences in any given input sequence.
These instructions will get you a copy of the project up and running on your local machine for development and testing purposes. See deployment for notes on how to deploy the project on a live system.
The supplied models were implemented using TensorFlow 2.3.
To run G4mismatch from the command line, first change into its directory. G4mismatch requires several input arguments:
python G4mismatch.py \
-tp <path to training set coordinate file> \
-op <path for output directory> \
-gp <path to relevant reference genome>
Use python G4mismatch.py --help to view the complete list of input arguments.
The input to G4mismatch is a tab-delimited file with 5 columns (no headers): chromosome, start, end, mismatch score and strand (- for forward strand, + for reverse strand, according to the G4-seq convention). An example of how the G4-seq human data is prepared for training is given in prep_data.sh.
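As a quick sanity check before training, the 5-column layout described above can be validated with a few lines of Python. This is only a minimal sketch and not part of G4mismatch: the names `G4Record` and `read_g4_records` are hypothetical, and the two example rows are made-up values used purely to illustrate the format.

```python
import csv
import io
from typing import List, NamedTuple


class G4Record(NamedTuple):
    """One row of the headerless, tab-delimited input file (hypothetical helper type)."""
    chrom: str
    start: int
    end: int
    mismatch: float
    strand: str  # "-" = forward, "+" = reverse (G4-seq convention, as stated above)


def read_g4_records(handle) -> List[G4Record]:
    """Parse a 5-column tab-delimited file and fail loudly on malformed rows."""
    records = []
    for line_number, row in enumerate(csv.reader(handle, delimiter="\t"), start=1):
        if not row:
            continue  # tolerate blank lines
        if len(row) != 5:
            raise ValueError(f"line {line_number}: expected 5 columns, got {len(row)}")
        chrom, start, end, mismatch, strand = row
        if strand not in {"+", "-"}:
            raise ValueError(f"line {line_number}: unexpected strand {strand!r}")
        records.append(G4Record(chrom, int(start), int(end), float(mismatch), strand))
    return records


# Made-up rows, for illustration only.
example = "chr1\t100\t115\t23.5\t+\nchr2\t200\t215\t4.0\t-\n"
records = read_g4_records(io.StringIO(example))
print(len(records), "records parsed")
```

To check a real file, pass an open file handle instead of the `io.StringIO` object, for example `read_g4_records(open(path, newline=""))`.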
To use a trained G4mismatch model for the detection of potential G4-forming sequences, run:
python G4mismatch_scan.py \
-dp <path to fasta file> \
-mp <path to pre-trained G4mismatch model> \
-of <path to directory for output files> \
-m 1 \
-fb 14.4 \
-bs 128
To get the mismatch score for a full sequence, drop the -fb argument.
The G4-seq data used for G4mismatch training is available at: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE110582
For training G4mismatch, human chromosome 2 was used for validation, chromosome 1 was a held-out test set, and the remaining chromosomes were used for training.
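The chromosome-based split described above can be sketched as follows. This is an illustrative sketch rather than the code used by G4mismatch: the function name `split_by_chromosome` is hypothetical, and the `chr1`/`chr2` naming is an assumption that should be adjusted to match the chromosome labels in your coordinate file.

```python
from typing import Dict, Iterable, List, Sequence

# Assumed chromosome labels; change these if your file uses "1" and "2" instead.
TEST_CHROM = "chr1"        # held-out test set
VALIDATION_CHROM = "chr2"  # validation set


def split_by_chromosome(rows: Iterable[Sequence]) -> Dict[str, List[Sequence]]:
    """Assign each row to train/validation/test based on its first column."""
    splits: Dict[str, List[Sequence]] = {"train": [], "validation": [], "test": []}
    for row in rows:
        chrom = row[0]
        if chrom == TEST_CHROM:
            splits["test"].append(row)
        elif chrom == VALIDATION_CHROM:
            splits["validation"].append(row)
        else:
            splits["train"].append(row)  # every other chromosome
    return splits


# Made-up rows in the 5-column input layout, for illustration only.
rows = [
    ("chr1", 100, 115, 23.5, "+"),
    ("chr2", 200, 215, 4.0, "-"),
    ("chr3", 300, 315, 12.0, "+"),
    ("chrX", 400, 415, 7.5, "-"),
]
splits = split_by_chromosome(rows)
print({name: len(subset) for name, subset in splits.items()})
```

Splitting by whole chromosome, rather than by random rows, keeps overlapping or neighbouring genomic regions from leaking between the training and evaluation sets.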