Authors: Benjamin Levy, Shuying Ni, Zihao Xu, Liyang Zhao
Predicting gene expression levels from upstream promoter regions using deep learning. Collaboration between IACS and Inari.
scripts/
: directory for production code
0-data-loading-processing/
:
01-gene-expression.py
: downloads and processes gene expression data and saves into "B73_genex.txt".02-download-process-db-data.py
: downloads and processes gene sequences from a specified database: 'Ensembl', 'Maize', 'Maize_addition', 'Refseq'03-combine-databases.py
: combines all the downloaded sequences within all the databases04a-merge-genex-maize_seq.py
:04b-merge-genex-b73.py
:05a-cluster-maize_seq.sh
: clusters the promoter sequences into groups with up to 80% sequence identity, which may be interpreted as paralogs05b-train-test-split.py
: divides the promoter sequences into train and test sets, avoiding a set of pairs that indicate close relations ("paralogs")06_transformer_preparation.py
:07_train_tokenizer.py
: training byte-level BPE for RoBERTa model1-modeling/
pretrain.py
: training the FLORABERT base using a masked language modeling task. Type python scripts/1-modeling/pretrain.py --help
to see command line options, including choice of dataset and whether to warmstart from a partially trained model. Note: not all options will be used by this script.finetune.py
: training the FLORABERT regression model (including newly initialized regression head) on multitask regression for gene expression in all 10 tissues. Type python scripts/1-modeling/finetune.py --help
to see command line options; mainly for specifying data inputs and output directory for saving model weights.evaluate.py
: computing metrics for the trained FLORABERT modelembedding_vis.py
: computing a sample of BERT embeddings for the testing data and saving to a tensorboard log. Can specify how many embeddings to sample with --num-embeddings XX
where XX
is the number of embeddings (must be integer).module/
: directory for our customized modules
module/
: our main module named florabert
that packages customized functions
config.py
: project-wide configuration settings and absolute paths to important directories/filesdataio.py
: utilities for performing I/O operations (reading and writing to/from files)gene_db_io.py
: helper functions to download and process gene sequencesmetrics.py
: functions for evaluating modelsnlp.py
: custom classes and functions for working with text/sequencestraining.py
: helper functions that make it easier to train models in PyTorch and with Huggingface's Trainer API, as well as custom optimizers and schedulerstransformers.py
: implementation of RoBERTa model with mean-pooling of final token embeddings, as well as functions for loading and working with Huggingface's transformers libraryutils.py
: General-purpose functions and codevisualization.py
: helper functions to perform random k-mer flip during data processing and make model predictionIf you wish to experiment with our pre-trained FLORABERT models, you can find the saved PyTorch models and the Huggingface tokenizer files here
Contents:
byte-level-bpe-tokenizer
: Files expected by a Huggingface transformers.PretrainedTokenizer
merges.txt
vocab.txt
transformers
library. The prediction model should instantiate our custom RobertaForSequenceClassificationMeanPool
model class
language-model
: Trained on all plant promoter sequenceslanguage-model-finetuned
: Further trained on just maize promoter sequencesprediction-model
: Fine-tuned on the multitask regression problem