CodonTransformer

CodonTransformer: The ultimate tool for codon optimization, optimizing DNA sequences for heterologous protein expression across 164 species.
https://adibvafa.github.io/CodonTransformer
Apache License 2.0

Abstract

The genetic code is degenerate, allowing a multitude of possible DNA sequences to encode the same protein. This degeneracy impacts the efficiency of heterologous protein production due to the codon usage preferences of each organism. The process of tailoring organism-specific synonymous codons, known as codon optimization, must respect local sequence patterns that go beyond global codon preferences. As a result, the search space faces a combinatorial explosion that makes exhaustive exploration impossible. Nevertheless, throughout the diverse life on Earth, natural selection has already optimized these sequences, providing a rich source of data from which machine learning algorithms can learn the underlying rules. Here, we introduce CodonTransformer, a multispecies deep learning model trained on over 1 million DNA-protein pairs from 164 organisms spanning all kingdoms of life. The model demonstrates context-awareness thanks to the attention mechanism and bidirectionality of the Transformers we used, and to a novel sequence representation that combines organism, amino acid, and codon encodings. CodonTransformer generates host-specific DNA sequences with natural-like codon distribution profiles and with minimized negative cis-regulatory elements. This work introduces a novel strategy of Shared Token Representation and Encoding with Aligned Multi-masking (STREAM) and provides a state-of-the-art codon optimization framework with a customizable open-access model and a user-friendly interface.

Use Case

For an interactive demo, check out our Google Colab Notebook.

After installing CodonTransformer, you can use:

import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from CodonTransformer.CodonPrediction import predict_dna_sequence
from CodonTransformer.CodonJupyter import format_model_output
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("adibvafa/CodonTransformer")
model = BigBirdForMaskedLM.from_pretrained("adibvafa/CodonTransformer").to(device)

# Set your input data
protein = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG"
organism = "Escherichia coli general"

# Predict with CodonTransformer
output = predict_dna_sequence(
    protein=protein,
    organism=organism,
    device=device,
    tokenizer=tokenizer,
    model=model,
    attention_type="original_full",
    deterministic=True
)
print(format_model_output(output))

The output is:

-----------------------------
|          Organism         |
-----------------------------
Escherichia coli general

-----------------------------
|       Input Protein       |
-----------------------------
MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGG

-----------------------------
|      Processed Input      |
-----------------------------
M_UNK A_UNK L_UNK W_UNK M_UNK R_UNK L_UNK L_UNK P_UNK L_UNK L_UNK A_UNK L_UNK L_UNK A_UNK L_UNK W_UNK G_UNK P_UNK D_UNK P_UNK A_UNK A_UNK A_UNK F_UNK V_UNK N_UNK Q_UNK H_UNK L_UNK C_UNK G_UNK S_UNK H_UNK L_UNK V_UNK E_UNK A_UNK L_UNK Y_UNK L_UNK V_UNK C_UNK G_UNK E_UNK R_UNK G_UNK F_UNK F_UNK Y_UNK T_UNK P_UNK K_UNK T_UNK R_UNK R_UNK E_UNK A_UNK E_UNK D_UNK L_UNK Q_UNK V_UNK G_UNK Q_UNK V_UNK E_UNK L_UNK G_UNK G_UNK __UNK

-----------------------------
|       Predicted DNA       |
-----------------------------
ATGGCTTTATGGATGCGTCTGCTGCCGCTGCTGGCGCTGCTGGCGCTGTGGGGCCCGGACCCGGCGGCGGCGTTTGTGAATCAGCACCTGTGCGGCAGCCACCTGGTGGAAGCGCTGTATCTGGTGTGCGGTGAGCGCGGCTTCTTCTACACGCCCAAAACCCGCCGCGAAGCGGAAGATCTGCAGGTGGGCCAGGTGGAGCTGGGCGGCTAA

Generating Multiple Variable Sequences

Set deterministic=False to generate variable sequences, and control the variability with temperature:

• High temperatures (e.g. above 1) may yield DNA sequences that do not translate to the input protein.
• Set match_protein=True to ensure predicted DNA sequences translate to the input protein.
• Generate multiple sequences by setting num_sequences to a value greater than 1.
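
For instance, reusing the model, tokenizer, and device from the example above, a sampling run might look like this (a minimal sketch; all arguments are documented in the table below):

# Sample several candidate sequences; match_protein guards against
# mistranslations that can occur at higher temperatures.
outputs = predict_dna_sequence(
    protein=protein,
    organism=organism,
    device=device,
    tokenizer=tokenizer,
    model=model,
    deterministic=False,
    temperature=0.5,
    top_p=0.95,
    num_sequences=5,
    match_protein=True,
)
for output in outputs:
    print(format_model_output(output))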

Batch Inference

You can use the inference template to set up your dataset for batch inference in Google Colab. A sample dataset is provided under /demo. A typical inference takes 1-3 seconds, depending on available compute.
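
Outside Colab, a batch run can be sketched as a simple loop over your dataset (illustrative only; the file name and column names here are assumptions, not the template's):

import csv

# Reuses the model, tokenizer, and device loaded in the Use Case example.
with open("batch_input.csv") as f:
    rows = list(csv.DictReader(f))

predictions = [
    predict_dna_sequence(
        protein=row["protein"],
        organism=row["organism"],
        device=device,
        tokenizer=tokenizer,
        model=model,
    )
    for row in rows
]
for prediction in predictions:
    print(format_model_output(prediction))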


Arguments of predict_dna_sequence

| Argument | Type | Description | Default |
| --- | --- | --- | --- |
| protein | str | Input protein sequence | Required |
| organism | Union[int, str] | Organism ID (integer) or name (string), e.g. "Escherichia coli general" | Required |
| device | torch.device | PyTorch device object specifying whether to run on CPU or GPU | Required |
| tokenizer | Union[str, PreTrainedTokenizerFast, None] | File path to load a tokenizer from, a pre-loaded tokenizer object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | None |
| model | Union[str, torch.nn.Module, None] | File path to load a model from, a pre-loaded model object, or None to load from HuggingFace's "adibvafa/CodonTransformer" | None |
| attention_type | str | Attention mechanism to use: 'block_sparse' for memory-efficient or 'original_full' for standard attention | "original_full" |
| deterministic | bool | If True, uses deterministic decoding (picks the most likely tokens); if False, samples tokens based on probabilities adjusted by temperature | True |
| temperature | float | Controls randomness in non-deterministic mode: lower values (e.g. 0.2) are conservative and pick high-probability tokens, higher values (e.g. 0.8) allow more diversity. Must be positive | 0.2 |
| top_p | float | Nucleus sampling threshold: only tokens with cumulative probability up to this value are considered. Balances diversity and quality of predictions. Must be between 0 and 1 | 0.95 |
| num_sequences | int | Number of different DNA sequences to generate; only used when deterministic=False. Each sequence is sampled according to temperature and top_p. Must be positive | 1 |
| match_protein | bool | Constrains predictions to codons that translate back to the exact input protein sequence. Only recommended when using high temperatures or error-prone input proteins (e.g. not starting with methionine or containing many repetitions) | False |

Returns: Union[DNASequencePrediction, List[DNASequencePrediction]] containing predicted DNA sequence(s) and metadata.
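
With deterministic=True (or num_sequences=1) a single DNASequencePrediction is returned; with num_sequences greater than 1 you get a list. A small sketch of unpacking either shape (the predicted_dna field name is our assumption; verify against the package source):

# Continuing from the Use Case example above.
results = output if isinstance(output, list) else [output]
for result in results:
    print(result.predicted_dna)  # assumed field on DNASequencePrediction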

Installation

Install CodonTransformer via pip:

pip install CodonTransformer

Or clone the repository:

git clone https://github.com/adibvafa/CodonTransformer.git
cd CodonTransformer
pip install -r requirements.txt

The package requires Python >= 3.9 and supports all major operating systems. Installation takes about 10-30 seconds, depending on which of the requirements (listed in requirements.txt) are already installed.


Finetuning CodonTransformer

To finetune CodonTransformer on your own data, follow these steps:

  1. Prepare your dataset

    Create a CSV file with the following columns:

    • dna: DNA sequences (string, preferably uppercase ATCG)
    • protein: Protein sequences (string, preferably uppercase amino acid letters)
    • organism: Target organism (string or int, must be from ORGANISM2ID in CodonUtils)

    Note:

    • Use organisms from the FINE_TUNE_ORGANISMS list for best results.
    • For E. coli, use Escherichia coli general.
    • DNA sequences should ideally contain only A, T, C, and G. Ambiguous codons are replaced with 'UNK' for tokenization.
    • Protein sequences should contain standard amino acid letters from AMINO_ACIDS in CodonUtils. Ambiguous amino acids are replaced according to the AMBIGUOUS_AMINOACID_MAP in CodonUtils.
    • End your DNA sequences with a stop codon from STOP_CODONS in CodonUtils. If not present, a 'UNK' stop codon will be added in preprocessing.
    • End your protein sequence with _ or *. If neither is present, a _ will be added in preprocessing. A minimal example CSV is sketched after this list.
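
    A minimal illustrative CSV (the sequences are hypothetical and shown only to fix the expected shape of the file):

    dna,protein,organism
    ATGGCTGCTTAA,MAA_,Escherichia coli general
    ATGGGCAGCTAA,MGS_,Escherichia coli general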
  2. Prepare training data

    Use the prepare_training_data function from CodonData to prepare training data from your dataset.

    from CodonTransformer.CodonData import prepare_training_data
    prepare_training_data('your_data.csv', 'your_dataset_directory/training_data.json')


  3. Run the finetuning script

    Execute finetune.py with appropriate arguments, for example:

     python finetune.py \
        --dataset_dir 'your_dataset_directory/training_data.json' \
        --checkpoint_dir 'your_checkpoint_directory' \
        --checkpoint_filename 'finetune.ckpt' \
        --batch_size 6 \
        --max_epochs 15 \
        --num_workers 5 \
        --accumulate_grad_batches 1 \
        --num_gpus 4 \
        --learning_rate 0.00005 \
        --warmup_fraction 0.1 \
        --save_every_n_steps 512 \
        --seed 23

    This script automatically loads the pretrained model from Hugging Face and finetunes it on your dataset. For an example of a SLURM job request, see the slurm directory in the repository.

Handling Ambiguous Amino Acids

CodonTransformer provides a flexible system for handling ambiguous amino acids through the ProteinConfig class. By default, CodonUtils includes a predefined mapping for ambiguous amino acids, but users can customize this behavior:

from CodonTransformer.CodonUtils import ProteinConfig

# Configure protein preprocessing
config = ProteinConfig()
config.set('ambiguous_aminoacid_behavior', 'standardize_random')
config.set('ambiguous_aminoacid_map_override', {'X': ['A', 'G', 'S']})

# Run CodonTransformer
...

Options for ambiguous_aminoacid_behavior include 'standardize_random', used above; see CodonUtils for the full set of supported behaviors. Users can override the default mapping with ambiguous_aminoacid_map_override, as shown in the example.

Star History

[![Star History Chart](https://api.star-history.com/svg?repos=adibvafa/codontransformer&type=Date)](https://star-history.com/#adibvafa/codontransformer&Date)


Key Features

CodonData

The CodonData subpackage offers tools for preprocessing NCBI or Kazusa databases and managing codon-related data operations. It includes comprehensive functions for working with DNA sequences, protein sequences, and codon frequencies, providing a robust toolkit for sequence preprocessing and codon frequency analysis across different organisms.
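
As an illustration of the kind of bookkeeping this subpackage automates, codon frequencies for a single in-frame sequence can be computed in plain Python (a conceptual sketch, not the package's own API):

from collections import Counter

def codon_frequencies(dna: str) -> dict:
    """Relative frequency of each codon in an in-frame DNA sequence."""
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

print(codon_frequencies("ATGGCTGCTGGCTAA"))
# {'ATG': 0.2, 'GCT': 0.4, 'GGC': 0.2, 'TAA': 0.2}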


CodonPrediction

The CodonPrediction subpackage is an essential component of CodonTransformer, used for tokenizing input, loading models, predicting DNA sequences, and providing helper functions for data processing. It offers a comprehensive toolkit for working with the CodonTransformer model, covering tasks from model loading and configuration to various types of codon optimization and DNA sequence prediction.

Overview

This subpackage contains functions and classes that handle the core prediction functionality of CodonTransformer. It includes tools for working with the BigBird transformer model, tokenization, and various codon optimization strategies.


CodonEvaluation

The CodonEvaluation subpackage offers functions for calculating evaluation metrics related to codon usage and DNA sequence analysis, used for assessing the quality and characteristics of DNA sequences, especially in codon optimization. It provides a comprehensive toolkit for evaluating DNA sequences and codon usage, supporting genetic data analysis within the CodonTransformer package.

Overview

The CodonEvaluation module includes functions to compute metrics such as Codon Adaptation Index (CAI)/Codon Similarity Index (CSI) weights, GC content, codon frequency distribution (CFD), %MinMax, sequence complexity, and sequence similarity. These metrics are valuable for analyzing and comparing DNA sequences across different organisms.
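
As a concrete example of one such metric, GC content is simply the fraction of G and C bases in a sequence (a plain-Python sketch, not the CodonEvaluation API):

def gc_content(dna: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    dna = dna.upper()
    return (dna.count("G") + dna.count("C")) / len(dna) if dna else 0.0

print(round(gc_content("ATGGCGCAA"), 3))  # 0.556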


CodonUtils

The CodonUtils subpackage contains constants and helper functions essential for working with genetic sequences, amino acids, and organism data in the CodonTransformer package. It provides tools for genetic sequence analysis, organism identification, and data processing, forming the foundation for many core functionalities within the CodonTransformer package.
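
For example, the ORGANISM2ID constant referenced in the finetuning instructions maps organism names to the integer IDs that predict_dna_sequence also accepts (assuming a plain dictionary interface):

from CodonTransformer.CodonUtils import ORGANISM2ID

# Either the name or the integer ID can be passed as the `organism`
# argument of predict_dna_sequence.
organism_id = ORGANISM2ID["Escherichia coli general"]
print(organism_id)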


CodonJupyter

The CodonJupyter subpackage offers Jupyter-specific functions for displaying interactive widgets, facilitating user interaction with the CodonTransformer package in a Jupyter notebook environment. It improves the user experience by providing interactive and visually appealing interfaces for input and output.


Usage

Check out our Google Colab Notebook for an example use case!

Contribution

We welcome contributions to CodonTransformer! Please fork the repository and submit a pull request. For major changes, please open an issue first to discuss what you would like to change.

Citation

If you use CodonTransformer or our data in your research, please cite our work:

@article{Fallahpour2024.09.13.612903,
    author = {Fallahpour, Adibvafa and Gureghian, Vincent and Filion, Guillaume J. and Lindner, Ariel B. and Pandi, Amir},
    title = {CodonTransformer: a multispecies codon optimizer using context-aware neural networks},
    elocation-id = {2024.09.13.612903},
    year = {2024},
    doi = {10.1101/2024.09.13.612903},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2024/09/13/2024.09.13.612903},
    eprint = {https://www.biorxiv.org/content/early/2024/09/13/2024.09.13.612903.full.pdf},
    journal = {bioRxiv}
}