
CAMEOX (CAMEOs eXtended)


Overview

This repository contains code and data related to CAMEOX (CAMEOs eXtended), a parallelized extension of CAMEOS (Constraining Adaptive Mutations using Engineered Overlapping Sequences) developed by LLNL (Lawrence Livermore National Laboratory). The original CAMEOS software was developed by Tom Blazejewski at Wang Lab (Columbia University). CAMEOX is the computational core of the GENTANGLE pipeline for automated design of gene entanglements.

Installation

As part of GENTANGLE

The recommended installation method is as part of the GENTANGLE pipeline, either by cloning the GENTANGLE repository or, even better, by downloading the Singularity container, which eases the process of setting up the many requirements of CAMEOX. The DATANGLE repository should also be obtained, as it provides data examples and templates. Please see this link for details on these approaches.

Only CAMEOX source code

git clone https://github.com/BiosecSFA/cameox.git

CAMEOX improvements over CAMEOS

The main improvements in CAMEOX relative to CAMEOS are:

Performance and Scalability

Flexibility and Customization

Output and Analysis

Usability

CAMEOX parameters

Format of the parameter file

The improvements in CAMEOX over CAMEOS required some changes to the TSV input/parameter file with respect to CAMEOS, starting at column 7. Each line in the file should now have the following columns:

  1. Output dir: relative base directory where the output directory will be created.
  2. Mark gene name: gene ID string for the 'mark' gene; needed as a key for looking up some values associated with genes in files.
  3. Deg gene name: gene ID string for the corresponding 'deg' gene.
  4. Mark JLD file: relative path to mark gene JLD file.
  5. Deg JLD file: relative path to deg gene JLD file.
  6. Mark HMM file: relative path to mark gene HMM directory and .hmm file.
  7. Deg HMM file: relative path to deg gene HMM directory and .hmm file.
  8. Population size: number of seeds that will enter the optimization loop, i.e. number of individual HMM solutions to greedily optimize.
  9. Frame (placeholder): p1/p2/p3, but the entanglement frame depends on the order of the genes in the input (see subsection below for details).
  10. Relative change threshold: minimum threshold for the relative number of variants changing, used for setting a dynamic limit on the number of iterations; typical value for standard CAMEOX runs is 0, or very close.
  11. Host taxid: NCBI Taxonomic ID for the host of the entanglement, used by the host generalization subsystem (the default value is 562, for E. coli; see subsection below for details).
  12. Pseudolikelihood weights: choice of pseudolikelihood weights for optimization, which should be one of the following options: equal, rand, close2mark, close2deg (see subsection below for details).
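
As a minimal sketch of how one line of such a file could be read and checked, the Python snippet below splits a line on tabs and converts the numeric columns. The class name, field names, and function name are hypothetical (they are not part of CAMEOX); only the column order and the four weight options come from the list above. The sketch assumes that all 12 columns are present.

```python
from dataclasses import dataclass

# The four options listed for column 12.
PLL_WEIGHT_OPTIONS = {"equal", "rand", "close2mark", "close2deg"}


@dataclass
class CameoxParams:
    """One line of the parameter file (field names are illustrative)."""
    out_dir: str
    mark_gene: str
    deg_gene: str
    mark_jld: str
    deg_jld: str
    mark_hmm: str
    deg_hmm: str
    pop_size: int
    frame: str             # placeholder column (p1/p2/p3)
    rel_change_thr: float
    host_taxid: int
    pll_weights: str


def parse_param_line(line: str) -> CameoxParams:
    """Parse one tab-separated line; assumes all 12 columns are given."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 tab-separated columns, got {len(fields)}")
    if fields[11] not in PLL_WEIGHT_OPTIONS:
        raise ValueError(f"unknown pseudolikelihood weights option: {fields[11]!r}")
    return CameoxParams(
        *fields[:7],
        pop_size=int(fields[7]),
        frame=fields[8],
        rel_change_thr=float(fields[9]),
        host_taxid=int(fields[10]),
        pll_weights=fields[11],
    )


# Hypothetical usage with made-up file names.
demo_line = "\t".join([
    "output/", "geneA", "geneB", "jlds/geneA.jld", "jlds/geneB.jld",
    "hmms/geneA.hmm", "hmms/geneB.hmm", "20000", "p1", "0", "562", "equal",
])
print(parse_param_line(demo_line))
```

Checking the column count up front is a deliberate choice: a line with a missing trailing column fails loudly instead of being silently misread.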

Example

Example of a single-line CAMEOX parameter file with Pseudomonas protegens Pf-5 (NCBI taxid: 220664) as host:

    output/ aroB_pf5    infA_pf5    jlds/aroB_pf5.jld   jlds/infA_pf5.jld   hmms/aroB_pf5.hmm   hmms/infA_pf5.hmm   20000   p1  0   220664

Entanglement frame

As indicated above in the input format, the frame parameter in the parameter/input file is a placeholder in both CAMEOS and CAMEOX. The entanglement frame is actually selected by the order of the genes in the input. In CAMEOS terminology, the "mark" gene is typically the shorter gene and the "deg" gene is the longer gene. Inverting that order changes the effective frame of entanglement relative to the longer gene. CAMEOX is aware of the working entanglement frame and reports it at the start of every run:

    Processing entanglement [shorter_prot]⥂[longer_prot] in frame [real_frame]

where [real_frame] can be either 5'3'F2 or 5'3'F3.
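
When post-processing run logs, it can be handy to recover the reported frame programmatically. The sketch below is one possible way to do so with a regular expression. It assumes that the square brackets in the message above are placeholders rather than literal characters and that protein names contain no whitespace; the function name and the gene names in the usage line are hypothetical.

```python
import re

# Pattern for the start-of-run message; bracketed names in the documented
# message are assumed to be placeholders, not literal brackets.
FRAME_LINE = re.compile(
    r"Processing entanglement (?P<shorter>[^\s⥂]+)⥂(?P<longer>\S+)"
    r" in frame (?P<frame>5'3'F[23])"
)


def parse_frame_line(line):
    """Return (shorter_prot, longer_prot, frame), or None if no match."""
    match = FRAME_LINE.search(line)
    if match is None:
        return None
    return match.group("shorter"), match.group("longer"), match.group("frame")


# Hypothetical log line with made-up protein names.
print(parse_frame_line("Processing entanglement protA⥂protB in frame 5'3'F2"))
```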

Host selection

As previously mentioned, CAMEOS codon optimization is hardwired for E. coli, while CAMEOX includes a generalized embedded codon optimization that reads from an external database. This database is composed of one TSV file for each organism used as a host for the entanglements. Each filename follows the format CUT_{taxid}.tsv, where CUT stands for Codon Usage Table and taxid is the taxonomic identifier for the organism in the NCBI Taxonomy database. Each TSV file needs two columns: 'codon' for the codons and 'freq' for the frequencies. As an example, please see the Pseudomonas protegens Pf-5 (NCBI taxid: 220664) CUT file. The DATANGLE repository also contains the E. coli (NCBI taxid: 562) CUT file, directly usable by CAMEOX.
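
To make the file layout concrete, here is a small Python sketch that loads a CUT file into a dictionary. It is not CAMEOX's own reader: the function name is hypothetical, and the sketch assumes that the file is tab-separated with a header row naming the 'codon' and 'freq' columns, and that any extra columns can be ignored.

```python
import csv
from pathlib import Path


def load_cut(data_dir, taxid):
    """Load CUT_{taxid}.tsv from data_dir into a {codon: freq} dict.

    Assumes a header row containing the 'codon' and 'freq' column names.
    """
    path = Path(data_dir) / f"CUT_{taxid}.tsv"
    with path.open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = set(reader.fieldnames or [])
        if not {"codon", "freq"} <= header:
            raise ValueError(f"{path} needs 'codon' and 'freq' columns")
        return {row["codon"].upper(): float(row["freq"]) for row in reader}
```

A call such as `load_cut("data", 220664)` would then look for `data/CUT_220664.tsv`, where `data` is a placeholder for the actual data directory.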

If additional hosts are targeted, a quick method to get the CUT is to consult an online CoCoPUTs service, retrieve the CUT for the desired host with NCBI taxonomic identifier hostTaxId, and save it in the described format in the file CUT_{hostTaxId}.tsv, which should be placed in the root of the CAMEOX data directory.
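
Once the codon frequencies for a new host have been retrieved, saving them in the expected layout could look like the sketch below. The function name is hypothetical, the frequencies are assumed to be already available as a Python dictionary, and the sketch writes the values unchanged (it makes no assumption about their units or normalization).

```python
import csv
from pathlib import Path


def write_cut(data_dir, host_taxid, codon_freqs):
    """Write {codon: freq} as CUT_{host_taxid}.tsv in data_dir; return the path."""
    path = Path(data_dir) / f"CUT_{host_taxid}.tsv"
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["codon", "freq"])  # the two required columns
        for codon in sorted(codon_freqs):
            writer.writerow([codon, codon_freqs[codon]])
    return path
```

Here `data_dir` stands for the root of the CAMEOX data directory.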

Pseudolikelihood weights for optimization

As indicated above in the input format, the last parameter selects the pseudolikelihood (PLL) weights for optimization. Before the MRF optimization (main optimization loop), each gene of each pair of HMM seeds is assigned a weight. Within a pair, the weights sum to 1.0 and indicate the relative importance of each gene's PLL (as calculated by the respective MRF models) for the total pair score. The options for this parameter are equal, rand, close2mark, and close2deg.
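
As a rough illustration of how a pair of weights summing to 1.0 can set the relative importance of the two PLLs, consider the Python sketch below. It is built on assumptions and does not reproduce CAMEOX's implementation: the total pair score is assumed to be a weighted sum, `equal` is assumed to mean 0.5 for each gene, and `rand` is assumed to mean a uniformly random split. The `close2mark` and `close2deg` options are deliberately left unimplemented because their definitions are not given here. All function names are hypothetical.

```python
import random


def pll_weights(option, rng=None):
    """Return (w_mark, w_deg), which always sum to 1.0.

    The semantics of 'equal' and 'rand' are assumptions made for illustration.
    """
    if option == "equal":
        return 0.5, 0.5
    if option == "rand":
        if rng is None:
            rng = random.Random()
        w_mark = rng.random()
        return w_mark, 1.0 - w_mark
    raise NotImplementedError(f"option {option!r} is not covered by this sketch")


def pair_score(pll_mark, pll_deg, weights):
    """Combine both PLLs as a weighted sum (an assumed form of the pair score)."""
    w_mark, w_deg = weights
    return w_mark * pll_mark + w_deg * pll_deg


# Made-up PLL values, for illustration only.
print(pair_score(-10.0, -20.0, pll_weights("equal")))
```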

Notes

Further documentation

License

CAMEOX is released as part of the GENTANGLE pipeline (LLNL-CODE-845475) and is distributed under the terms of the GNU Affero General Public License v3.0 (see LICENSE). CAMEOX is built upon CAMEOS, which was released under an MIT license (see LICENSE-CAMEOS).

SPDX-License-Identifier: AGPL-3.0-or-later

Funding

This work is supported by the U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research, Lawrence Livermore National Laboratory Secure Biosystems Design SFA “From Sequence to Cell to Population: Secure and Robust Biosystems Design for Environmental Microorganisms”. Work at LLNL is performed under the auspices of the U.S. Department of Energy at Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.


If you use CAMEOX in your research, please cite the following papers. Thanks!

GENTANGLE: integrated computational design of gene entanglements. Jose Manuel Martí, Chloe Hsu, Charlotte Rochereau, Tomasz Blazejewski, Hunter Nisonoff, Sean P. Leonard, Christina S. Kang-Yun, Jennifer Chlebek, Dante P. Ricci, Dan Park, Harris Wang, Jennifer Listgarten, Yongqin Jiao, Jonathan E. Allen. bioRxiv 2023.11.09.565696; doi: https://doi.org/10.1101/2023.11.09.565696

Blazejewski T, Ho HI, Wang HH. Synthetic sequence entanglement augments stability and containment of genetic information in cells. Science. 2019 Aug 9;365(6453):595-8. https://doi.org/10.1126/science.aav5477