bwicky / oligomer_hallucination

MIT License

Oligomer hallucination with AlphaFold2

Design (hallucinate) cyclic-symmetric protein assemblies starting from only the specification of a homo-oligomeric valency.

This repository accompanies the oligomer hallucination paper; the initial release is archived on Zenodo with a DOI.

The ProteinMPNN paper and code.

Getting started

  1. Clone the repo (NB you will also need to have AlphaFold2 installed).

    git clone https://github.com/bwicky/oligomer_hallucination

     NB all work reported in the oligomer hallucination paper was done with the initial release of AlphaFold2 (Jul 15, 2021):

    commit 1109480e6f38d71b3b265a4a25039e51e2343368

  2. Create the conda environment using the SE3.yml file.

    cd oligomer_hallucination
    conda env create -f SE3.yml

  3. Change the shebang in ./oligomer_hallucination.py to the Python interpreter of your conda environment. Because the sed expression uses / as its delimiter, every / in the replacement path must be escaped as \/.

    sed -i 's/\/software\/conda\/envs\/SE3\/bin\/python/<path_to_your_conda_env>/g' oligomer_hallucination.py

  4. Change the paths to your AlphaFold2 install in ./modules/af2_net.py and ./modules/losses.py (escape / in the replacement path in the same way).

    sed -i 's/\/projects\/ml\/alphafold\/alphafold_git\//<path_to_your_alphafold2_install>/g' ./modules/*.py

  5. If using the tmalign- and dssp-based losses, you will also need to install these packages [TM-align, DSSP] and update the paths to their executables in ./modules/losses.py:

modules/losses.py:    dssp_tuple = dssp_dict_from_pdb_file(pdbfile, DSSP="/home/lmilles/lm_bin/dssp")[0]
modules/losses.py:    p = subprocess.Popen(f'/home/lmilles/lm_bin/TMalign {template} {temp_pdbfile} | grep -E "RMSD|TM-score=" ', stdout=subprocess.PIPE, shell=True)
modules/losses.py:    p = subprocess.Popen(f'/home/lmilles/lm_bin/TMalign {template} {temp_pdbfile} -I {force_alignment} | grep -E "RMSD|TM-score=" ', stdout=subprocess.PIPE, shell=True)
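
The path substitutions in the setup steps above can also be done without sed, which avoids escaping every / in the replacement. The following is a minimal sketch, not part of the repository: the helper name `replace_in_files` is hypothetical, and the new paths in the usage comment are placeholders for your own install locations.

```python
from pathlib import Path


def replace_in_files(files, old, new):
    """Replace every occurrence of `old` with `new` in each file.

    Returns the list of files that were actually modified.
    """
    changed = []
    for f in map(Path, files):
        text = f.read_text()
        if old in text:
            f.write_text(text.replace(old, new))
            changed.append(f)
    return changed


# Example usage from the repository root (target paths are placeholders):
#   modules = sorted(Path("modules").glob("*.py"))
#   replace_in_files(modules, "/projects/ml/alphafold/alphafold_git/", "/opt/alphafold/")
#   replace_in_files(modules, "/home/lmilles/lm_bin/", "/usr/local/bin/")
```

Plain string replacement is used on purpose: the hard-coded paths are literal strings, so no regular-expression escaping is needed.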

Examples

Design a homo-trimer, with each protomer composed of 100 amino acids.

Design a homo-hexamer, with each protomer composed of 50 amino acids, optimising for the dual_cyclic loss.

Design a monomeric protein containing eight repeats, each 30 amino acids in length.

Design a homo-dimer, starting from the sequence of Top7.
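
A minimal sketch of how the four runs described above might be invoked, assembled from the flags documented under Options. The exact command lines are an assumption rather than the authors' own. The helper `hallucination_command` and the `--out` names are hypothetical, and the Top7 sequence is left as a placeholder.

```python
import shlex


def hallucination_command(oligo, out, length=None, seq=None, loss=None, single_chain=False):
    """Assemble an oligomer_hallucination.py command line from documented options."""
    cmd = ["./oligomer_hallucination.py", "--oligo", oligo]
    if length is not None:
        cmd += ["--L", str(length)]
    if seq is not None:
        cmd += ["--seq", seq]
    if loss is not None:
        cmd += ["--loss", loss]
    if single_chain:
        cmd.append("--single_chain")
    cmd += ["--out", out]  # --out must always be specified
    return shlex.join(cmd)


TOP7_SEQ = "<Top7_sequence>"  # placeholder, substitute the real sequence
examples = [
    hallucination_command("AAA+", "trimer", length=100),                         # homo-trimer
    hallucination_command("AAAAAA+", "hexamer", length=50, loss="dual_cyclic"),  # homo-hexamer
    hallucination_command("AAAAAAAA+", "repeat", length=30, single_chain=True),  # 8-repeat monomer
    hallucination_command("AA+", "top7_dimer", seq=TOP7_SEQ),                    # seeded homo-dimer
]
for line in examples:
    print(line)
```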

Outputs

A folder with the name specified by --out containing:

Features

Design (hallucination) is performed by an MCMC search in sequence space that optimizes user-defined losses, composed of AlphaFold2 metrics, geometric constraints, and/or secondary-structure definitions, in order to match the design objective.
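
The search loop can be illustrated with a toy sketch. It assumes a standard Metropolis acceptance criterion and a temperature that halves every `half_life` steps, mirroring the --T_init and --half_life options. The loss here is a made-up stand-in (fraction of non-alanine residues) for the AlphaFold2-derived losses, and a single point mutation is proposed per step. None of this is taken from the repository's implementation.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_loss(seq):
    """Stand-in for an AlphaFold2-derived loss: fraction of residues that are not 'A'."""
    return sum(aa != "A" for aa in seq) / len(seq)


def hallucinate(length=30, steps=2000, t_init=0.01, half_life=1000, exclude="C", seed=0):
    rng = random.Random(seed)
    alphabet = [aa for aa in AMINO_ACIDS if aa not in exclude]
    current = [rng.choice(alphabet) for _ in range(length)]
    current_loss = toy_loss(current)
    for step in range(steps):
        # Exponential temperature decay with the given half-life.
        temperature = t_init * 0.5 ** (step / half_life)
        # Propose a single random point mutation.
        trial = list(current)
        trial[rng.randrange(length)] = rng.choice(alphabet)
        trial_loss = toy_loss(trial)
        # Metropolis criterion: always accept improvements,
        # accept worse moves with probability exp(-delta / T).
        delta = trial_loss - current_loss
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_loss = trial, trial_loss
    return "".join(current), current_loss


sequence, loss = hallucinate()
print(sequence, round(loss, 3))
```

In the real tool each loss evaluation requires an AlphaFold2 prediction, so the number of steps dominates the runtime.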

Options

optional arguments:
  -h, --help            show this help message and exit
  --oligo OLIGO         oligomer(s) definitions (comma-separated string, no space). Numbers and types of subunits (protomers), and design type
                        (positive or negative) specifying each oligomer. Protomers are defined by unique letters, and strings indicate oligomeric
                        compositions. The last character of each oligomer has to be [+] or [-] to indicate positive or negative design of that
                        oligomer (e.g. AAAA+,AB+). The number of unique protomers must match --L / --seq. Must be specified.
  --L L                 lengths of each protomer (comma-separated, no space). Must be specified if not using --seq or a .af2h config file.
  --seq SEQ             seed sequence for each protomer (comma-separated, no space). Optional.
  --out OUT             the prefix appended to the output files. Must be specified.
  --single_chain        this option will generate sequence-symmetric repeat proteins instead of oligomers by removing chain breaks (default: False).
  --exclude_AA EXCLUDE_AA
                        amino acids to exclude during hallucination. Must be a continuous string (no spaces) in one-letter code format (default: C).
  --mutation_rate MUTATION_RATE
                        number of mutations at each MCMC step (start-finish, stepped linear decay). Should probably be scaled with protomer length
                        (default: 3-1).
  --select_positions SELECT_POSITIONS
                        how to select positions for mutation at each step. Choose from [random, plddt::quantile, FILE.af2h::quantile]. FILE.af2h needs
                        to be a file specifying the probability of mutation at each site. Optional arguments can be given with :: e.g. plddt::0.25
                        will only mutate the 25% lowest plddt positions (default: random).
  --mutation_method MUTATION_METHOD
                        how to mutate selected positions. Choose from [uniform, frequency_adjusted, blosum62, pssm] (default: frequency_adjusted).
  --loss LOSS           the loss function used during optimization. Choose from [plddt, ptm, pae, dual, cyclic, dual_cyclic, pae_sub_mat, pae_asym,
                        tmalign (requires --template), dual_tmalign (requires --template), aspect_ratio, frac_dssp, min_frac_dssp (requires
                        --dssp_fractions_specified), pae_asym_tmalign (in development), entropy (in development)]. Multiple losses can be combined as
                        a comma-separated string of loss_name:args units (and weighted with --loss_weights).
                        loss_0_name::loss0_param0;loss0_param1,loss_1_name::[loss_1_configfile.conf] ... (default: dual).
  --loss_weights LOSS_WEIGHTS
                        if a combination of losses is passed, specify relative weights of each loss to the global loss by providing a comma-separated
                        list of relative weights. E.g. 2,1 will make the first loss count double relative to the second one (default: equal weights).
  --oligo_weights OLIGO_WEIGHTS
                        contribution of the loss of each oligomer to the global loss, provided as a comma-separated list of relative weights (default:
                        equal weights).
  --T_init T_INIT       starting temperature for simulated annealing. Temperature is decayed exponentially (default: 0.01).
  --half_life HALF_LIFE
                        half-life for the temperature decay during simulated annealing (default: 1000).
  --steps STEPS         number of steps for the MCMC trajectory (default: 5000).
  --tolerance TOLERANCE
                        the tolerance on the loss sliding window for terminating the MCMC trajectory early (default: None).
  --model MODEL         AF2 model (_ptm) used during prediction. Choose from [1, 2, 3, 4, 5] (default: 4).
  --amber_relax AMBER_RELAX
                        amber relax pdbs written to disk, 0=do not relax, 1=relax every prediction (default: 0).
  --recycles RECYCLES   the number of recycles through the network used during structure prediction. Larger numbers increase accuracy but linearly
                        affect runtime (default: 1).
  --msa_clusters MSA_CLUSTERS
                        the number of MSA clusters used during feature generation (?). Larger numbers increase accuracy but significantly affect
                        runtime (default: 1).
  --output_pae          output the pAE (predicted alignment error) matrix for each accepted step of the MCMC trajectory (default: False).
  --timestamp           timestamp output and every PDB written to disk with: %Y%m%d_%H%M%S_%f (default: False).
  --template TEMPLATE   template PDB for use with tmalign-based losses (default: None).
  --dssp_fractions_specified DSSP_FRACTIONS_SPECIFIED
                        dssp fractions specified for frac_dssp loss as E(beta sheet), H(alpha helix), notEH(other) e.g. 0.8,None,None will enforce 80%
                        beta sheet; or 0.5,0,None will enforce 50% beta sheet, no helices (default: None).
  --template_alignment TEMPLATE_ALIGNMENT
                        enforce tmalign alignment with fasta file (default: None).
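
To make the --oligo syntax concrete, here is a minimal sketch of how a definition string such as AAAA+,AB- could be parsed and checked against --L. The function name `parse_oligo` is hypothetical, and mapping the lengths to protomer letters in alphabetical order is an assumption; the script's own parser may differ.

```python
def parse_oligo(oligo_arg, lengths_arg):
    """Parse e.g. 'AAAA+,AB-' and '100,80' into oligomer definitions and protomer lengths."""
    lengths = [int(x) for x in lengths_arg.split(",")]
    oligomers = []
    for definition in oligo_arg.split(","):
        # The last character marks positive (+) or negative (-) design.
        if len(definition) < 2 or definition[-1] not in "+-" or not definition[:-1].isalpha():
            raise ValueError(f"bad oligomer definition: {definition!r}")
        oligomers.append({"chains": definition[:-1], "positive": definition[-1] == "+"})
    # Unique protomer letters across all oligomers must match the number of lengths.
    protomers = sorted({c for o in oligomers for c in o["chains"]})
    if len(protomers) != len(lengths):
        raise ValueError(f"{len(protomers)} unique protomers but {len(lengths)} lengths")
    # Assumption: lengths are listed in alphabetical order of the protomer letters.
    return oligomers, dict(zip(protomers, lengths))


oligomers, protomer_lengths = parse_oligo("AAAA+,AB-", "100,80")
print(oligomers, protomer_lengths)
```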

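The --mutation_rate and --select_positions options together decide how many and which positions are mutated at each step. The sketch below shows one plausible reading: a 3-1 rate is interpolated linearly over the trajectory and rounded to whole mutations, and plddt::0.25 restricts candidates to the 25% of positions with the lowest pLDDT. The function names and pLDDT values are made up, and the script's exact rounding and quantile rules may differ.

```python
import random


def mutations_per_step(rate, steps):
    """'3-1' over N steps: linear interpolation from 3 to 1, rounded to whole mutations."""
    start, finish = (int(x) for x in rate.split("-"))
    if steps == 1:
        return [start]
    return [round(start + (finish - start) * i / (steps - 1)) for i in range(steps)]


def lowest_plddt_positions(plddt, quantile):
    """Indices of the `quantile` fraction of positions with the lowest pLDDT."""
    n = max(1, int(len(plddt) * quantile))
    ranked = sorted(range(len(plddt)), key=lambda i: plddt[i])
    return sorted(ranked[:n])


schedule = mutations_per_step("3-1", steps=5000)
plddt = [90.1, 42.0, 88.5, 35.7, 71.2, 93.3, 60.4, 55.0]  # made-up per-residue values
candidates = lowest_plddt_positions(plddt, 0.25)
# Never ask for more mutations than there are candidate positions.
chosen = random.sample(candidates, k=min(schedule[0], len(candidates)))
print(schedule[0], schedule[-1], candidates, sorted(chosen))
```
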
Example .af2h file

The following config file enables design at all positions set to 1 (equal probability of picking those sites for mutation), and disallows design at all positions that are set to 0.

>A
DEEQEKAEEWLKEAEEMLEQAKRAKDEEELLKLLVRLLELSVELAKIIQKTKDEEKKKELLEINKRLIEVIKELLRRLK
1,1,1,1,1,1,0,1,1,0,1,1,1,0,1,1,0,1,1,1,0,1,1,0,1,1,1,1,1,0,0,1,0,0,0,1,0,0,1,0,0,1,1,0,0,1,0,0,1,1,0,1,1,1,1,1,1,1,1,0,1,1,0,0,1,1,0,1,1,0,0,1,1,0,1,1,0,0,1
>B
QEELAELIELILEVNEWLQRWEEEGLKDSEELVKEYEKIVEKIKELVKMAEEGHDEEEAEEEAKKLKKKAEEILREAEKG
1,1,1,0,0,1,0,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,1,1,0,1,1,0,1,1,0,0,1,1,0,1,1,0,0,1,0,0,1,1,0,0,1,0,0,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,0,1,1,1,0,1,1,0,1,1,1,0,1,1,0
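
A minimal sketch of reading a file in this layout (header, sequence, comma-separated per-site values) follows. The parser name `parse_af2h` and the short toy records are hypothetical, and the check that each mask has one value per residue is an assumption based on the example above, not on the script's own reader.

```python
def parse_af2h(text):
    """Parse records of the form: '>name', sequence line, comma-separated per-site values."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per protomer")
    protomers = {}
    for header, seq, mask_line in zip(lines[0::3], lines[1::3], lines[2::3]):
        if not header.startswith(">"):
            raise ValueError(f"expected a '>' header, got {header!r}")
        weights = [float(x) for x in mask_line.split(",")]
        if len(weights) != len(seq):  # assumption: one value per residue
            raise ValueError(f"{header[1:]}: {len(weights)} values for {len(seq)} residues")
        protomers[header[1:]] = {
            "seq": seq,
            "weights": weights,
            # Positions with a non-zero value may be picked for mutation.
            "designable": [i for i, w in enumerate(weights) if w > 0],
        }
    return protomers


toy = ">A\nMKT\n1,0,1\n>B\nGS\n0,1\n"  # toy records, not real designs
config = parse_af2h(toy)
print(config["A"]["designable"], config["B"]["designable"])
```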

Example template alignment for tmalign loss

Remove any TER records from the template PDB file. In the alignment file (a FASTA file passed with --template_alignment), model1 is the template given with --template, and model2 must have the length of the protomer to be designed. Do not change these names or their order. The example below designs a 130 amino acid protein with motifs placed at the N- and C-termini (the sequence given here is arbitrary).

>model1
RSMSWDNEVAFN-----------------------------------------------------
----------------------------------------------------QHHLGGAKQAGAV

>model2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
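
Before starting a run, an alignment file like the one above can be sanity-checked. This is a hedged sketch: the function names are hypothetical, and the rules it enforces (records named model1 then model2, equal aligned lengths, model2 as long as the protomer) are inferred from the description above rather than taken from the code.

```python
def read_alignment(text):
    """Read FASTA records, joining sequence lines that are wrapped over several lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def check_template_alignment(text, protomer_length):
    """Validate the alignment and return the aligned (non-gap) template columns."""
    records = read_alignment(text)
    if [name for name, _ in records] != ["model1", "model2"]:
        raise ValueError("records must be named model1 and model2, in that order")
    template, design = records[0][1], records[1][1]
    if len(template) != len(design):
        raise ValueError("aligned sequences differ in length")
    if len(design.replace("-", "")) != protomer_length:
        raise ValueError("model2 does not match the protomer length")
    return [i for i, aa in enumerate(template) if aa != "-"]


toy = ">model1\nRSM--\n--QHH\n\n>model2\nAAAAA\nAAAAA\n"  # toy alignment
motif_columns = check_template_alignment(toy, protomer_length=10)
print(motif_columns)
```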

Citing this work

If you use the code, please cite:

@article {HAL2022,
    author = {B. I. M. Wicky  and L. F. Milles  and A. Courbet  and R. J. Ragotte  and J. Dauparas  and E. Kinfu  and S. Tipps  and R. D. Kibler  and M. Baek  and F. DiMaio  and X. Li  and L. Carter  and A. Kang  and H. Nguyen  and A. K. Bera  and D. Baker },
    title = {Hallucinating symmetric protein assemblies},
    journal = {Science},
    pages = {eadd1964},
    doi = {10.1126/science.add1964},
    URL = {https://www.science.org/doi/abs/10.1126/science.add1964},
}

Acknowledgements

This work was made possible by the following separate libraries and packages:

We thank all their contributors and maintainers!

Get in touch

Questions and comments are welcome: