generatebio / chroma

A generative model for programmable protein design
Apache License 2.0

Example on how to condition on Sequence (and Structure) #37

Closed fabiotrovato closed 8 months ago

fabiotrovato commented 9 months ago

Hi,

I would like to use Chroma for the following experiment. Suppose I have protein pdb 1XYZ. I would like to condition my protein design on (1) the structure and (2) the sequence. More specifically:

I know how to condition on the structure, as explained in one of the notebooks. Can anyone provide an example of how to tell Chroma to not change the residue identity for residues at positions [1,2,3,10]?

Thank you for your support and great work, Fabio

aismail3-gnr8 commented 9 months ago

For keeping residues unchanged during sequence design, try the design_selection argument, which accepts either selection strings or mask tensors. For instance:

from chroma import Chroma, Protein

chroma = Chroma()
protein = Protein("5SV5", device="cuda")
# design every residue except 1-10, which keep their original identity
chroma.design(protein, design_selection="not resid 1-10")

You can use design_selection with chroma.sample as well, and it will get passed to the underlying sequence design step.
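
For the residues in the original question ([1,2,3,10]), the selection string can be assembled from a list of residue numbers. The sketch below is only an illustration: the helper names are hypothetical, and it assumes the selection grammar accepts `or`, parentheses, and a single-number `resid` term in addition to the `not` and `resid 1-10` forms shown above, which is worth checking against the Chroma documentation.

```python
def residues_to_ranges(residues):
    """Collapse residue numbers into sorted, inclusive (start, end) runs."""
    ranges = []
    for r in sorted(set(residues)):
        if ranges and r == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], r)  # extend the current run
        else:
            ranges.append((r, r))  # start a new run
    return ranges


def fixed_residues_selection(residues):
    """Build a selection string that designs everything except `residues`.

    Hypothetical helper; the `or` keyword is an assumption about the grammar.
    """
    if not residues:
        raise ValueError("need at least one residue to keep fixed")
    terms = [
        f"resid {a}-{b}" if a != b else f"resid {a}"
        for a, b in residues_to_ranges(residues)
    ]
    return "not (" + " or ".join(terms) + ")"


print(fixed_residues_selection([1, 2, 3, 10]))  # not (resid 1-3 or resid 10)
```

The resulting string would then be passed as design_selection, in the same way as "not resid 1-10" above.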

wujiewang commented 8 months ago

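What you wanted to do requires a two-stage process.

  • Use chroma._sample with SubstructureConditioner. The SubstructureConditioner can take a coordinate clamping mask, as shown in our example notebook, so you can regenerate part of 1XYZ with [1,2,3,10] masked.
  • Then, use chroma.design with the same residue constraint you desire. See Doc on design_selection in chroma.design #5 for example usage for sequence masking.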

Thanks and let us know if you have more questions.

fabiotrovato commented 8 months ago

Thanks for the directions! I am not sure I fully understand how to apply them to my case. Suppose that protein 1XYZ has three chains, each with residues numbered from 1 to 100, according to the pdb (not sure how the residues in the pdb are mapped to chroma).

What I want to achieve is:

I was playing around with the notebook ChromaDemo.ipynb. This is the code that I have come up with for keeping the substructure unchanged.

pdb_id = "1XYZ"  
chain1_length = 100
chain2_length = 100
chain3_length = 100

protein = Protein.from_PDBID(pdb_id, canonicalize=True, device=device)
X, C, _ = protein.to_XCS()
selection_string = "resid 1-300" 
residues_to_design = plane_split_protein(X, C, protein, 0.5).nonzero()[:, 1].tolist()
protein.sys.save_selection(gti=residues_to_design, selname=selection_string)
struct_conditioner = conditioners.SubstructureConditioner(
        protein, backbone_model=chroma.backbone_network, selection=selection_string
    ).to(device)

conditioner = struct_conditioner
infilled_protein, trajectories = chroma.sample(
    chain_lengths=[chain1_length, chain2_length, chain3_length],
    protein_init=protein,
    conditioner=conditioner,
    langevin_factor=4.0,
    langevin_isothermal=True,
    inverse_temperature=8.0,
    steps=500,
    sde_func="langevin",
    full_output=True,
)

I don't know how the PDB residues are mapped internally in Chroma, so my first question is: do I have to use selection_string = "resid 1-300" or selection_string = "resid 1-100"?

Regarding the sequence masking, if I had one chain and residues [1,2,3,10] to mask, I would do:

design_mask = torch.Tensor([1] * 3 + [0] * 6 + [1] * 1  + [0] * 90)[None].cuda()
protein = chroma.sample(chain_lengths=[chain1_length], design_selection=design_mask)
print( protein.sequence() )

Is the code snippet above correct for the case of one chain?

I am not sure how I should modify the above snippet for the case of three chains, since I don't know how the PDB residues are mapped to Chroma. Can you please clarify?

Best, Fabio

NatureGeorge commented 8 months ago

I have a working script that does exactly the following:

@wujiewang What you wanted to do requires a two stage process.

  • Use chroma._sample with SubstructureConditioner. The SubstructureConditioner can take a coordinate clamping mask, as shown in our example notebook, so you can regenerate part of 1XYZ with [1,2,3,10] masked.
  • Then, use chroma.design with the same residue constraint you desire. See Doc on design_selection in chroma.design #5 for example usage for sequence masking.

Thanks and let us know if you have more questions.

from chroma import Chroma, Protein, conditioners

chroma = Chroma()
protein = Protein('3HSF.pdb', device='cuda:0')      # NOTE: change to your PDB file
residues_to_design = list(range(67,84))             # NOTE: change to your desired region
protein.sys.save_selection(gti=residues_to_design, selname="infilling_selection")
# clamp the backbone everywhere except the infilling region
str_conditioner = conditioners.SubstructureConditioner(
        protein=protein,
        backbone_model=chroma.backbone_network,
        selection = 'not namesel infilling_selection').to('cuda:0')

infilled_protein = chroma._sample(                  # NOTE: use `chroma._sample` instead of `chroma.sample`; the former keeps the amino-acid sequence fixed
             protein_init=protein,
             conditioner=str_conditioner,
             langevin_factor=4.0,
             langevin_isothermal=True,
             inverse_temperature=8.0,
             sde_func='langevin',
             steps=500)

# re-register the selection on the new protein, then redesign only that region's sequence
infilled_protein.sys.save_selection(gti=residues_to_design, selname="infilling_selection")
infilled_protein = chroma.design(infilled_protein, design_selection='namesel infilling_selection')

display(infilled_protein)                           # notebook-only helper

That is it.

Also, I thought conditioners.SubsequenceConditioner would be suitable for this job, but it does not seem to work.

wujiewang commented 8 months ago

Thanks @NatureGeorge for this nice example! We should add this to the Chroma cookbook.

@fabiotrovato you might want to follow this; note that you need to call ._sample() and .design() separately.

The SubsequenceConditioner uses classifier-guidance-style conditioning, so it applies a "soft" form of sequence conditioning. @NatureGeorge's script enforces hard conditioning on the model directly.

fabiotrovato commented 8 months ago

Hi @NatureGeorge and @wujiewang , thanks for your examples and explanations. What I was asking is a bit more complex and I am not sure I fully understand if the example by @NatureGeorge applies to my case.

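Please see my last post. A brief summary of that post:

  • My protein has three chains.
  • For any of the three chains, the residues for structure (1 to 100) and sequence ([1,2,3,10]) conditioning are different. Additionally, I want to specify the same residues for structure and sequence conditioning for all 3 chains.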

Thanks again.

wujiewang commented 8 months ago

  • My protein has three chains.

You can just use a design mask if that is easier. If you want to clamp residues [1,2,3,10], you just set the corresponding positions to 1. You can also use a selection string, but I am less familiar with the grammar.

Example:

import torch
complex_mask = []

# one mask per chain (three chains, 200 positions each in this example)
for _ in range(3):
    mask = torch.zeros(1, 200).to(torch.long)
    # tensor indices are 0-based, so these mark the 2nd, 3rd, 4th and 11th positions
    mask[:, torch.LongTensor([1, 2, 3, 10])] = 1

    complex_mask.append(mask)

# concatenate the per-chain masks along the residue dimension
complex_mask = torch.cat(complex_mask, dim=-1)

  • For any of the three chains, the residues for structure (1 to 100) and sequence ([1,2,3,10]) conditioning are different. Additionally, I want to specify the same residues for structure and sequence conditioning for all 3 chains.

I understand that the mask will be different for sequence and structure conditioning. So you can just specify the structure conditioning (1 to 100) and sequence conditioning ([1,2,3,10]) separately the same way @NatureGeorge did. It is probably easier if you just specify the mask as a binary long Tensor.

Thanks for the feedback! Let us know if this makes more sense.
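
A minimal, library-free sketch of the mask bookkeeping for several chains follows. It assumes each chain is numbered 1 to its length without gaps (real PDB numbering may differ) and converts 1-based residue numbers to 0-based indices explicitly. The helper name is hypothetical. Which value means "fixed" depends on where the mask is used, so both the mask and its complement are built; it is worth checking the docstring of the argument that receives it.

```python
def build_residue_mask(chain_lengths, marked_residues):
    """Return a flat 0/1 list over all chains, concatenated in chain order.

    chain_lengths:   residues per chain, e.g. [100, 100, 100]
    marked_residues: one list of 1-based residue numbers per chain
    """
    if len(chain_lengths) != len(marked_residues):
        raise ValueError("need one residue list per chain")
    mask = []
    for length, residues in zip(chain_lengths, marked_residues):
        chain_mask = [0] * length
        for resnum in residues:
            if not 1 <= resnum <= length:
                raise ValueError(f"residue {resnum} outside 1..{length}")
            chain_mask[resnum - 1] = 1  # 1-based residue number -> 0-based index
        mask.extend(chain_mask)
    return mask


# residues 1, 2, 3 and 10 marked in each of three 100-residue chains
marked = build_residue_mask([100, 100, 100], [[1, 2, 3, 10]] * 3)
complement = [1 - v for v in marked]  # the opposite convention

print(sum(marked), sum(complement))  # 12 288
```

Wrapping either list as `torch.tensor(mask)[None]` would give the `(1, num_residues)` shape used in the example above.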

aismail3-gnr8 commented 8 months ago

Agreed with @wujiewang and thanks for the example @NatureGeorge!

@fabiotrovato, you can specify different selections for the structure step and the sequence step. These are the selection argument of the SubstructureConditioner constructor and the design_selection argument of the chroma.design method, respectively, in @NatureGeorge's example.

I'm a little unclear on something in your previous message, though:

the residues that should be structurally unchanged are 1 to 100 for chain 1. I want residues 1 to 100 to be structurally unchanged for chains 2 and 3, as well.

You're not trying to leave all the backbone coordinates the same and just redesign the sequence, right? That'd just be using the chroma.design method alone.

fabiotrovato commented 8 months ago

Hi @aismail3-gnr8, my post was misleading, since I described the particular case of leaving all backbone atoms unchanged and performing sequence design on a subset of residues, but that's not the general case I have in mind. In the general case, I only want to condition the structure of a subset of residues (which can be different from the sequence mask).

Hope this clarifies a bit more my intentions. Thanks!

aismail3-gnr8 commented 8 months ago

Great! In this case, you can just modify @NatureGeorge's very nice example above. To be explicit, here's a snippet that redesigns the structure of a subset of one chain and the sequence of a subset of another chain.

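from chroma import Chroma, Protein, conditioners

chroma = Chroma()
protein = Protein("1XYZ", device="cuda")
# clamp the backbone everywhere except chain A, resid 30-60
str_conditioner = conditioners.SubstructureConditioner(
    protein=protein,
    backbone_model=chroma.backbone_network,
    selection="not (chain A and resid 30-60)",
).to("cuda")

# structure step: resample the unclamped backbone, keeping the sequence
new_protein = chroma._sample(
    protein_init=protein,
    conditioner=str_conditioner,
    langevin_factor=4.0,
    langevin_isothermal=True,
    inverse_temperature=8.0,
    sde_func="langevin",
    steps=500,
)

# sequence step: redesign only chain B, resid 30-60
new_protein = chroma.design(new_protein, design_selection="chain B and resid 30-60")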

Here's a check of the sequence redesign. Note that sometimes you get the same residue as in the original protein.

X_old, C_old, S_old = protein.to_XCS(all_atom=True)
X_new, C_new, S_new = new_protein.to_XCS(all_atom=True)
# C_old.abs() == 2 selects chain B
# within chain B, index 29 corresponds to resid 30, so the slice 19:39
# covers resid 20-39: ten residues before the designed region and its first ten
torch.isclose(S_old, S_new)[C_old.abs() == 2][19:39]
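
The same check can be written on plain sequence strings. This is a minimal sketch with hypothetical helper names and toy sequences; it assumes both sequences cover the same chain and that numbering starts at 1.

```python
def changed_positions(seq_old, seq_new):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_old) != len(seq_new):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_old, seq_new)) if a != b]


def changes_within(seq_old, seq_new, start, end):
    """True if every changed position lies in the inclusive range start..end."""
    return all(start <= p <= end for p in changed_positions(seq_old, seq_new))


# toy sequences, not taken from any real protein
old = "MKTAYIAKQR"
new = "MKTGYIVKQR"

print(changed_positions(old, new))     # [4, 7]
print(changes_within(old, new, 3, 8))  # True
```

An empty list from changed_positions for a chain that was not in the design selection would confirm that its sequence was left alone.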

To check the structure redesign, first we need to standardize the origin. To do this, we translate both structures by the location of a particular undesigned residue so that it's at the origin in both structures. Then, we can check some coordinates.

# C > 0 selects residues for which we have structural information
X_old_standardized = X_old - X_old[0,25,0].expand(X_old.shape) * (C_old > 0)[:, :, None, None].expand(X_old.shape)
X_new_standardized = X_new - X_new[0,25,0].expand(X_new.shape) * (C_new > 0)[:, :, None, None].expand(X_new.shape)
# residues inside the redesigned region have moved, so expect False
print(torch.isclose(X_old_standardized[0,29:39,0,0], X_new_standardized[0,29:39,0,0]))
# residues outside the redesigned region have not moved, so expect True
print(torch.isclose(X_old_standardized[0,400:410,0,0], X_new_standardized[0,400:410,0,0]))
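
The translate-then-compare idea can also be sketched without tensors. The helpers and the tolerance below are hypothetical, the coordinates are toy values, and, like the snippet above, the sketch only removes a translation (not a rotation) before comparing.

```python
import math


def center_on(coords, ref_index):
    """Translate coordinates so that the reference point sits at the origin."""
    rx, ry, rz = coords[ref_index]
    return [(x - rx, y - ry, z - rz) for x, y, z in coords]


def moved_residues(coords_old, coords_new, ref_index, tol=1e-3):
    """Return 0-based indices whose position differs by more than `tol`
    after both structures are centred on the same reference residue."""
    old = center_on(coords_old, ref_index)
    new = center_on(coords_new, ref_index)
    return [i for i, (p, q) in enumerate(zip(old, new)) if math.dist(p, q) > tol]


# toy data: the new structure is shifted by (5, 5, 5), and index 2 also moves by 1 in y
old_xyz = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
new_xyz = [(5, 5, 5), (6, 5, 5), (7, 6, 5), (8, 5, 5)]

print(moved_residues(old_xyz, new_xyz, ref_index=0))  # [2]
```

The reference index should point at a residue that was clamped, as in the snippet above, otherwise every residue will appear to have moved.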

I hope this helps! Please let us know if you have any more questions.

fabiotrovato commented 8 months ago

Thanks @aismail3-gnr8 (and everyone else). I tried your code on my protein and the results are great.

selection="not (chain A and resid 30-60)" correctly leaves the backbone of all residues unchanged (or nearly so), except for residues 30 to 60 of chain A. The latter have a different conformation but the same sequence, as intended.

new_protein = chroma.design(new_protein, design_selection="chain B and resid 30-60") has the effect of leaving the sequences of chains A and C unchanged. The sequence changes occur in chain B, residues 30 to 60.

Thanks for your help, Fabio