Closed fabiotrovato closed 8 months ago
For keeping residues unchanged during sequence design, try the design_selection
argument, which accepts either selection strings or mask tensors. For instance:
chroma = Chroma()
protein = Protein("5SV5", device="cuda")
chroma.design(protein, design_selection="not resid 1-10") # keeps residues 1-10 fixed
You can use design_selection with chroma.sample as well, and it will get passed to the underlying sequence design step.
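For the mask-tensor form, here is a minimal sketch of how such a mask could be built from residue numbers. It assumes that residues are numbered contiguously from 1, so that resid r sits at index r - 1, and that a 1 marks a position that may be redesigned (mirroring the selection string above, which selects the residues to design). The helper name `design_mask` is hypothetical and not part of Chroma.

```python
def design_mask(chain_length, fixed_resids):
    """Build a 0/1 list for one chain: 1 = may be redesigned, 0 = keep identity.

    Assumes 1-based, gap-free residue numbering (resid r -> index r - 1).
    """
    fixed = {r - 1 for r in fixed_resids}
    for i in fixed:
        if not 0 <= i < chain_length:
            raise ValueError(f"resid {i + 1} is outside 1..{chain_length}")
    return [0 if i in fixed else 1 for i in range(chain_length)]


# Keep residues 1, 2, 3 and 10 of a 100-residue chain; design everything else.
mask = design_mask(100, [1, 2, 3, 10])
print(mask[:12])
```

To pass this to Chroma, the list would presumably be wrapped as a batched tensor of shape (1, length) on the model's device, for example with `torch.tensor(mask)[None]`.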
What you wanted to do requires a two-stage process.
- Use chroma._sample with SubstructureConditioner. The SubstructureConditioner can take a coordinate clamping mask as shown in our example notebook, so you can regenerate part of 1XYZ with [1,2,3,10] masked.
- Then, use chroma.design with the same residue constraint you desire. See #5 for example usage for sequence masking.
Thanks and let us know if you have more questions.
Thanks for the directions! I am not sure I fully understand how to apply them to my case. Suppose that protein 1XYZ has three chains, each with residues numbered from 1 to 100, according to the pdb (not sure how the residues in the pdb are mapped to chroma).
What I want to achieve is:
I was playing around with the notebook ChromaDemo.ipynb. This is the code that I have come up with for keeping the substructure unchanged.
pdb_id = "1XYZ"
chain1_length = 100
chain2_length = 100
chain3_length = 100
protein = Protein.from_PDBID(pdb_id, canonicalize=True, device=device)
X, C, _ = protein.to_XCS()
selection_string = "resid 1-300"
residues_to_design = plane_split_protein(X, C, protein, 0.5).nonzero()[:, 1].tolist()
protein.sys.save_selection(gti=residues_to_design, selname=selection_string)
struct_conditioner = conditioners.SubstructureConditioner(
protein, backbone_model=chroma.backbone_network, selection=selection_string
).to(device)
conditioner = struct_conditioner
infilled_protein, trajectories = chroma.sample(
chain_lengths=[chain1_length, chain2_length, chain3_length],
protein_init=protein,
conditioner=conditioner,
langevin_factor=4.0,
langevin_isothermal=True,
inverse_temperature=8.0,
steps=500,
sde_func="langevin",
full_output=True,
)
I don't know how the PDB residues are mapped internally in chroma, so my first question is: do I have to use selection_string = "resid 1-300" or selection_string = "resid 1-100"?
Regarding the sequence masking, if I had one chain and residues [1,2,3,10] to mask, I would do:
design_mask = torch.Tensor([1] * 3 + [0] * 6 + [1] * 1 + [0] * 90)[None].cuda()
protein = chroma.sample(chain_lengths=[chain1_length], design_selection=design_mask)
print( protein.sequence() )
Is the code snippet ^^^ correct for the case of 1 chain?
I am not sure how I should modify the above snippet for the case of 3 chains, since I don't know how the PDB residues are mapped to chroma. Can you please clarify?
Best, Fabio
I have a working script that does exactly the following:
@wujiewang What you wanted to do requires a two-stage process.
- Use chroma._sample with SubstructureConditioner. The SubstructureConditioner can take a coordinate clamping mask as shown in our example notebook, so you can regenerate part of 1XYZ with [1,2,3,10] masked.
- Then, use chroma.design with the same residue constraint you desire. See Doc on design_selection in chroma.design #5 for example usage for sequence masking.
Thanks and let us know if you have more questions.
from chroma import Chroma, Protein, conditioners

chroma = Chroma()
protein = Protein('3HSF.pdb', device='cuda:0') # NOTE: change to your PDB file
residues_to_design = list(range(67,84)) # NOTE: change to your desired region
protein.sys.save_selection(gti=residues_to_design, selname="infilling_selection")
str_conditioner = conditioners.SubstructureConditioner(
protein=protein,
backbone_model=chroma.backbone_network,
selection = 'not namesel infilling_selection').to('cuda:0')
infilled_protein = chroma._sample( # NOTE: use `chroma._sample` instead of `chroma.sample`; the former keeps the amino-acid sequence fixed
protein_init=protein,
conditioner=str_conditioner,
langevin_factor=4.0,
langevin_isothermal=True,
inverse_temperature=8.0,
sde_func='langevin',
steps=500)
infilled_protein.sys.save_selection(gti=residues_to_design, selname="infilling_selection")
infilled_protein = chroma.design(infilled_protein, design_selection='namesel infilling_selection')
display(infilled_protein)
That is it.
Besides, I thought conditioners.SubsequenceConditioner would be suitable for this job, but it does not seem to work.
Thanks @NatureGeorge for this nice example! We should add this to Chroma cookbook.
@fabiotrovato you might want to follow this; note that you need to use ._sample() and .design() separately.
The SubsequenceConditioner is a classifier-guidance-style conditioner, so it performs a "soft" form of sequence conditioning. @NatureGeorge's script enforces hard conditioning on the model directly.
Hi @NatureGeorge and @wujiewang , thanks for your examples and explanations. What I was asking is a bit more complex and I am not sure I fully understand if the example by @NatureGeorge applies to my case.
Please see my last post. A brief summary of that post:
Thanks again.
- My protein has three chains.
You can just use a design mask if that is easier. If you want to clamp residues [1,2,3,10], you just set the corresponding entries to 1. You can also just use a selection string, but I am less familiar with the grammar.
example:
import torch
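complex_mask = []
for _ in range(3):
    # one binary mask per chain; tensor indices are 0-based
    mask = torch.zeros(1, 200).to(torch.long)
    mask[:, torch.LongTensor([1, 2, 3, 10])] = 1
    complex_mask.append(mask)
# concatenate the per-chain masks along the residue axis
complex_mask = torch.cat(complex_mask, dim=-1)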
- For any of the three chains, the residues for structure (1 to 100) and sequence ([1,2,3,10]) conditioning are different. Additionally, I want to specify the same residues for structure and sequence conditioning for all 3 chains.
I understand that the mask will be different for sequence and structure conditioning. So you can just specify the structure conditioning (1 to 100) and sequence conditioning ([1,2,3,10]) separately the same way @NatureGeorge did. It is probably easier if you just specify the mask as a binary long Tensor.
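Here is a minimal sketch of keeping the two masks separate for a three-chain complex, written with plain lists so the indexing is easy to inspect. It assumes that the chains are concatenated in order along one axis (as in the `torch.cat` example above), that residue numbering is contiguous and restarts at 1 in every chain, and that a 1 marks a selected residue. `flat_mask`, the variable names, and the chosen ranges are hypothetical.

```python
def flat_mask(chain_lengths, selected_resids):
    """Concatenate per-chain 0/1 masks; the same resids are selected in every chain.

    Assumes 1-based, gap-free numbering that restarts at 1 in each chain.
    """
    mask = []
    for length in chain_lengths:
        chosen = {r - 1 for r in selected_resids if 1 <= r <= length}
        mask.extend(1 if i in chosen else 0 for i in range(length))
    return mask


chain_lengths = [100, 100, 100]

# Residues whose backbone coordinates should be clamped (here: 1 to 50 in every chain).
structure_mask = flat_mask(chain_lengths, range(1, 51))

# Residues whose amino-acid identity should be kept (here: 1, 2, 3 and 10 in every chain).
keep_sequence_mask = flat_mask(chain_lengths, [1, 2, 3, 10])

print(sum(structure_mask), sum(keep_sequence_mask))
```

If, as in the selection-string example earlier in the thread, design_selection marks the residues to redesign, a keep-mask like the second one would have to be inverted before being used there.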
Thanks for the feedback! Let us know if this makes more sense.
Agreed with @wujiewang and thanks for the example @NatureGeorge!
@fabiotrovato, you can specify different selections for the structure step and the sequence step. These are the selection argument of the SubstructureConditioner constructor and the design_selection argument of the chroma.design method, respectively, in @NatureGeorge's example.
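Here is a small sketch of composing the two selection strings from the same building blocks. It only formats strings in the two patterns that appear in this thread (a chain-and-resid range, and its negation with "not (...)"); the helper names are hypothetical, and no grammar beyond those patterns is assumed.

```python
def region(chain, first_resid, last_resid):
    """Selection string for a contiguous residue range on one chain."""
    return f"chain {chain} and resid {first_resid}-{last_resid}"


def everything_but(selection):
    """Negate a selection, e.g. to clamp all residues outside a region."""
    return f"not ({selection})"


# Structure step: clamp everything except the region to be regenerated.
structure_selection = everything_but(region("A", 30, 60))

# Sequence step: redesign only this region.
sequence_selection = region("B", 30, 60)

print(structure_selection)
print(sequence_selection)
```

The first string would go to the selection argument of SubstructureConditioner and the second to the design_selection argument of chroma.design.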
I'm a little unclear on something in your previous message, though:
the residues that should be structurally unchanged are 1 to 100 for chain 1. I want residues 1 to 100 to be structurally unchanged for chains 2 and 3, as well.
You're not trying to leave all the backbone coordinates the same and just redesign the sequence, right? That'd just be using the chroma.design method alone.
Hi @aismail3-gnr8, my post was misleading since I described the particular case of leaving all backbone atoms unchanged and performing sequence design on a subset of residues, but that's not the general case I have in mind. In the general case, I only want to condition the structure of a subset of residues (which can be different from the sequence mask).
Hope this clarifies my intentions a bit more. Thanks!
Great! In this case, you can just modify @NatureGeorge's very nice example above. To be explicit, here's a snippet that redesigns the structure of a subset of one chain and the sequence of a subset of another chain.
protein = Protein("1XYZ", device="cuda")
str_conditioner = conditioners.SubstructureConditioner(
protein=protein,
backbone_model=chroma.backbone_network,
selection="not (chain A and resid 30-60)",
).to("cuda")
new_protein = chroma._sample(
protein_init=protein,
conditioner=str_conditioner,
langevin_factor=4.0,
langevin_isothermal=True,
inverse_temperature=8.0,
sde_func="langevin",
steps=500,
)
new_protein = chroma.design(new_protein, design_selection="chain B and resid 30-60")
Here's a check of the sequence redesign. Note that sometimes you get the same residue as in the original protein.
X_old, C_old, S_old = protein.to_XCS(all_atom=True)
X_new, C_new, S_new = new_protein.to_XCS(all_atom=True)
# C_old.abs() == 2 selects chain B
# in the tensor below, index 29 corresponds to resid 30,
# so [19:39] covers resid 20-39: ten undesigned positions, then the first ten designed ones
torch.isclose(S_old, S_new)[C_old.abs() == 2][19:39]
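An equivalent check on plain sequence strings can be easier to read. This is a sketch that assumes the old and new sequences of one chain are available as equal-length strings and that residues are numbered contiguously from 1; `changed_resids` is a hypothetical helper, and the toy sequences are made up for illustration.

```python
def changed_resids(seq_old, seq_new):
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(seq_old) != len(seq_new):
        raise ValueError("sequences must have the same length")
    return [i + 1 for i, (a, b) in enumerate(zip(seq_old, seq_new)) if a != b]


# Toy data: positions 4-6 were redesigned, but position 5 happened to keep its residue.
old = "MKTAYIAK"
new = "MKTGYLAK"

changed = changed_resids(old, new)
print(changed)

# Every change should fall inside the designed window (resid 4-6 in this toy case).
assert all(4 <= r <= 6 for r in changed)
```

Positions outside the designed window should never appear in the result, while positions inside it may or may not, since the design step can return the original residue.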
To check the structure redesign, first we need to standardize the origin. To do this, we translate both structures by the location of a particular undesigned residue so that it's at the origin in both structures. Then, we can check some coordinates.
# C > 0 selects residues for which we have structural information
X_old_standardized = X_old - X_old[0,25,0].expand(X_old.shape) * (C_old > 0)[:, :, None, None].expand(X_old.shape)
X_new_standardized = X_new - X_new[0,25,0].expand(X_new.shape) * (C_new > 0)[:, :, None, None].expand(X_new.shape)
# residues which have been moved, should get Falses
print(torch.isclose(X_old_standardized[0,29:39,0,0], X_new_standardized[0,29:39,0,0]))
# residues which have not been moved, should get Trues
print(torch.isclose(X_old_standardized[0,400:410,0,0], X_new_standardized[0,400:410,0,0]))
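The same comparison can be phrased as a per-residue displacement after shifting both structures so that a reference atom sits at the origin. This sketch uses plain coordinate tuples (one representative atom per residue) rather than the Chroma tensors; the function name, tolerance, and toy coordinates are assumptions for illustration.

```python
import math


def moved_residues(coords_old, coords_new, ref_index, tol=1e-3):
    """Flag residues whose position changed after aligning a reference residue.

    coords_* are lists of (x, y, z) tuples, one representative atom per residue.
    Only a translation is removed, which suffices when the clamped part is not rotated.
    """
    def shift(coords):
        rx, ry, rz = coords[ref_index]
        return [(x - rx, y - ry, z - rz) for x, y, z in coords]

    old, new = shift(coords_old), shift(coords_new)
    return [math.dist(a, b) > tol for a, b in zip(old, new)]


# Toy data: the new structure is translated by (5, 5, 5); the residue at index 2 also moved.
old = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
new = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (7.0, 6.5, 5.0)]

print(moved_residues(old, new, ref_index=0))
```

The reference residue should be one that was clamped, otherwise every residue will appear to have moved.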
I hope this helps! Please let us know if you have any more questions.
Thanks @aismail3-gnr8 (and everyone else). I tried your code on my protein and the results are great.
selection="not (chain A and resid 30-60)" correctly leaves unchanged (or nearly so) the backbone of all residues except residues 30 to 60 of chain A. The latter have a different conformation but the same sequence, as intended.
new_protein = chroma.design(new_protein, design_selection="chain B and resid 30-60") leaves the sequences of chains A and C unchanged. The sequence changes occur in chain B, residues 30 to 60.
Thanks for your help, Fabio
Hi,
I would like to use Chroma for the following experiment. Suppose I have protein pdb 1XYZ. I would like to condition my protein design on (1) the structure and (2) the sequence. More specifically:
I would like to be able to specify protein residues that should be structurally unchanged.
I also would like to specify regions of the protein sequence that should be unchanged. For example, let's assume that the residues at positions [1,2,3,10] are the residues that I do not want Chroma to "mutate" to a different type (i.e. they have to be the same residue types as in 1XYZ).
I know how to condition on the structure, as explained in one of the notebooks. Can anyone provide an example of how to tell Chroma to not change the residue identity for residues at positions [1,2,3,10]?
Thank you for your support and great work, Fabio