generatebio / chroma

A generative model for programmable protein design
Apache License 2.0

Length-variable design with substructure conditioner #30

Closed Immortals-33 closed 9 months ago

Immortals-33 commented 9 months ago

Dear Chroma team,

Thanks for this amazing work and for open-sourcing the code!

I've been playing around with Chroma for a while and testing some of the conditioners. Thanks for the example you provided. I'm wondering whether the substructure conditioner (more specifically, the infilling task) supports length-variable design, that is, something similar to motif scaffolding around a predefined substructure.

I tried selecting different motifs from different chains and taking the substructure alone, like this:

protein = Protein('./input.pdb', canonicalize=True, device=device)
X, C, S = protein.to_XCS()
chain_B = Protein(X, C==2, S) # Select chain B as predefined substructure
X2, C2, S2 = chain_B.to_XCS()
... (the rest of the example code)

This was a rather naive attempt: when I took the chain_B object as the predefined substructure and then sampled a length-variable protein with chroma.sample(..., protein_init=chain_B, chain_lengths=[$length], ...), it raised an error that the

I went through the issue list and found a similar situation raised by @gha2012 https://github.com/generatebio/chroma/issues/24#issuecomment-1854231143_ , but there seems to be some unexpected behaviour in the sampled backbones.

So I'm wondering whether the length-variable design task is suitable for the Chroma conditioner architecture. If you could share your views or some examples for this case, I would be very grateful.

Many thanks!

wujiewang commented 9 months ago

Hey, thanks for your interest!

Can you say more about the error you encounter?

It should be possible for the infilled chain to have a different length from the template protein. You can specify the SubstructureConditioner and initialize the chains that need to be infilled. The polymer prior will then automatically initialize a random polymer for sampling, conditioned on the motif via the Gaussian conditional formula.
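For readers unfamiliar with the Gaussian conditional formula mentioned above, here is a minimal NumPy sketch of the generic textbook result: given a joint Gaussian and some known components, it returns the mean and covariance of the remaining ones. This is not Chroma's implementation; the function name `condition_gaussian` and its arguments are hypothetical and chosen only for illustration.

```python
import numpy as np

def condition_gaussian(mu, sigma, known_idx, x_known):
    """Condition a joint Gaussian N(mu, sigma) on known components.

    Hypothetical helper for illustration only (not part of Chroma).
    Returns the free indices plus the conditional mean and covariance:
        mu_f|k    = mu_f + S_fk S_kk^{-1} (x_k - mu_k)
        Sigma_f|k = S_ff - S_fk S_kk^{-1} S_kf
    """
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    known_idx = np.asarray(known_idx, dtype=int)
    x_known = np.asarray(x_known, dtype=float)

    # Everything that is not fixed is free to be sampled.
    free_idx = np.setdiff1d(np.arange(mu.shape[0]), known_idx)

    s_ff = sigma[np.ix_(free_idx, free_idx)]
    s_fk = sigma[np.ix_(free_idx, known_idx)]
    s_kk = sigma[np.ix_(known_idx, known_idx)]

    # gain = S_fk S_kk^{-1}, computed with a linear solve (S_kk is symmetric).
    gain = np.linalg.solve(s_kk, s_fk.T).T

    cond_mu = mu[free_idx] + gain @ (x_known - mu[known_idx])
    cond_sigma = s_ff - gain @ s_fk.T
    return free_idx, cond_mu, cond_sigma

# Toy case: two correlated coordinates, the second one is the "motif".
free_idx, cond_mu, cond_sigma = condition_gaussian(
    mu=[0.0, 0.0],
    sigma=[[1.0, 0.5], [0.5, 1.0]],
    known_idx=[1],
    x_known=[2.0],
)
```

In the toy case the free coordinate is pulled toward the fixed one in proportion to their correlation, which is the same idea as initializing the unconstrained residues around a fixed motif.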

Immortals-33 commented 9 months ago

Hi and thanks for the timely response!

Here is the code that produced the error:

# Configure Substructure Conditioner
# (assumes `chroma = Chroma()` and `device` were set up earlier)
import torch
from chroma import Protein, conditioners
from chroma.utility.chroma import plane_split_protein
protein = Protein('./input.pdb', canonicalize=True, device=device)

X, C, S = protein.to_XCS()
residues_to_fix = (C == 2).nonzero(as_tuple=True)[1].tolist() # Fix all residues in chain B
protein.sys.save_selection(gti=residues_to_fix, selname="infilling_selection")

conditioner = conditioners.SubstructureConditioner(
        protein=protein,
        backbone_model=chroma.backbone_network,
        selection = 'namesel infilling_selection').to(device)

# Draw a Sample
torch.manual_seed(2)
infilled_protein = chroma.sample(
             chain_lengths=[100],
             conditioner=conditioner,
             langevin_factor=4.0,
             langevin_isothermal=True,
             inverse_temperature=8.0,
             sde_func='langevin',
             steps=500)

I get the error message RuntimeError: Expected size for first two dimensions of batch2 tensor to be: [1, 496] but got: [1, 400]. I think this is because the original protein has a length of $496 / 4 = 124$ residues, whereas I specified chain_lengths=[100] when sampling, and the two are incompatible.
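The arithmetic behind that reading of the error can be checked in a few lines. This is only a sketch of the interpretation above: it assumes the flattened dimension counts four backbone atoms (N, CA, C, O) per residue, and the helper name `residues_from_flat` is hypothetical.

```python
ATOMS_PER_RESIDUE = 4  # assumption: N, CA, C, O backbone atoms per residue

def residues_from_flat(flat_size, atoms_per_residue=ATOMS_PER_RESIDUE):
    """Recover a residue count from a flattened residue*atom dimension."""
    if flat_size % atoms_per_residue != 0:
        raise ValueError("flat size is not a multiple of atoms per residue")
    return flat_size // atoms_per_residue

expected_residues = residues_from_flat(496)  # size the conditioner expects
sampled_residues = residues_from_flat(400)   # size implied by chain_lengths=[100]
lengths_match = expected_residues == sampled_residues
```

Under that assumption the conditioner's protein has 124 residues while the sampler was asked for 100, so the two tensors cannot be multiplied together.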

Based on your advice, I looked at the SubstructureConditioner class and, if I understood you correctly, it is the Protein object that should be reinitialized for length-variable design. My questions are:

  1. Does this Protein object need to be created outside the class first to make it length-variable (as in the issue I mentioned above), or is there a more structured way to do so?
  2. Should I extract the conditioned substructure region into a new protein, or work directly on the input protein object?

wujiewang commented 9 months ago

Thanks for sharing the code. And yes, what you said sounds right.

First select the motif you want to fix from a source protein, then attach it to a dummy structure of the desired size, which can differ from that of the original protein. How you specify the remaining coordinates does not matter: the SubstructureConditioner will respect the motif and hallucinate the rest of the structure using the size and chain map of that dummy structure. This initialized protein is the input to SubstructureConditioner and fully specifies the constraint.

In your case the protein will contain the motif for chain 2, and you can specify arbitrary coordinates for the rest (random or zero). For sampling, you then no longer need to specify chain_lengths; just pass the conditioner.
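The array bookkeeping for such a dummy structure might look like the NumPy sketch below. It only assembles the X, C, S arrays and the list of fixed residue indices; it does not call Chroma. The helper `build_scaffold_xcs`, its `motif_start` argument, and the assumed shapes (X as `(1, N, 4, 3)`, C and S as `(1, N)`) are illustrative assumptions, not part of the Chroma API.

```python
import numpy as np

def build_scaffold_xcs(X_motif, S_motif, total_length, motif_start, chain_id=1):
    """Place a motif inside a zero-filled dummy backbone of a chosen length.

    Hypothetical helper (not part of Chroma). Assumed shapes:
      X_motif: (L_motif, 4, 3) backbone coordinates
      S_motif: (L_motif,) integer sequence tokens
    Returns X (1, N, 4, 3), C (1, N), S (1, N) and the fixed residue indices.
    """
    X_motif = np.asarray(X_motif, dtype=float)
    S_motif = np.asarray(S_motif, dtype=int)
    n_motif = X_motif.shape[0]
    if motif_start < 0 or motif_start + n_motif > total_length:
        raise ValueError("motif does not fit inside the requested length")

    # Non-motif coordinates are arbitrary; zeros are used here.
    X = np.zeros((1, total_length, 4, 3))
    C = np.full((1, total_length), chain_id, dtype=int)  # single-chain map
    S = np.zeros((1, total_length), dtype=int)

    window = slice(motif_start, motif_start + n_motif)
    X[0, window] = X_motif
    S[0, window] = S_motif

    fixed = list(range(motif_start, motif_start + n_motif))
    return X, C, S, fixed

# Toy motif of 3 residues embedded in a 10-residue dummy chain.
toy_X = np.ones((3, 4, 3))
toy_S = np.array([5, 6, 7])
X_new, C_new, S_new, fixed_idx = build_scaffold_xcs(toy_X, toy_S, 10, 4)
```

Presumably the resulting arrays would then be converted to torch tensors, wrapped in a Protein, and `fixed_idx` passed to `save_selection(gti=..., selname=...)` before building the SubstructureConditioner as in the code earlier in this thread; that hand-off is an assumption and has not been tested here.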

Immortals-33 commented 9 months ago

I think I get the logic now and I'll try to use it to create my own protein using the modified infilling code. Thanks again for your detailed answers to my questions!