RosettaCommons / protein_generator

Joint sequence and structure generation with RoseTTAFold sequence space diffusion
https://huggingface.co/spaces/merle/PROTEIN_GENERATOR
MIT License
210 stars 30 forks source link

`--symmetry` option seems more like a suggestion than a constraint. #5

Closed broomsday closed 1 year ago

broomsday commented 1 year ago

I've been using the --symmetry option and I'm not seeing truly symmetric sequence generation. For instance, below is an output of the first timesteps of a run with the relevant part of the input command shown:

python ./inference.py --contigs A15-64,7,A215-264,7,A415-464,7,A615-664,7,A815-864,7,A1015-1064,7,A1215-1264,7,A1415-1464,7 --symmetry 8

using sampler: default
Loading model checkpoint...
Successfully loaded model checkpoint
Input sequence symmetry 8
Generating sample 000000 ...
/home/broom/.conda/envs/proteingenerator/lib/python3.8/site-packages/dgl/backend/pytorch/tensor.py:445: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  assert input.numel() == input.storage().size(), (
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPGGAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPPGAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGAGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGAAGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAG
    TIMESTEP [25/25]   |   current PLDDT: 0.9536   <<  >>   best PLDDT: 0.9536
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPGALAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGPAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP
TIMESTEP [24/25]   |   current PLDDT: 0.9331   <<  >>   best PLDDT: 0.9536

I appreciate the sequences are a bit annoying with all the alanine, but given the contigs are a 50mer repeated unit with 7mer generated linkers with --symmetry 8, we can decompose that last generated sequence into:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAPGALA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGALG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGAGP
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAP

So the generated bits at the end are AAAGALG for 5/8 generated, but somehow I expected that would be true for all 8. Am I confused about the intention of the --symmetry option?

0merle0 commented 1 year ago

Hey are you providing a PDB input? There can be some bugs in how symmetry gets used with contigs. If you could send me the command and pdb you used I will investigate.

broomsday commented 1 year ago

Thanks.

Yes I am using a PDB. I've attached the PDB here, and the command is below (I changed the input and symmetry a bit to create a less memory intensive and faster to compute variant). BTW, I tested this with and without the symmetry option, and as you can see from the alignment of individual repeats the symmetry option is definitely doing something.

Is it possible that the symmetry option forces the logits to be symmetric but when a final sequence is taken for output it still samples at each position individually?

If the above is what's happening are there arguments that can be passed to avoid this behaviour? I noticed for instance argmax_seq and clamp_seqout which by their names look like the kind of things that might do what I'm imagining, but I'm not sure where to get documentation for any of these.

Command used for symmetry and resulting design sequence alignment of individual repeats:

python ./inference.py --contigs A27-52,32,B27-52,32,C27-52,32,D27-52,32,E27-52,32 --symmetry 5 --num_designs 1 --pdb github_test_case_5mer.pdb --out test/symmetry --save_best_plddt

symmetry

Running the same command but without the --symmetry option gives a result that has clearly lower symmetry in the generated residues:

no_symmetry

Input PDB file attached (gzipped for uploading requirements): github_test_case_5mer.pdb.gz

0merle0 commented 1 year ago

Hey thanks for all the detail, I really appreciate it and it helps a lot. So you are spot on about the way we were handling the output logit space. While at low T with hardly any noise added the model is primed to predict the symmetric sequence, it is not garunteed and since we use the argmaxed output from the model this can result in a non-symmetric sequence like you found. To fix this I added a function that will take the output from a symmetric diffusion trajectory, it will symmetrize the argmaxed sequence and pass it through the network once more treated as a structure prediction task. This will ensure that the backbone and sidechain representations match the seqeuence and are symmetric as they should be. The new function I added is in utils/sampler.py and is called predict_final_symmetric thanks for reaching out and I hope this helps, if you have any other questions let me know!

broomsday commented 1 year ago

Thanks so much for that quick work, I very much appreciate being able to use this tool!

While the fix does now enforce symmetry as expected, there seems to be a minor bug with a potentially large impact in that the new final pass no longer obeys the contigs in terms of e.g. which positions should have structure fixed, and instead it seems like all positions can have structure diffused at that final symmetric pass.

The example I initially gave doesn't suffer from this, maybe because the input structure is already something protein_generator likes, but in a different test case it's quite obvious that even positions supplied in the contigs as e.g. A1-50 now have their structure modified, which was not the case prior to the bug fix.

I haven't used the inpaint_str_tensor argument but will attempt to use that and see if explicitly setting the intended positions as not-masked will help.

Happy to send along an example and command where this is particularly obvious if you need a test case.

0merle0 commented 1 year ago

Ah this could be due to how I set up the templating, if you can send me an example with the problem you are seeing I'll be happy to take a look, I will be away next week but I can try and find some time to play around with it tomorrow

broomsday commented 1 year ago

Great!

OK, this example is a bit ugly, but it represents what I'm actually trying to do and where the problem I'm now encountering lies.

The input is a "pore" made up of individual helices (e.g. a homo-oligomer) that I'm trying to link together into a monomer by generating the linkers with protein_generator. A problem I initially had is that protein_generator would (understandably) place the linkers INSIDE the pore void space. Since I want that space maintained for functional reasons, and because I wasn't sure how, or if, structural potentials could be added, I made a hacky solution by just adding a bunch of disembodied glycine residues inside the pore and set the contigs such that they would be fixed. I've attached the input PDB, and a picture of the structure below.

input input.pdb.gz

The command below is a bit convoluted owing to the contigs needing to specify my aforementioned hacky pore solution. The first part is just generating the linkers around the helices which have 10-fold symmetry, but then I need to put in those glycines that are going to fill the pore and stop the linkers being generated in that space. So I add each glycine individually as its own chain and ensure the total number is such that I can do 10-fold + 1 symmetry (i.e. 11-fold symmetry) since I don't really care what happens to the sequence of the pore-filling glycines which will be removed (I know this is a bit ugly, but it seems to be a legit way to use the contigs to accomplish my goal). The command is below:

python ./inference.py --contigs A184-206,30,A384-406,30,A584-606,30,A784-806,30,A984-1006,30,A1184-1206,30,A1384-1406,30,A1584-1606,30,A1784-1806,30,A1984-2006,30,A2184-2184,0 B1-1,0 B2-2,0 B3-3,0 B4-4,0 B5-5,0 B6-6,0 B7-7,0 B8-8,0 B9-9,0 B10-10,0 B11-11,0 B12-12,0 B13-13,0 B14-14,0 B15-15,0 B16-16,0 B17-17,0 B18-18,0 B19-19,0 B20-20,0 B21-21,0 B22-22,0 B23-23,0 B24-24,0 B25-25,0 B26-26,0 B27-27,0 B28-28,0 B29-29,0 B30-30,0 B31-31,0 B32-32,0 B33-33,0 B34-34,0 B35-35,0 B36-36,0 B37-37,0 B38-38,0 B39-39,0 B40-40,0 B41-41,0 B42-42,0 B43-43,0 B44-44,0 B45-45,0 B46-46,0 B47-47,0 B48-48,0 B49-49,0 B50-50,0 B51-51,0 B52-52,0 --num_designs 1 --out ./output --save_best_plddt --pdb ./input.pdb --symmetry 11 --T 25

When I do this with the previous commit (b678056) I get the expected result where the new linkers have been generated OUTSIDE the pore and the input structure has not been altered (the pore-filling glycines move SLIGHTLY, which I presume is because, while they are fixed during diffusion of the coordinates, they still have their final structure determined by a call to RoseTTAFold?). That image below.

commit_b678056

By contrast, with the new symmetry-bug fix commit (which indeed does fix the sequence symmetry as intended) it seems that the input structure, which I presume the contigs are specifying to have fixed/unmasked structures, are dramatically altered. That image below.

commit_dee901d

I've run these several times (both commits) and the results are quite consistent (though one time with the new commit I did get a result where the pore-filling glycines were moved quite far from input and the rest of the structure was sensible, though with the linkers inside the pore so not usable not fitting what I would expect from the contigs).

In a previous message I mentioned that I would try inpaint_str_tensor in the hopes that it would over-ride whatever is happening and keep the intended residues unmasked structurally, but I couldn't quite figure out how to use that argument on the command-line, since it interpreted everything I entered as a string. I'm guessing I'd need to provide that as part of e.g. args.json and then pass that to inference.py?

0merle0 commented 1 year ago

Hey thanks for all the detail really appreciate it! I spent some time looking through this yesterday, and was able to replicate what you saw here, I think some of this might be due to the fact that although the backbone looks reasonable the sequence does not seem to fit, it was mostly just alanines. You could just redesign a symmetric sequence on the backbone with an inverse folding model like protein MPNN. As for the inpaint_str_tensor that is actually a deprecated argument, sorry about that, I went through and tried to clean up the deprecated args to reduce confusion. And additionally I added an arg predict_symmetric which can be used to enforce symmetry and predict the structure after the last diffusion step (it makes the fix made before optional). I'm going to be on holiday for the next week, but I'm happy to hop on zoom some time and chat if you wanna talk through this in more detail. Hope this helps for now!

broomsday commented 1 year ago

Thanks for the effort. I'll leave the message below for when you return next week:

I likely just don't understand what the inference pipeline actually looks like in terms of how the unmasked structures are treated, but you have a good point about my inputs being wonky. I was only partway through developing a little workflow to generate these inputs and seems like it makes sense to finish that off and see how they behave before trying to troubleshoot further.

Once I have a better sense of how things are behaving with improved input I'd definitely take you up on the zoom offer, as I have a number of questions about various input options.

In the meantime, one key one I wanted to test were the inpaint_str and/or inpaint_seq arguments (the non-deprecated ones), is there anywhere I can see an example of what the input to that looks like?

If there are no examples, and I had contigs that looks like e.g. A1-10,10,A15-19 and I want to mask the seq/str of those final 5 residues of my input, does the argument expect: a) contigs, e.g. A15-19 b) residue numbers/indices as in the final generated output, e.g. 21-25 c) residue numbers/indices as in the input, e.g. 11-15

0merle0 commented 1 year ago

The inpaint_str and inpaint_seq flags can be used just like the contigs to preserve the seq/str in that region and allow the model to redesign. I'm back this week, so if you would like to hop on zoom and discuss further you can reach out to me here: jgershon@uw.edu