dauparas / ProteinMPNN

Code for the ProteinMPNN paper
MIT License

What constitutes a "low quality" backbone? #61

Open bwllc opened 1 year ago

bwllc commented 1 year ago

I am attempting to follow the workflow recommended in the RFDiffusion paper: https://www.biorxiv.org/content/10.1101/2022.12.09.519842v2

I am obtaining low amino acid sequence diversity in my ProteinMPNN outputs. My problem is not as severe as the one shown in an earlier reported issue (https://github.com/dauparas/ProteinMPNN/issues/46), but it is problematic. Here is a typical example. I am executing ProteinMPNN as shown in ProteinMPNN/examples/submit_example_1.sh.

MIYKHAGYYNAKKGKGKGYTFSTGAKGKGYTKRFKKFSVGKGKATDKETLRAMLTLGGIIFEIDKKKKNKWKGYSTDKGLTAGYSTGKGTKALGYQITPNFGVGYAYNKKPYFGVSYQTKDGSVGVGYNFGLRIVSVSYGNPKTGKGAGYSYKA
{ A : 6.5%
  C : 0.0%
  D : 2.6%
  E : 1.3%
  F : 4.5%
  G : 18.2%
  H : 0.6%
  I : 3.9%
  K : 18.2%
  L : 3.9%
  M : 1.3%
  N : 3.9%
  P : 1.9%
  Q : 1.3%
  R : 1.9%
  S : 5.8%
  T : 8.4%
  V : 4.5%
  W : 0.6%
  Y : 10.4% }

There is a surprisingly large number of G and K residues. I also wonder about the high abundance of Y. The calculated isoelectric point is 10.08. I generated 10 candidate sequences from this particular structure. They were all pretty similar to this one.
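For anyone who wants to reproduce this kind of composition table and put a number on "pretty similar", here is a minimal Python sketch. It is not part of ProteinMPNN; the function names (`composition`, `flag_enriched`, `identity`) and the 10% threshold are illustrative choices of mine.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def composition(seq):
    """Return the percent abundance of each standard amino acid in seq."""
    seq = seq.strip().upper()
    if not seq:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    return {aa: 100.0 * counts.get(aa, 0) / len(seq) for aa in AMINO_ACIDS}


def flag_enriched(seq, threshold=10.0):
    """List residues whose abundance exceeds threshold (percent), highest first.

    The 10% default is an arbitrary illustrative cutoff, not a published one.
    """
    comp = composition(seq)
    hits = [aa for aa in AMINO_ACIDS if comp[aa] > threshold]
    return sorted(hits, key=lambda aa: -comp[aa])


def identity(seq_a, seq_b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("empty sequence")
    same = sum(a == b for a, b in zip(seq_a, seq_b))
    return same / len(seq_a)
```

Sequences designed for the same backbone have the same length, so `identity` can be applied pairwise to the candidates without an alignment step.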

A response to the earlier issue report was as follows:

Hello! This might happen if the model is uncertain about the prediction, or the input backbone is of low quality. You could try adding negative alanine bias.

Originally posted by @dauparas in https://github.com/dauparas/ProteinMPNN/issues/46#issuecomment-1497947341

I can of course attempt to apply negative biases to certain amino acids, as recommended in the earlier post. Before I do this, I would like to ask whether there are any criteria we can use to measure the "quality" of input backbones.
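For reference, a bias file of the kind mentioned above might be written as in the sketch below. This rests on assumptions to check against the repository: that the bias file is a single JSON object mapping one-letter amino acid codes to additive bias values, as produced by `helper_scripts/make_bias_AA.py` and read through the `--bias_AA_jsonl` option. The helper name `write_bias_jsonl` and the bias values are hypothetical.

```python
import json

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def write_bias_jsonl(path, biases):
    """Write a per-amino-acid bias dictionary as one JSON object on one line.

    Assumed layout (verify against helper_scripts/make_bias_AA.py):
    {"G": -1.0, "K": -1.0}, where negative values discourage a residue.
    """
    unknown = set(biases) - VALID_AA
    if unknown:
        raise ValueError(f"not standard one-letter codes: {sorted(unknown)}")
    with open(path, "w") as fh:
        fh.write(json.dumps(biases) + "\n")


# Illustrative values only; a suitable magnitude has to be found by trial.
example_bias = {"G": -1.0, "K": -1.0}
```

Biasing only masks the symptom, though, which is why the question of backbone quality still matters.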

My PDB input files are being generated by RFDiffusion. I specify a partial scaffold, and RFDiffusion hallucinates the rest. At least in PyMol, the secondary structures of the RFDiffusion output files look reasonable. The automated secondary structure assignment algorithm in PyMol is identifying regions of alpha helix and beta sheet. That doesn't mean that I don't have issues with my RFDiffusion outputs, but I don't know what to look for.

Thanks for any information you can provide.

FAOlivieri commented 1 year ago

I am having the exact same issue. Even starting from structures of different proteins (a coiled-coil dimer and a helix tetramer), I get sequences that are nearly 30% K or E.

MattMcPartlon commented 1 year ago

@bwllc I know this is a little late, but I've seen this happen when intra-residue geometry is not ideal (my own method, AttnPacker, does this). It's my guess that conserved bond lengths and angles are off.

If you have access to Rosetta, you can run relax with coordinate constraints to fix the geometry while minimizing the RMSD between pre-relaxed and relaxed structures. There is also the [Idealize protocol](https://www.rosettacommons.org/docs/latest/scripting_documentation/RosettaScripts/Movers/movers_pages/IdealizeMover), which is designed for exactly this, but I haven't tried it.

As a first step, you can try running inference with the v_48_020.pt model. If the distribution of AA types looks better, then that's a good indication that this is your issue.

GL

bwllc commented 1 year ago

Thanks for your reply, @MattMcPartlon.

I think you are saying that a force-field relaxation (energy minimization) step sometimes needs to be applied to the output of RFDiffusion before passing it to ProteinMPNN. Do I understand that correctly?

If Rosetta has been open-sourced, I can use its minimizer.

I already have GROMACS, and it also has a relaxation algorithm which I can investigate. I'm not sure if it behaves differently than the Rosetta minimizer. I'm not sure whether that would matter. I could probably specify constraints on some atoms in GROMACS, but that sounds fussy, and I'd prefer to avoid that if I can.

Please let me know if I'm barking up the wrong tree. Thanks.

MattMcPartlon commented 1 year ago

@bwllc That's exactly what I mean :).

Before spending too much time on this, you can check (for example) that consecutive C-alpha atoms are 3.8 Å ± 0.1 Å apart. If you see distances outside of this range, then relaxing with a force field should solve your problem.
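The C-alpha check described above can be sketched in plain Python with no external packages. This assumes a standard fixed-column PDB file and uses only the first model; the function names `ca_coords` and `ca_outliers` are mine, not from any library. Note that a cis peptide bond (most often before proline) legitimately gives a C-alpha distance of about 2.9 Å, so a flagged pair is a prompt to look closer, not proof of an error.

```python
import math


def ca_coords(pdb_path, chain=None):
    """Collect (resnum, chain_id, x, y, z) for each C-alpha ATOM record."""
    coords = []
    with open(pdb_path) as fh:
        for line in fh:
            if line.startswith("ENDMDL"):
                break  # only read the first model
            if not line.startswith("ATOM"):
                continue
            if line[12:16].strip() != "CA":
                continue
            if line[16] not in (" ", "A"):
                continue  # keep only the first alternate location
            if chain is not None and line[21] != chain:
                continue
            coords.append((
                int(line[22:26]), line[21],
                float(line[30:38]), float(line[38:46]), float(line[46:54]),
            ))
    return coords


def ca_outliers(coords, ideal=3.8, tol=0.1):
    """Return (chain, res_i, res_j, distance) for consecutive residue pairs
    whose C-alpha distance differs from ideal by more than tol (in Å)."""
    outliers = []
    for (r1, c1, *p1), (r2, c2, *p2) in zip(coords, coords[1:]):
        if c1 != c2 or r2 != r1 + 1:
            continue  # skip chain boundaries and numbering gaps
        d = math.dist(p1, p2)  # requires Python 3.8 or later
        if abs(d - ideal) > tol:
            outliers.append((c1, r1, r2, round(d, 2)))
    return outliers
```

Usage would look like `ca_outliers(ca_coords("design.pdb"))`, where `design.pdb` is a placeholder file name. An empty list means every consecutive pair fell inside the tolerance.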

I only recommend Rosetta's minimizer because it can explicitly minimize the RMSD between the relaxed and input structures. GROMACS or AMBER should also work fine. Good luck!