Change protein mutation specification convention

dominicrufa commented 4 years ago

we need a more coherent way of specifying a list of allowed mutations than what is described here: https://github.com/choderalab/perses/blob/f679ffba41b21ec2ac69f14091dedaf819ec9deb/perses/rjmc/topology_proposal.py#L1792-L1801

jchodera commented 4 years ago

For the near term, I think we want a way to specify mutations that corresponds to how biologists might think about them. For standard amino acids, a ProteinMutationEngine that takes standard notation for mutations would be optimal:

allowed_mutations = [
   'L99A', # this is a single mutant, and we also are able to double-check that position 99 is actually a Leu first
   'L99A M102Q', # this is a double mutant 
   'M102Q', # another single mutant
   'L99A M102Q Q150I', # a triple mutant
]
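A minimal sketch of how strings like these could be parsed and double-checked against the wild-type residue. The names `PointMutation`, `parse_mutation_string`, and `check_wild_type` are hypothetical and not part of perses, and the `sequence` dict is an assumed stand-in for whatever lookup the real Topology would provide:

```python
import re
from typing import Dict, List, NamedTuple

# One-letter codes for the 20 standard amino acids.
STANDARD_ONE_LETTER = set("ACDEFGHIKLMNPQRSTVWY")


class PointMutation(NamedTuple):
    wild_type: str  # one-letter code expected at the position
    position: int   # residue number as written in the string
    mutant: str     # one-letter code to mutate to


_MUTATION_RE = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def parse_mutation_string(spec: str) -> List[PointMutation]:
    """Parse a whitespace-separated string such as 'L99A M102Q'."""
    mutations = []
    for token in spec.split():
        match = _MUTATION_RE.match(token)
        if match is None:
            raise ValueError(f"Cannot parse mutation {token!r}")
        wild_type, position, mutant = match.groups()
        for code in (wild_type, mutant):
            if code not in STANDARD_ONE_LETTER:
                raise ValueError(f"{code!r} is not a standard amino acid code")
        mutations.append(PointMutation(wild_type, int(position), mutant))
    if not mutations:
        raise ValueError("Empty mutation specification")
    return mutations


def check_wild_type(mutations: List[PointMutation], sequence: Dict[int, str]) -> None:
    """Raise if the residue found at a position is not the stated wild type.

    `sequence` maps residue number -> one-letter code (hypothetical input).
    """
    for mutation in mutations:
        found = sequence.get(mutation.position)
        if found != mutation.wild_type:
            raise ValueError(
                f"Expected {mutation.wild_type} at position {mutation.position}, "
                f"found {found}"
            )
```

For example, `parse_mutation_string('L99A M102Q')` yields two `PointMutation` entries, and `check_wild_type` would reject the list if position 99 were not a Leu.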

Things this notation does not cover:

Nonstandard amino acids, which don't have single-letter representations. We may want to do this for retro-inverso peptides or nonnatural amino acids or terminal groups (e.g. for peptides ordered from anaspec)
Post-translational modifications, such as pTyr, pThr, pSer, or methylated lysines
Covalent inhibitors, etc.
Peptoid libraries

For other applications, we probably want to design different classes, like PeptoidLibraryEngine, or PostTranslationalModificationEngine, or PeptideLibraryEngine, or PolymerReactionEngine that can do different things, like

PeptoidLibraryEngine could use standard peptoid naming nomenclature
PolymerReactionEngine could use SMIRKS strings to represent reactions that modify polymers
Disulfide bonds

cc @zhang-ivy

hannahbrucemacdonald commented 4 years ago

Ok I think that the allowed_mutations = [('L99A')] is sufficient for now.

It doesn't cover the things that you say it doesn't cover, but it's not like we can do these other things in perses at the moment.

I would personally be a fan of [('Lys', '99', 'Ala'), ('Xxx', 100, 'Xxx')] for two reasons. It would be flexible for doing pTyr in future (and also 3 letter codes are much easier to understand personally).

hannahbrucemacdonald commented 4 years ago

I haven't thought about it, but I imagine this would be easiest in terms of force fields too if we can pass in a string. Maybe we can have a hard-coded dict somewhere so you can do either L or Leu for the vanilla amino acids?
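A hard-coded dict of that kind could look roughly like the sketch below. The helper name `to_three_letter` is hypothetical; the table itself is just the standard one-letter/three-letter correspondence for the 20 natural amino acids:

```python
# Standard three-letter -> one-letter codes for the 20 natural amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_three_letter(name: str) -> str:
    """Accept 'L', 'Leu', or 'LEU' and return the upper-case three-letter code."""
    key = name.strip().upper()
    if key in THREE_TO_ONE:
        return key
    if key in ONE_TO_THREE:
        return ONE_TO_THREE[key]
    raise ValueError(f"Unknown amino acid name: {name!r}")
```

With this, `to_three_letter('L')` and `to_three_letter('Leu')` both give `'LEU'`, so either style could be accepted at the input and normalized before anything touches the force field.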

jchodera commented 4 years ago

I would personally be a fan of [('Lys', '99', 'Ala'), ('Xxx', 100, 'Xxx')] for two reasons. It would be flexible for doing pTyr in future (and also 3 letter codes are much easier to understand personally).

The bigger issue is that we need to know how to manipulate the Topology object when we make these mutations. If we want to specify things like Lys and pTyr, we would need to build our own dictionary of these residues.

It would be easier to use the Ligand Expo, which gives us access to any residue that has ever appeared in the PDB via its three-letter code. The whole residue library is available as SDF or SMILES for download, though the naming convention can be a little weird: the amino acid names are standard (e.g. LYS), but phosphotyrosine is PTR.

The other issue is that amino acids include an H- at the N-terminus and -OH at the C-terminus, so we would need to figure out how to process these to make the transformation. It's possible we can just use SMARTS strings to match valid amino acids that can be substituted, and use the matching to figure out what atoms to drop.

Syntax could be something like:

allowed_mutations = [
    [('LEU', 99, 'ALA')], # L99A
    [('LEU', 99, 'ALA'), ('MET', 102, 'GLN')], # L99A M102Q
    [('TYR', 103, 'PTR')], # phosphotyrosine
    [('CYS', 150, 'SNC')], # S-nitroso-cysteine
]

We'd pick up a ton of flexibility, but it would take some time to implement. Is it worth tackling this straight away, or doing something that would just build on what @dominicrufa has already in the short term and limiting ourselves to natural amino acids?
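As a rough illustration, the tuple syntax above could be sanity-checked before any Topology manipulation. Everything here is a sketch: `KNOWN_RESIDUES`, `validate_variant`, and `validate_allowed_mutations` are hypothetical names, and the residue set is a small hand-written stand-in for a real residue library such as Ligand Expo:

```python
from typing import AbstractSet, Sequence, Tuple

# (old residue name, residue number, new residue name)
Mutation = Tuple[str, int, str]

# Assumed residue library: the 20 standard residues plus the two
# nonstandard three-letter codes used in the example above.
KNOWN_RESIDUES = frozenset({
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
    "PTR", "SNC",
})


def validate_variant(
    variant: Sequence[Mutation], known: AbstractSet[str] = KNOWN_RESIDUES
) -> None:
    """Check one variant (a list of point mutations) for obvious mistakes."""
    if not variant:
        raise ValueError("A variant needs at least one mutation")
    seen_positions = set()
    for old, position, new in variant:
        if isinstance(position, bool) or not isinstance(position, int):
            raise TypeError(f"Residue number must be an int, got {position!r}")
        for name in (old, new):
            if name not in known:
                raise ValueError(f"Unknown residue name {name!r}")
        if old == new:
            raise ValueError(f"Mutation at {position} does not change the residue")
        if position in seen_positions:
            raise ValueError(f"Position {position} is mutated more than once")
        seen_positions.add(position)


def validate_allowed_mutations(allowed: Sequence[Sequence[Mutation]]) -> None:
    """Validate every variant in an allowed_mutations list."""
    for variant in allowed:
        validate_variant(variant)
```

Passing the `allowed_mutations` list from the comment above through `validate_allowed_mutations` raises nothing, while an entry such as `('Lys', '99', 'Ala')` would be rejected because the residue number is a string.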

hannahbrucemacdonald commented 4 years ago

Personally I would say limiting ourselves to natural amino acids. There's already been a lot of code written and thought put in and I think it's better to do something incrementally. We can do natural amino acids, demonstrate that perses can do this with a paper and then raise the bar by adding more amino acids or other biological weirdness in the future (and then publish again).

We aren't lacking in cool examples if we limit ourselves to the natural amino acids, and it's not like we've currently got someone actively asking for non-natural amino acids or a particular target that we are desperate to do (that I know of).

Let's keep it simple, but keep the code in a way that it's designed for future improvements. I think using three letter codes as you've shown above, passing round oemols and keeping the mapping functionality as unspecific as possible is doing so.

But also this is @dominicrufa and @zhang-ivy 's project so whatever they think.

jchodera commented 4 years ago

Personally I would say limiting ourselves to natural amino acids.

That would exclude some of @zhang-ivy's applications (which may include peptoids or retro-inverso peptides) and post-translational modifications like pTyr.

There's already been a lot of code written and thought put in and I think it's better to do something incrementally. We can do natural amino acids, demonstrate that perses can do this with a paper and then raise the bar by adding more amino acids or other biological weirdness in the future (and then publish again).

That's what I was thinking for the short term: If we're limiting ourselves to natural amino acids for a first implementation of ProteinMutationEngine, let's just adopt the standard biologist nomenclature (L99A, L99A M102Q) since it fits this application well.

We aren't lacking in cool examples if we limit ourselves to the natural amino acids, and it's not like we've currently got someone actively asking for non-natural amino acids or a particular target that we are desperate to do (that I know of).

We are definitely going to need post-translational modifications (for kinases) and non-natural amino acids soon, but that's why I suggested we could write other specialized PolymerProposalEngine subclasses later to add this functionality.

hannahbrucemacdonald commented 4 years ago

What forcefield do you use for a peptoid?

What about a middle ground of doing 'amino acids that are in amber', then we can include the residues that are in the phosaa10.xml and then that covers the pTyr? And then the input is [('TYR', 103, 'PTR')]

hannahbrucemacdonald commented 4 years ago

Isn't PTR charged? We can't do charge mutations yet anyway?

jchodera commented 4 years ago

We can do charge mutations, but we would introduce an error when using PME. It could still be useful in the near term, and we can always introduce the counterion decoupling later on.

The bigger immediate problem, I just realized, is that only one of the approaches we have discussed above allows us to specify the exact protonation and tautomeric state of the residue to switch to. Until we introduce the ability to randomly select a protonation/tautomeric state and treat them as a single chemical entity, we are required to identify exactly which protonation/tautomeric state we want.

I think this means the only logical choice is @hannahbrucemacdonald 's suggestion of compiling a local dataset of residues indexed by force field residue name (HIS, HID, HIE, LYS, LYP, etc) so we can specify the exact protonation state we want.
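A local dataset indexed by force field residue name might start out as something like the sketch below, which also flags mutations that change the net charge (the PME concern raised above). The table is an illustrative subset only: the names follow the usual Amber conventions for protonation variants, the charges are assumed values that should be checked against the actual force field files, and `net_charge_change` is a hypothetical helper:

```python
from typing import Mapping

# Assumed formal charges for a few Amber-style residue names.
# Illustrative subset only; verify against the force field before use.
FORMAL_CHARGE = {
    "ALA": 0,
    "TYR": 0,
    "HID": 0,   # histidine, proton on N-delta
    "HIE": 0,   # histidine, proton on N-epsilon
    "HIP": 1,   # histidine, doubly protonated
    "LYS": 1,   # lysine, protonated
    "LYN": 0,   # lysine, neutral
    "ASP": -1,
    "ASH": 0,   # aspartic acid, protonated
    "GLU": -1,
    "GLH": 0,   # glutamic acid, protonated
}


def net_charge_change(
    old: str, new: str, charges: Mapping[str, int] = FORMAL_CHARGE
) -> int:
    """Return charge(new) - charge(old) for a residue substitution."""
    for name in (old, new):
        if name not in charges:
            raise ValueError(f"No charge entry for residue {name!r}")
    return charges[new] - charges[old]
```

Under these assumed values, `net_charge_change('HID', 'HIE')` is 0 (a pure tautomer switch), while `net_charge_change('LYS', 'ALA')` is -1 and would be the kind of mutation that needs the counterion treatment mentioned earlier.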

hannahbrucemacdonald commented 4 years ago

@zhang-ivy have you addressed this with your work? If so could you please link the PR and close this issue?

zhang-ivy commented 4 years ago

Not yet, I checked with @dominicrufa and I think we need the residue templates added to openmmforcefields first? Will add this issue to my to-do list though

jchodera commented 4 years ago

Not yet, I checked with @dominicrufa and I think we need the residue templates added to openmmforcefields first?

@zhang-ivy : Titratable residues are already present in openmmforcefields like ff14SB.xml. What else do you need here?

choderalab / perses

Change protein mutation specification convention #638