Integrate OpenFF / OpenFE protein loaders so that entire system has Molecule representations

jchodera commented 1 year ago

We should try to integrate these loaders so that we can eventually use tools like OpenFF or Espaloma to parameterize the receptor/biomolecule as well as the ligand.

richardjgowers commented 9 months ago

Just to leave some breadcrumbs: the openfe approach is to try to follow the PDB-recommended route of matching residue-by-residue mmCIF templates to the raw file to assign bond orders, aromaticity, and formal charges. The repo for this is here: https://github.com/OpenFreeEnergy/pdbinf
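To make the residue-by-residue idea concrete, here is a minimal sketch of name-based template matching in plain Python. It is not the pdbinf API: the `TEMPLATES` dictionary, the `TOY` residue, and the `apply_template` helper are all hypothetical, and the template data is invented for illustration.

```python
# Toy illustration of residue-by-residue template matching.
# Everything here is hypothetical; this is NOT the pdbinf API or real CCD data.

# A template lists bonds by atom name (with bond order) plus nonzero formal charges.
# "TOY" is a made-up, formate-like fragment with hydrogens omitted.
TEMPLATES = {
    "TOY": {
        "bonds": [("C", "O1", 2), ("C", "O2", 1)],
        "charges": {"O2": -1},
    },
}


def apply_template(resname, atom_names, templates=TEMPLATES):
    """Return (bonds, charges) for one residue, matched purely by names."""
    if resname not in templates:
        raise KeyError(f"no template for residue {resname!r}")
    tpl = templates[resname]
    present = set(atom_names)

    # Atoms the template does not know about cannot be assigned any chemistry.
    known = {name for a, b, _ in tpl["bonds"] for name in (a, b)}
    unknown = present - known
    if unknown:
        raise ValueError(f"atoms not in template {resname!r}: {sorted(unknown)}")

    # Keep only the template bonds whose two atoms are both present in the file.
    bonds = [(a, b, order) for a, b, order in tpl["bonds"]
             if a in present and b in present]
    # Every atom gets a formal charge; atoms absent from the charge table get 0.
    charges = {name: tpl["charges"].get(name, 0) for name in atom_names}
    return bonds, charges


bonds, charges = apply_template("TOY", ["C", "O1", "O2"])
```

In this sketch, any atom name the template does not list raises an error, which is roughly what a residue containing atoms from a neighbouring group would look like to a purely name-based matcher.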

An example for loading CDK2 (which features a nonstandard residue) is here: https://github.com/OpenFreeEnergy/pdbinf/blob/main/notebooks/tpo_load.ipynb

I've also played around with the question of whether you can still find and apply the correct template if the monomer or its atoms have incorrect labels: https://github.com/OpenFreeEnergy/pdbinf/blob/main/notebooks/tpo_guessing_demo.ipynb
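As a rough illustration of what label-independent template guessing could look like, the sketch below ignores residue and atom names and compares only element counts. This is an assumption made for illustration, not how the linked notebook works; a real implementation would presumably need graph matching on connectivity. The `TEMPLATE_ELEMENTS` table and `guess_templates` helper are hypothetical.

```python
from collections import Counter

# Hypothetical heavy-atom compositions keyed by made-up template names
# (not real CCD entries).
TEMPLATE_ELEMENTS = {
    "AAA": ["C", "C", "N", "O"],
    "BBB": ["C", "C", "N", "O", "O", "P"],
}


def guess_templates(elements, templates=TEMPLATE_ELEMENTS):
    """Return names of templates whose element counts equal the residue's.

    Residue and atom labels are deliberately ignored, so a mislabelled
    residue can still be matched. Several templates may share a composition,
    so the result is a (sorted) list of candidates rather than one answer.
    """
    target = Counter(elements)
    return sorted(name for name, els in templates.items()
                  if Counter(els) == target)


# Atom order in the file does not matter, only the element counts.
candidates = guess_templates(["O", "P", "C", "N", "C", "O"])
```

Element counts alone cannot distinguish isomers, so this kind of filter would at best narrow the candidate list before a stricter comparison.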

It should currently handle standard AAs, RNA, and DNA and, if you download the chemical component dictionary (or any template), anything which is a "standard" nonstandard component. This all still hinges on the residues being correctly delimited; if, for example, you had a cap that had been merged with the neighbouring residue, this wouldn't be handled well.

The OpenFF approach is to provide SMARTS templates + atom names to Topology.from_polymer_PDB, and it doesn't require correct monomer delimiting. The cons are that it (currently) doesn't have a way to create these templates, and I think it is slower because it's not (ab)using the presence of residues to load molecules.

jchodera commented 9 months ago

> Just to leave some breadcrumbs: the openfe approach is to try to follow the PDB-recommended route of matching residue-by-residue mmCIF templates to the raw file to assign bond orders, aromaticity, and formal charges. The repo for this is here: https://github.com/OpenFreeEnergy/pdbinf

@richardjgowers: As I understand it, the mmCIF templates are the fully protonated, non-polymeric (non-residue) form of each residue, meaning matching must be done based on canonical residue and atom names. Is this the strategy that OpenFE uses?

This is a PDB-recommended approach, but it quickly breaks down when you are dealing with molecules not currently in the chemical component dictionary (CCD), like small molecules of interest. In this case, there may be no canonical naming for the entities.

Could you elaborate on the philosophy behind this approach that would enable someone to deal with small molecules or polymeric residues not currently in the CCD? Is the expectation that the user will provide a local set of additions to the chemical component dictionary, establishing their own canonical residue and atom naming schemes that do not conflict with the official PDB CCD? What happens if the PDB updates to include residue names that clash?
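To make the clash question concrete, here is a minimal sketch of merging a user-supplied set of templates with an official dictionary and refusing to proceed when residue names collide. The `merge_templates` helper and the placeholder dictionaries are hypothetical; nothing here reflects how OpenFE or the PDB actually handle this.

```python
def merge_templates(official, local):
    """Combine two residue-name -> template mappings, rejecting name clashes.

    Raising on a clash (rather than silently letting one side win) makes a
    later official release that reuses a local residue name fail loudly.
    """
    clashes = sorted(set(official) & set(local))
    if clashes:
        raise ValueError(f"local templates clash with official names: {clashes}")
    merged = dict(official)
    merged.update(local)
    return merged


# Placeholder data: the template bodies are stand-in strings, not real entries.
official = {"ALA": "official ALA template", "GLY": "official GLY template"}
local = {"LG1": "user-defined ligand template"}

merged = merge_templates(official, local)
```

A stricter variant could reserve a prefix for local names, but any such convention would still need to be checked against each new official release.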

I don't think this is a bad approach, but I'd love to better understand how the workflow is envisioned to be usable even under ideal circumstances before diving down into the technical details.

choderalab / perses

Integrate OpenFF / OpenFE protein loaders so that entire system has Molecule representations #1182