PDB guesser - Githubissues

aya9aladdin commented 2 years ago

Is your feature request related to a problem?

as part of my gsoc project, I'm working on implementing a PDB-aware guesser to have a more accurate and reliable guesser for PDB files

Describe the solution you'd like

PDB has a rich archive called Chemical Component Dictionary (CCD) that contains references for all residue and small molecule components found in PDB entries (https://www.wwpdb.org/data/ccd). We can use this dictionary to get valuable data about any ligand of interest (elements, charges, bonds, bond order, aromaticity, etc.). To use the CCD in implementing PDBGuesser there is one of two approaches we could take:

1- use pdbeccdutils API, which is a tool used to read CCD files and parse their data in a Component object. The Component object has a ccd_cif_dict attribute which is simply a dictionary for all the data of the ligand that we are interested in. We can develop our guesser on the idea of requesting the CCD file of each residue and then use pdbeccdutils.core.ccd_reader. read_pdb_cif_file to parse the file data into Component, and from component we can get whatever data we are searching for.

import requests
from pdbeccdutils.core import ccd_reader

def download_component(ccd_id):
    response = requests.get(f'http://ftp.ebi.ac.uk/pub/databases/msd/pdbechem_v2/{ccd_id[0]}/{ccd_id}/{ccd_id}.cif')
    cif_path = f'{ccd_id}.cif'

    with open(cif_path, 'wb') as fp:
        fp.write(response.content)

    return cif_path

lysin = download_component('LYS')

ccd_reader_result = ccd_reader.read_pdb_cif_file(lysin)
component = ccd_reader_result.component
#get the atom charges data
print(component.ccd_cif_dict['_chem_comp_atom']['charge'])
#get the bonding data, bond orders, and aromaticity
print(component.ccd_cif_dict['_chem_comp_bond']['atom_id_1'])
print(component.ccd_cif_dict['_chem_comp_bond']['atom_id_2'])
print(component.ccd_cif_dict['_chem_comp_bond']['value_order'])
print(component.ccd_cif_dict['_chem_comp_bond']['pdbx_aromatic_flag'])

output:
['0', '0', '0', '0', '0', '0', '0', '0', '0', '0']
['N', 'N', 'N', 'CA', 'CA', 'CA', 'C', 'C', 'OXT']
['CA', 'H', 'H2', 'C', 'HA2', 'HA3', 'O', 'OXT', 'HXT']
['SING', 'SING', 'SING', 'SING', 'SING', 'SING', 'DOUB', 'SING', 'SING']
['N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N']

My concern with this approach is that I think it could be slow as it takes several steps to reach the information that we are searching for (download file for each residue -> parse it -> get the data). This could be problematic if we are guessing attribute(s) for a huge PDB file.

Describe alternatives you've considered

2- the other approach is to build our local database for the CCD in the mdanalysis package. This approach is the least favorite for me for two reasons:

the CCD size is 380 MB, and after filtering it to get the database of interest for us it can be something around 100 MB, which I think is a huge database for a single guesser
the CCD is updated weekly, so the database will need frequent maintenance

So what do you think is the better way to adopt? or maybe there is another better third option that I didn't think about

IAlibay commented 2 years ago

From a packaging standpoint we absolutely can't ship the library with a 300 MB+ database.

Unfortunately relying on external downloads isn't likely to go very well on clusters or some industry locations either (a lot of which can ban external internet access).

I believe @richardjgowers had some thoughts / working on some of this here. I remember some agreement that maybe doing an external optional package that did rdkit based tooling for getting bonds orders etc.. + storing the database.

IAlibay commented 2 years ago

In the first instance though, I would push for an initial solution that does not rely on the CCD. The standard residues are well defined (except from bond orders and formal charges which can be ambiguous in some cases), so being able to offer accurate information for these would be a massive improvement. Not being able to recognise HETATM entries is not a problem for a first pass at this problem imho.

aya9aladdin commented 2 years ago

In the first instance though, I would push for an initial solution that does not rely on the CCD. The standard residues are well defined (except from bond orders and formal charges which can be ambiguous in some cases), so being able to offer accurate information for these would be a massive improvement. Not being able to recognise HETATM entries is not a problem for a first pass at this problem imho.

I remember when I was writing the proposal for the PDBGuesser my first draft was relying on taking advantage of the organisation of the pdb format. It has well-defined format and nomenclature rules that could be of great advantage + standard residues are limited and easy to define their properties as you mentioned

aya9aladdin commented 2 years ago

I'll begin with guessing types, bonds and masses and pull an initial implementation to see how we could improve it

aya9aladdin commented 2 years ago

So I scanned the CCD file and tried to estimate the size of the actual data that we need from it for the bond, bond orders, aromaticity, and elements guessing (Although I already implemented element guessing based on PDB naming rules, not database). What I'm thinking about is as follows: 1- number of residues in the entire database is approx 38K residue 2- to store the data needed for PDBGuesser, it takes 1 KB on average for each residue, so a total of 38MB for the entire database, which I think is still huge tho 3- each residue entry will be something like that:

LEU="""
  #atname1 #atname2 #bond_order #aromatic flag #attype1 #attype2
  N    CA    SING  N     N      C
  N    H     SING  N     N      C
  N    H2    SING  N     N      H
  CA   C     SING  N     C      C
  CA   CB    SING  N     C      C
  CA   HA    SING  N     C      C
  C    O     DOUB  N     C      O
  C    OXT   SING  N     C      O
  CB   CG    SING  N     C      C
  CB   HB2   SING  N     C      H
  CB   HB3   SING  N     C      H
  CG   CD1   SING  N     C      C
  CG   CD2   SING  N     C      C
  CG   HG    SING  N     C      H
  CD1  HD11  SING  N     C      H
  CD1  HD12  SING  N     C      H
  CD1  HD13  SING  N     C      H
  CD2  HD21  SING  N     C      H
  CD2  HD22  SING  N     C      H
  CD2  HD23  SING  N     C      H
  OXT  HXT   SING  N     O      H

"""

4- we can begin by building the database for just the standard residue (amino acids & nucleic acids) + some recognized special hetero groups, which equates to a total of around 100 residues + the help of some known rules (metals and solvents don't form covalent bonds) + guessing bonds for the rest of residues by the covalent radii distance, then we will have a good first version of bond guessing for PDB What do you think?

MDAnalysis / mdanalysis

PDB guesser #3856

Is your feature request related to a problem?

Describe the solution you'd like

Describe alternatives you've considered