Open aya9aladdin opened 2 years ago
From a packaging standpoint we absolutely can't ship the library with a 300 MB+ database.
Unfortunately relying on external downloads isn't likely to go very well on clusters or some industry locations either (a lot of which can ban external internet access).
I believe @richardjgowers had some thoughts / working on some of this here. I remember some agreement that maybe doing an external optional package that did rdkit based tooling for getting bonds orders etc.. + storing the database.
In the first instance though, I would push for an initial solution that does not rely on the CCD. The standard residues are well defined (except from bond orders and formal charges which can be ambiguous in some cases), so being able to offer accurate information for these would be a massive improvement. Not being able to recognise HETATM entries is not a problem for a first pass at this problem imho.
In the first instance though, I would push for an initial solution that does not rely on the CCD. The standard residues are well defined (except from bond orders and formal charges which can be ambiguous in some cases), so being able to offer accurate information for these would be a massive improvement. Not being able to recognise HETATM entries is not a problem for a first pass at this problem imho.
I remember when I was writing the proposal for the PDBGuesser my first draft was relying on taking advantage of the organisation of the pdb format. It has well-defined format and nomenclature rules that could be of great advantage + standard residues are limited and easy to define their properties as you mentioned
I'll begin with guessing types, bonds and masses and pull an initial implementation to see how we could improve it
So I scanned the CCD file and tried to estimate the size of the actual data that we need from it for the bond, bond orders, aromaticity, and elements guessing (Although I already implemented element guessing based on PDB naming rules, not database). What I'm thinking about is as follows: 1- number of residues in the entire database is approx 38K residue 2- to store the data needed for PDBGuesser, it takes 1 KB on average for each residue, so a total of 38MB for the entire database, which I think is still huge tho 3- each residue entry will be something like that:
LEU="""
#atname1 #atname2 #bond_order #aromatic flag #attype1 #attype2
N CA SING N N C
N H SING N N C
N H2 SING N N H
CA C SING N C C
CA CB SING N C C
CA HA SING N C C
C O DOUB N C O
C OXT SING N C O
CB CG SING N C C
CB HB2 SING N C H
CB HB3 SING N C H
CG CD1 SING N C C
CG CD2 SING N C C
CG HG SING N C H
CD1 HD11 SING N C H
CD1 HD12 SING N C H
CD1 HD13 SING N C H
CD2 HD21 SING N C H
CD2 HD22 SING N C H
CD2 HD23 SING N C H
OXT HXT SING N O H
"""
4- we can begin by building the database for just the standard residue (amino acids & nucleic acids) + some recognized special hetero groups, which equates to a total of around 100 residues + the help of some known rules (metals and solvents don't form covalent bonds) + guessing bonds for the rest of residues by the covalent radii distance, then we will have a good first version of bond guessing for PDB What do you think?
Is your feature request related to a problem?
as part of my gsoc project, I'm working on implementing a PDB-aware guesser to have a more accurate and reliable guesser for PDB files
Describe the solution you'd like
PDB has a rich archive called Chemical Component Dictionary (CCD) that contains references for all residue and small molecule components found in PDB entries (https://www.wwpdb.org/data/ccd). We can use this dictionary to get valuable data about any ligand of interest (elements, charges, bonds, bond order, aromaticity, etc.). To use the CCD in implementing PDBGuesser there is one of two approaches we could take:
1- use pdbeccdutils API, which is a tool used to read CCD files and parse their data in a
Component
object. TheComponent
object has accd_cif_dict
attribute which is simply a dictionary for all the data of the ligand that we are interested in. We can develop our guesser on the idea of requesting the CCD file of each residue and then usepdbeccdutils.core.ccd_reader. read_pdb_cif_file
to parse the file data intoComponent
, and fromcomponent
we can get whatever data we are searching for.My concern with this approach is that I think it could be slow as it takes several steps to reach the information that we are searching for (download file for each residue -> parse it -> get the data). This could be problematic if we are guessing attribute(s) for a huge PDB file.
Describe alternatives you've considered
2- the other approach is to build our local database for the CCD in the mdanalysis package. This approach is the least favorite for me for two reasons:
So what do you think is the better way to adopt? or maybe there is another better third option that I didn't think about