RosettaCommons / tools

Tools for parsing Rosetta source code and scripts for running specific Rosetta applications.
Other
3 stars 1 forks source link

Auto-updating antibody database #76

Closed lqtza closed 5 years ago

lqtza commented 5 years ago

RosettaAntibody Database

RosettaAntibody is a hybrid homology/de novo modeling approach for predicting antibody structure. As such, it is highly dependent on a database for antibody CDR and Framework templates.

The below text is true for the following git shas/branches:

Changes to the C++ code are primarily to read in the updated data files since they have slightly new format. Although I have not yet, I will likely move the location of some of these files from main to tools.

Database Contents

Briefly, the database has four categories of contents:

PDBs

I do not know 100% how the previous PDBs were selected. As of early 2019, we are moving towards automating PDB cultivation. As we are not experts in identifying antibodies from the PDB, we rely on SAbDab and pull the PDBs they have cultivated (with resolution better than 3 Angstroms).

The PDBs are Chothia numbered and this is necessary because most antibody code works under that assumption.

BLAST Databases

BLAST databases are generated for the CDR L1, L2, L3, H1, H2, H3 regions, the FRH and FRL regions, a combined "light-heavy" region for template selection. For the CDRs, these database are separated by length. The FRH/L databases should all be the same length (there are no variable insertions in the framework regions). The "light-heavy" is a single database, but uses the full sequence of the antibody for some reason, so it varies in length.

To generate the databases, sequences are extracted from the PDBs, assuming Chothia numbering. One issue here is how to handle missing residues. The current approach excludes regions if they are missing residues. Region definitions (in the Chothia numbering scheme) are below. Definitions are inclusive so H1 includes residue 26 and 35.

Heavy-chain CDRs

Light-chain CDRs

Frameworks

Light-heavy (orientation)

Info Files

The ranges above are used to extract sequences for the antibody.info, cdr.info, frh.info, frl.info, and frlh.info files. In the older database, some files use and underscore ("_") instead of a period ("."). These info files are used to generate the BLAST database, except for antibody.info, which is used by the C++ grafting code for filtering/appending results.

Metrics for Filtering

Finally, there are several "quality" metrics antibody.cc uses to filter out potential models. Previously, these came from three files: list_bfactor50, comparisons.txt, and outlier_list. I must speculated on the origin of all but one of these files. I know the comparisons.txt files contains OCD (see Marze, Lyskov, and Gray [PEDS, 2016]) values for all pairs of antibodies. These capture the orientation differences and were calculated using Nick's pilot app (packing_angle, I think). I assume the list_bfactor50 labels each CDR region as "true" if that region either has a single (or average) B-factor value above 50. I have no idea about the outlier list -- possibly it was there to exclude antibodies that issues grafting? In the automated version of the antibody database, I do not produce an outlier list.

Conserved Residues

During grafting (grafter.cc line 182-ish), a set of conserved framework residues is used to align the FRH and FRL templates to the orientation template. If these residues are missing from templates, then this grafting step will fail. So, when constructing the database, we check for the presence of the following residues:

Current Status

There is now a single script (create_antibody_db.py) which will:

  1. Download all sub-3.0 Angstrom antibody structures from SAbDab.
  2. Select single Fab from each PDB and truncate to Fv.
  3. Extract sequences for the above regions and check for structural issues.
  4. Generate BLAST database from the info files (generated in step 3).
  5. Calculated OCDs and extract B-factors.

Selecting a single Fv is not done rationally at the moment. In the future, we should select by chain with most resolved regions rather than first reported chain in SAbDab summary file.

The runtime is a bit slow as it loads PDBs into PyRosetta twice (steps 3 & 5).

The script probably can be optimized in a few ways as I initially just sought to replicate the previous database.

Replication of The Previous Database

It was not possible (why?) to perfectly replicate the previous database. These statistics are as of Feb. 15th, 2019. Overall statistics indicate an increase in template sources:

Region Old # New # Overlapping #
All CDRs 1902 2611 1560
FRH 1785 2390 1427
FRL 1577 1832 1111
Orientation 1003 1721 749

The overlapping PDBs could be used to compare sequences and metrics. I found most PDBs agreed. Below I report mismatches and discuss a few reasons for them.

Region Mismatches
CDR H1 30
CDR H2 60
CDR H3 33
CDR L1 11
CDR L2 15
CDR L3 8
FRH 27
FRL 2
Orientation 10

In general, mismatches occur at rates of ~1–2%, but why?

CDR H1 Mismatches

Present in new database, but not in old (3)

These PDBs are missing H1 atoms in the old database (not sure why), but not in the new one. Hence they are now included.

Present in old database, but not in new (18)

These are a mix of antibodies. Spot checking a few: 4jzn, 4jn2, and 1oay have missing atoms in the new H1. This indicates that we are not optimally selecting H/L chains (because these breaks were not previously present).

Present in both, but with different sequences (8)

This is likely due to the presence of multiple antibodies in the PDB. See 3zkx or 5bv7 as examples (I did not inspect all 9). We currently, only select one (though we could do multiple by implementing something like append an A/B/C... to the end of the PDB ID). Previously, all Fvs were extracted for the info file, but only one appears to be present in the PDB (which is a bad mistake because you might BLAST align to one thing and then graft another).

CDR H2 Mismatches

These are quite numerous and it might be because we did not use a consistent H2 definition (or maybe because we used a sequence-based definition that sometimes failed?)

Present in new database, but not in old (1)

Present in old database, but not in new (16)

Present in both, but with different sequences (43)

CDR H3 Mismatches

Present in new database, but not in old (3)

Present in old database, but not in new (20)

Present in both, but with different sequences (11)

CDR L1 Mismatches

Present in new database, but not in old (1)

Present in old database, but not in new (5)

Present in both, but with differnet sequences (5)

CDR L2 Mismatches

Present in new database, but not in old (1)

Present in old database, but not in new (2)

Present in both, but with different sequences (11)

CDR L3 Mismatches

Present in new database, but not in old (1)

Present in old database, but not in new (2)

Present in both, but with different sequences (5)

FRH Mismatches (27)

A few happen when there are surprise insertions (at positions not) expected. This ruins the assumption of a constant FRH, but only in a way, since the surprise insertions are in non-CDR loops (mostly DE, I think). Maybe we should be considering only the beta strands for the FRH/FRL templating? And begin grafting the H4/DE?

Other issues here are that our previous numbering the FRH (around 66) was not identical to actual Chothia numbering.

FRL Mismatches (2)

Orientation Mismatches (10)

Metric mismatches (numerous)

Metrics also do not match 100%. For the OCDs this is due to changes in which chain is pulled from the PDB (when multiple chains are possible) because the antibody structures will vary slightly across chains altering the PCA results and the corresponding orientation metrics. For B-factors this is likely because I do not use the same approach. Currently, I report the average B-factor value across the entire loop (including side-chain atoms). I think the previous approach set true/false if any backbone atom passed a threshold because my current approach does not yield as many outliers.

Summary of grafting performance on ~40 Abs

screen shot 2019-03-06 at 5 31 33 pm

Future Directions

  1. Is it necessary to use the whole light+heavy sequences for the orientation alignment? Is this the best approach?
  2. Should we create a region-based exclusion list (maybe this was the outlier list from before)? There are some crazy, e.g. 4k3d, 5c7x, 5d7s, 5dt1, 4s1q PDBs out there.
lqtza commented 5 years ago

This will now be moved to the main repo in additional_protocol_data.