RosettaAntibody Database
RosettaAntibody is a hybrid homology/de novo modeling approach for predicting antibody structure. As such, it is highly dependent on a database of antibody CDR and framework templates.
The below text is true for the following git shas/branches:
Changes to the C++ code are primarily to read in the updated data files, since they have a slightly different format. I have not done so yet, but I will likely move some of these files from main to tools.
Database Contents
Briefly, the database has four categories of contents:
PDBs: Chothia numbered.
BLAST databases: generated by extracting sequences from the above PDBs.
"Info" files: generated by extracting sequences from the above PDBs. Used for filtering by sequence identity or for selecting additional templates (by randomly matching length).
Metric files: e.g. B-factors or OCDs used as quality filters.
PDBs
I do not know exactly how the previous PDBs were selected. As of early 2019, we are moving towards automating PDB curation. As we are not experts in identifying antibodies from the PDB, we rely on SAbDab and pull the PDBs they have curated (with resolution better than 3 Angstroms).
The PDBs are Chothia numbered. This is necessary because most antibody code works under that assumption.
BLAST Databases
BLAST databases are generated for the CDR L1, L2, L3, H1, H2, and H3 regions, the FRH and FRL regions, and a combined "light-heavy" region for template selection. For the CDRs, these databases are separated by length. The FRH/L databases should all be the same length (there are no variable insertions in the framework regions). The "light-heavy" database is a single database, but it uses the full sequence of the antibody for some reason, so it varies in length.
To generate the databases, sequences are extracted from the PDBs, assuming Chothia numbering. One issue here is how to handle missing residues. The current approach excludes regions if they are missing residues. Region definitions (in the Chothia numbering scheme) are below. Definitions are inclusive, so H1 includes residues 26 and 35.
Heavy-chain CDRs
H1: 26–35
H2: 50–65
H3: 95–102
Light-chain CDRs
L1: 24–34
L2: 50–56
L3: 89–97
Frameworks
FRH: 10–25, 36–39, 46–49, 66–94, 103–109
FRL: 10–23, 35–39, 46–49, 57–66, 71–88, 98–104
Light-heavy (orientation)
Heavy: 5–109
Light: 5–104
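The region definitions above can be written down as data. The following is a minimal sketch, not the code in create_antibody_db.py: the `REGIONS` table, the `HEAVY`/`LIGHT` key names, the `extract_region` helper, and the residue tuple format are all assumptions made for illustration. It only selects residues by Chothia number (insertion codes ride along with their base number) and does not attempt the missing-residue check described above.

```python
# Chothia region definitions copied from the list above (inclusive ranges).
# Key names and this layout are illustrative, not taken from the real script.
REGIONS = {
    "H1": ("H", [(26, 35)]),
    "H2": ("H", [(50, 65)]),
    "H3": ("H", [(95, 102)]),
    "L1": ("L", [(24, 34)]),
    "L2": ("L", [(50, 56)]),
    "L3": ("L", [(89, 97)]),
    "FRH": ("H", [(10, 25), (36, 39), (46, 49), (66, 94), (103, 109)]),
    "FRL": ("L", [(10, 23), (35, 39), (46, 49), (57, 66), (71, 88), (98, 104)]),
    "HEAVY": ("H", [(5, 109)]),  # heavy part of the light-heavy region
    "LIGHT": ("L", [(5, 104)]),  # light part of the light-heavy region
}


def extract_region(residues, region):
    """Return the one-letter sequence of a region, or None if nothing matches.

    `residues` is a hypothetical input: a list of
    (chain, number, insertion_code, one_letter_aa) tuples in chain order.
    """
    chain, ranges = REGIONS[region]
    seq = [
        aa
        for start, end in ranges
        for res_chain, number, _icode, aa in residues
        if res_chain == chain and start <= number <= end
    ]
    return "".join(seq) or None
```

Because the comparison is on the integer residue number only, a residue such as 31A is kept inside H1 without special handling.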
Info Files
The ranges above are used to extract sequences for the antibody.info, cdr.info, frh.info, frl.info, and frlh.info files. In the older database, some files use an underscore ("_") instead of a period ("."). These info files are used to generate the BLAST databases, except for antibody.info, which is used by the C++ grafting code for filtering/appending results.
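Since the CDR BLAST databases are separated by length, the extracted sequences have to be binned before a database is built. A rough sketch of that step follows; the function names and the (pdb_id, sequence) pair format are assumptions for illustration, and the actual layout of the info files is not reproduced here.

```python
from collections import defaultdict


def group_by_length(entries):
    """Group (pdb_id, sequence) pairs by sequence length."""
    groups = defaultdict(list)
    for pdb_id, seq in entries:
        groups[len(seq)].append((pdb_id, seq))
    return dict(groups)


def to_fasta(entries):
    """Format (pdb_id, sequence) pairs as FASTA text, one record per pair."""
    return "".join(f">{pdb_id}\n{seq}\n" for pdb_id, seq in entries)
```

Each length bin could then be written to its own FASTA file and handed to a database builder such as BLAST+'s makeblastdb.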
Metrics for Filtering
Finally, there are several "quality" metrics antibody.cc uses to filter out potential models. Previously, these came from three files: list_bfactor50, comparisons.txt, and outlier_list. I must speculate on the origin of all but one of these files. I know the comparisons.txt file contains OCD (see Marze, Lyskov, and Gray [PEDS, 2016]) values for all pairs of antibodies. These capture the orientation differences and were calculated using Nick's pilot app (packing_angle, I think). I assume list_bfactor50 labels each CDR region as "true" if that region has a single (or average) B-factor value above 50. I have no idea about the outlier list; possibly it was there to exclude antibodies that cause issues during grafting? In the automated version of the antibody database, I do not produce an outlier list.
Conserved Residues
During grafting (grafter.cc line 182-ish), a set of conserved framework residues is used to align the FRH and FRL templates to the orientation template. If these residues are missing from templates, then this grafting step will fail. So, when constructing the database, we check for the presence of the following residues:
Current Status
There is now a single script (create_antibody_db.py) which will:
1. Download all sub-3.0 Angstrom antibody structures from SAbDab.
2. Select a single Fab from each PDB and truncate it to the Fv.
3. Extract sequences for the above regions and check for structural issues.
4. Generate BLAST databases from the info files (generated in step 3).
5. Calculate OCDs and extract B-factors.
Selecting a single Fv is not done rationally at the moment: the script takes the first chain reported in the SAbDab summary file. In the future, we should select the chain with the most resolved regions instead.
The runtime is a bit slow, as the script loads PDBs into PyRosetta twice (steps 3 & 5).
The script can probably be optimized in a few ways, as I initially just sought to replicate the previous database.
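The proposed selection rule (prefer the Fv with the most resolved regions over the first one reported) could look roughly like the sketch below. The candidate dictionary format and the `pick_best_fv` name are hypothetical and are not part of create_antibody_db.py.

```python
def pick_best_fv(candidates):
    """Pick the candidate Fv with the most resolved regions.

    `candidates` is a hypothetical list of dicts with keys "heavy" and
    "light" (chain IDs) and "resolved" (a set of region names that have no
    missing residues). Ties keep the earliest candidate, which matches the
    current first-reported behavior.
    """
    if not candidates:
        return None
    # max() returns the first maximal element, so order breaks ties.
    return max(candidates, key=lambda c: len(c["resolved"]))
```

Keeping the first candidate on ties means the rule only changes the outcome when a later Fv is strictly more complete.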
Replication of The Previous Database
It was not possible (why?) to perfectly replicate the previous database. These statistics are as of Feb. 15th, 2019. Overall statistics indicate an increase in template sources:
| Region | Old # | New # | Overlapping # |
|---|---|---|---|
| All CDRs | 1902 | 2611 | 1560 |
| FRH | 1785 | 2390 | 1427 |
| FRL | 1577 | 1832 | 1111 |
| Orientation | 1003 | 1721 | 749 |
The overlapping PDBs could be used to compare sequences and metrics. I found most PDBs agreed. Below I report mismatches and discuss a few reasons for them.
| Region | Mismatches |
|---|---|
| CDR H1 | 30 |
| CDR H2 | 60 |
| CDR H3 | 33 |
| CDR L1 | 11 |
| CDR L2 | 15 |
| CDR L3 | 8 |
| FRH | 27 |
| FRL | 2 |
| Orientation | 10 |
In general, mismatches occur at rates of ~1–2%, but why?
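The rates can be recomputed from the two tables above. This sketch assumes that each CDR is compared against the "All CDRs" overlap of 1560 PDBs. That choice of denominator is my reading of the tables, not something the text states.

```python
# Overlapping PDB counts and mismatch counts, copied from the tables above.
OVERLAP = {"All CDRs": 1560, "FRH": 1427, "FRL": 1111, "Orientation": 749}
MISMATCHES = {
    "CDR H1": 30, "CDR H2": 60, "CDR H3": 33,
    "CDR L1": 11, "CDR L2": 15, "CDR L3": 8,
    "FRH": 27, "FRL": 2, "Orientation": 10,
}


def mismatch_rates():
    """Return the percent mismatch per region.

    Assumption: every CDR row uses the "All CDRs" overlap as its denominator.
    """
    rates = {}
    for region, count in MISMATCHES.items():
        key = "All CDRs" if region.startswith("CDR") else region
        rates[region] = 100.0 * count / OVERLAP[key]
    return rates


if __name__ == "__main__":
    for region, rate in mismatch_rates().items():
        print(f"{region}: {rate:.1f}%")
```

Under this assumption, CDR H2 has the highest rate and FRL the lowest.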
CDR H1 Mismatches
Present in new database, but not in old (3)
3qxu (multiple Fvs, the old one has a chainbreak)
1yc7 (multiple Fvs, the old one has a chainbreak)
1mvf (multiple Fvs, the old one has a chainbreak)
These PDBs are missing H1 atoms in the old database (not sure why), but not in the new one. Hence they are now included.
Present in old database, but not in new (18)
1ol0 has a chainbreak
4at6 (multiple Fvs, the selected one has a chainbreak)
3c5s (multiple Fvs, the selected one has a chainbreak)
2p46 new PDB has chainbreaks, but old one doesn't?!
2p44 new PDB has chainbreaks, but old one doesn't?!
2p45 new PDB has chainbreaks, but old one doesn't?!
2p42 new PDB has chainbreaks, but old one doesn't?!
2p43 new PDB has chainbreaks, but old one doesn't?!
2p48 new PDB has chainbreaks, but old one doesn't?!
4tuj (multiple Fvs, the selected one has a chainbreak)
4jzn (multiple Fvs, the selected one has a chainbreak)
2p47 new PDB has chainbreaks, but old one doesn't?!
4y5y has a chainbreak in the H1
4k3e (multiple Fvs, the selected one has a chainbreak)
4jn2 (multiple Fvs, the selected one has a chainbreak)
1oay has multiple Fvs in the PDB (new PDB selects VHH with chainbreak)
1oax has multiple Fvs in the PDB (new PDB selects VL only)
1oau has multiple Fvs in the PDB (new PDB selects VHH with chainbreak)
These are a mix of antibodies. Spot checking a few: 4jzn, 4jn2, and 1oay have missing atoms in the new H1. This indicates that we are not optimally selecting H/L chains (because these breaks were not previously present).
Present in both, but with different sequences (8)
1pg7 has multiple Fvs in the PDB
4nzr old PDB has different numbering and is truncated
4krp has multiple Fvs in the PDB
4rgn has multiple Fvs in the PDB
4k7p has the wrong sequence in the old antibody_info file
5bv7 has multiple Fvs in the PDB
4nc1 has the wrong sequence in the old antibody_info file
3zkx has multiple Fvs in the PDB
This is likely due to the presence of multiple antibodies in the PDB. See 3zkx or 5bv7 as examples (I did not inspect all 9). We currently select only one (though we could handle multiple by, for example, appending an A/B/C... to the end of the PDB ID). Previously, all Fvs were extracted for the info file, but only one appears to be present in the PDB. This is a serious mistake, because you might BLAST align to one Fv and then graft another.
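The A/B/C suffix idea mentioned above could be sketched as follows. The `label_fvs` function and the chain-pair input format are hypothetical; the point is only that every Fv gets a distinct ID, so the info-file entry and the grafted structure refer to the same Fv.

```python
from string import ascii_uppercase


def label_fvs(pdb_id, fv_chain_pairs):
    """Give each Fv in a PDB a unique ID by appending A, B, C, ...

    `fv_chain_pairs` is a hypothetical list of (heavy_chain, light_chain)
    chain-ID tuples, one per Fv found in the PDB.
    """
    if len(fv_chain_pairs) > len(ascii_uppercase):
        raise ValueError("more Fvs than available letter suffixes")
    return {
        pdb_id + ascii_uppercase[i]: pair
        for i, pair in enumerate(fv_chain_pairs)
    }
```

For example, a PDB with two Fvs would yield two entries whose IDs end in A and B.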
CDR H2 Mismatches
These are quite numerous and it might be because we did not use a consistent H2 definition (or maybe because we used a sequence-based definition that sometimes failed?)
Present in new database, but not in old (1)
1mie old PDB is missing the heavy chain
Present in old database, but not in new (16)
3c5s (multiple Fvs, the selected one has a chainbreak)
2p46 new PDB has chainbreaks, but old one doesn't?!
2p44 new PDB has chainbreaks, but old one doesn't?!
2p45 new PDB has chainbreaks, but old one doesn't?!
2p42 new PDB has chainbreaks, but old one doesn't?!
2p43 new PDB has chainbreaks, but old one doesn't?!
2p48 new PDB has chainbreaks, but old one doesn't?!
4g6k has a chainbreak in the H2
4g6m has a chainbreak in the H2
3phq has a chainbreak in the H2
2p47 new PDB has chainbreaks, but old one doesn't?!
4y5y has a chainbreak in the H2
1oay has multiple Fvs in the PDB (new PDB selects VHH, with chainbreak)
1oax has multiple Fvs in the PDB (new PDB selects VL only)
1oar has multiple Fvs in the PDB (new PDB selects a broken one)
1oau has multiple Fvs in the PDB (new PDB selects VHH, with chainbreak)
Present in both, but with different sequences (43)
4ydl old H2 sequence runs past 65 (to 67)
1pg7 has multiple Fvs in the PDB
1ol0 old H2 sequence stops at 64, not 65
5dtf old H2 sequence runs past 65 (to 69)
5mje old PDB has different numbering and is truncated
4toy old PDB has different numbering and is truncated
3uyr appears to have been mis-parsed in the old PDB
4od1 has a tricky H3 (appears to have been mis-parsed in the old PDB)
4m3j old H2 sequence misses residues 50-52 (why)
4m3k old H2 sequence misses residues 50-52 (why)
4kph old H2 sequence stops at 64, not 65
4nzr old PDB has different numbering and is truncated
5dmg old H2 sequence stops at 63, not 65
4krn old H2 sequence stops at 62, not 65
4krp has multiple Fvs in the PDB
4ma3 old H2 sequence stops at 64, not 65
4z9k old H2 sequence stops at 63, not 65
3v52 appears to have been mis-parsed in the old PDB
4rgn has multiple Fvs in the PDB
4o51 old H2 sequence stops at 64, not 65
4k7p has the wrong sequence in the old antibody_info file
5j57 old H2 sequence misses residues 50-52 (why)
3v4u appears to have been mis-parsed in the old PDB
4ztp old H2 sequence stops at 64, not 65
4zto old H2 sequence stops at 64, not 65
5dub old H2 sequence stops at 63, not 65
4o4y old H2 sequence stops at 64, not 65
5dt1 appears to have been mis-parsed in the old PDB
4jo3 old H2 sequence stops at 64, not 65
4s1q old PDB has different numbering and is truncated
5bv7 has multiple Fvs in the PDB
4jpv has different numbering in the old PDB (unclear why)
3gk8 old H2 sequence stops at 64, not 65
4fhb old H2 sequence misses residues 50 & 51 (why)
4nc1 has the wrong sequence in the old antibody_info file
5ds8 old H2 sequence stops at 63, not 65
5dsc old H2 sequence stops at 63, not 65
3dv4 old H2 sequence stops at 63, not 65
1flr old H2 sequence stops at 63, not 65
3zkx has multiple Fvs in the PDB
4x7d appears to have been mis-parsed in the old PDB
4x7c old H2 sequence stops at 62, not 65
3nps has different numbering in the old PDB (unclear why)
CDR H3 Mismatches
Present in new database, but not in old (2)
3upc should be excluded (H3 appears unfolded/extended)?
1sjv should be excluded (H3 appears unfolded/extended)?
Present in old database, but not in new (20)
4dqo has a chainbreak in the H3
4od1 has a tricky H3 (appears to have been mis-parsed in the old PDB)
1fh5 has a chainbreak in the H3
1fve has multiple Fvs, the selected one has a chainbreak
4yaq has a chainbreak in the H3
1yzz has a chainbreak in the H3
3mug has a chainbreak in the H3
5hdo has multiple Fvs, the selected one has a chainbreak
2gk0 has multiple Fvs, the selected one has a chainbreak
5dt1 has a chainbreak in the H3
4nrx has multiple Fvs, the selected one has a chainbreak
3u4e has a chainbreak in the H3
1oay has multiple Fvs in the PDB (new PDB selects VHH, with chainbreak)
1oax has multiple Fvs in the PDB (new PDB selects VL only)
1oau has multiple Fvs in the PDB (new PDB selects VHH, with chainbreak)
1zlv has multiple Fvs, the selected one has a chainbreak
1xf3 has multiple residues with 0 occupancy in the H3
1xf4 has multiple residues with 0 occupancy in the H3
3u1s has a chainbreak in the H3
3lh2 has multiple Fvs, the selected one has a chainbreak
Present in both, but with different sequences (11)
1pg7 has multiple Fvs in the PDB
3uyr has a tricky H3 (appears to have been mis-parsed in the old PDB)
4krp has multiple Fvs in the PDB
3v52 has a tricky H3 (appears to have been mis-parsed in the old PDB)
4rgn has multiple Fvs in the PDB
4k7p has the wrong sequence in the old antibody_info file
3v4u has a tricky H3 (appears to have been mis-parsed in the old PDB)
4k3d has an H3 so long there are insertions on 101 and 102
5bv7 has multiple Fvs in the PDB
4nc1 has the wrong sequence in the old antibody_info file
3zkx has multiple Fvs in the PDB
CDR L1 Mismatches
Present in new database, but not in old (1)
4krp is missing its light chain in the old PDB.
Present in old database, but not in new (5)
4jpv no L34, so geometry-checking function fails
1fn4 has multiple Fvs in the PDB (new PDB selects a broken one)
1oay has multiple Fvs in the PDB (new PDB selects VHH, not Fv)
1oar has multiple Fvs in the PDB (new PDB selects a broken one)
1oau has multiple Fvs in the PDB (new PDB selects a broken one)
Present in both, but with different sequences (5)
1pg7 has multiple Fvs in the PDB
4rgn has multiple Fvs in the PDB
4k7p has the wrong sequence in the old antibody_info file
4a6y has multiple Fvs in the PDB
5bv7 has multiple Fvs in the PDB
CDR L2 Mismatches
Present in new database, but not in old (1)
4krp is missing its light chain in the old PDB.
Present in old database, but not in new (3)
1oay has multiple Fvs in the PDB (new PDB selects VHH, not Fv)
1oau has multiple Fvs in the PDB (new PDB selects VHH, not Fv)
1i3g has a 3.4 Angstrom C-N bond (resi 55)
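A chainbreak like the one in 1i3g can be detected by measuring the distance between the backbone C of one residue and the N of the next. The sketch below is not the geometry check used by the script: the 2.0 Angstrom cutoff, the `find_chainbreaks` name, and the input format are assumptions for illustration.

```python
import math

# Assumed cutoff in Angstroms; a normal peptide C-N bond is about 1.33 A.
MAX_CN_DISTANCE = 2.0


def find_chainbreaks(backbone, max_dist=MAX_CN_DISTANCE):
    """Return labels of residues whose C is too far from the next residue's N.

    `backbone` is a hypothetical list of (label, n_xyz, c_xyz) tuples in
    chain order, where each *_xyz is an (x, y, z) tuple in Angstroms.
    """
    breaks = []
    for (label, _n, c_xyz), (_next, next_n, _c) in zip(backbone, backbone[1:]):
        if math.dist(c_xyz, next_n) > max_dist:
            breaks.append(label)
    return breaks
```

With this cutoff, a 3.4 Angstrom C-N distance is reported as a break while a 1.33 Angstrom bond is not.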
Present in both, but with different sequences (11)
1pg7 has multiple Fvs in the PDB
4dcq was numbered differently in the old PDB
5d72 has an atypical L2 that is mis-numbered in the old
3ffd was numbered differently in the old PDB
4rgn has multiple Fvs in the PDB
4k7p has the wrong sequence in the old antibody_info file
2otw was numbered differently in the old PDB
5eor was numbered differently in the old PDB
5d7s has an atypical L2 that is mis-numbered in the old
5c7x has an atypical L2 that is mis-numbered in the old
5bv7 has multiple Fvs in the PDB
CDR L3 Mismatches
Present in new database, but not in old (1)
4krp is missing its light chain in the old PDB.
Present in old database, but not in new (2)
1oay has multiple Fvs in the PDB (new PDB selects VHH, not Fv)
1oau has multiple Fvs in the PDB (new PDB selects VHH, not Fv)
Present in both, but with different sequences (5)
1pg7 has multiple Fvs in the PDB
4rgn has multiple Fvs in the PDB
4k7p has the wrong sequence in the old antibody_info file
5bv7 has multiple Fvs in the PDB
3mlr has a super long L3 that is truncated in the old PDB
FRH Mismatches (27)
A few happen when there are surprise insertions (at unexpected positions). This breaks the assumption of a constant-length FRH, but only partly, since the surprise insertions are in non-CDR loops (mostly DE, I think). Maybe we should be considering only the beta strands for the FRH/FRL templating? And begin grafting the H4/DE?
Another issue here is that our previous numbering of the FRH (around 66) was not identical to actual Chothia numbering.
4ydl numbering differs (66S vs. 66K, but the LS after K is missing in the new Fv because they are numbered by a rare insertion at 66).
4ydj has a super long H3 that is truncated in the old PDB and affects the frh
1pg7 has multiple Fvs in the PDB
5mje has "surprise" insertions at 73 that are "missed"
4toy has a missing DE loop in the old PDB so is misnumbered
3uyr has different numbering in the old PDB (unclear why)
1e4x has multiple Fvs in the PDB
4nzr has a super long H3 that is truncated in the old PDB and affects the frh
4jfx truncated poorly in old DB (starts residue 26)
3v52 has a super long H3 that is truncated in the old PDB and affects the frh
4rgn has multiple Fvs in the PDB
4o51 has different numbering in the old PDB (why)
3v4u has a tricky H3 (appears to have been mis-parsed in the old PDB)
4a6y has multiple Fvs in the PDB
4o4y has different numbering in the old PDB (why)
4jo3 has different numbering in the old PDB (why)
4k3d has a super long H3 that is truncated in the old PDB and affects the frh
4k3e has a super long H3 that is truncated in the old PDB and affects the frh
4s1q has a super long H3 that is truncated in the old PDB and affects the frh
5bv7 has multiple Fvs in the PDB
4grw has a blatantly wrong old sequence...
4fnl has a super long H3 that is truncated in the old PDB and affects the frh
4jpv has different numbering in the old PDB (why)
3gk8 has different numbering in the old PDB (why)
4ma3 has different numbering in the old PDB (why)
3u1s has a super long H3 that is truncated in the old PDB and affects the frh
3zkx has multiple Fvs in the PDB
FRL Mismatches (2)
4rgn has multiple Fvs in the PDB
3lmr has more residues in the new db, because the old light chain is truncated
Orientation Mismatches (10)
3u6r has a residue insertion (L), due to a numbering difference.
1t2q has a residue insertion (L), due to a numbering difference.
1e4x has multiple Fvs in the PDB
1rhh has a residue insertion (L), due to a numbering difference.
3lmr has more residues in the new db, because the old light chain is truncated
2d03 has a residue insertion (L), due to a numbering difference.
1zlv has multiple Fvs in the PDB
3lh2 has multiple Fvs in the PDB
3gk8 has a residue insertion (HL), due to a numbering difference.
2gk0 has multiple Fvs in the PDB
Metric mismatches (numerous)
Metrics also do not match 100%. For the OCDs, this is due to changes in which chain is pulled from the PDB (when multiple chains are possible): the antibody structures vary slightly across chains, altering the PCA results and the corresponding orientation metrics. For B-factors, this is likely because I do not use the same approach. Currently, I report the average B-factor value across the entire loop (including side-chain atoms). I think the previous approach set true/false if any backbone atom passed a threshold, because my current approach does not yield as many outliers.
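The difference between the two B-factor approaches can be illustrated with a small sketch. Both functions and the (atom_name, b_factor) input format are assumptions for illustration. The threshold of 50 is taken from the list_bfactor50 file name, and the "any backbone atom" rule is only my guess at the previous behavior.

```python
BACKBONE = {"N", "CA", "C", "O"}


def mean_bfactor(atoms):
    """Average B-factor over all atoms of a loop (side chains included).

    `atoms` is a non-empty list of (atom_name, b_factor) pairs.
    """
    return sum(b for _name, b in atoms) / len(atoms)


def any_backbone_above(atoms, threshold=50.0):
    """True if any backbone atom has a B-factor above the threshold."""
    return any(b > threshold for name, b in atoms if name in BACKBONE)
```

A loop with a single mobile backbone atom is flagged by the second rule but can stay well under the threshold on average, which would be consistent with the average-based approach yielding fewer outliers.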
Summary of grafting performance on ~40 Abs
53.5% of FR RMSDs are lower.
55.0% of CDR RMSDs are lower.
50.2% of OCDs are lower (all 10 models compared).
Future Directions
Is it necessary to use the whole light+heavy sequences for the orientation alignment? Is this the best approach?
Should we create a region-based exclusion list (maybe this was the outlier list from before)? There are some unusual PDBs out there, e.g. 4k3d, 5c7x, 5d7s, 5dt1, 4s1q.