Closed: ayushkarnawat closed this issue 4 years ago.
Alternatively, instead of optimizing the coordinates (which is quite a slow process), an educated guess of the rotamer positions (i.e., bond lengths, bond angles, and dihedral angles) would likely be good enough for our purposes. Related to #4.
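For instance, such an initial guess can be built by converting internal coordinates (bond length, bond angle, dihedral) directly into Cartesian positions one atom at a time, NeRF-style, with no iterative optimization. The sketch below is illustrative only; the function name and the example values are not taken from this repo or any rotamer library:

```python
import numpy as np

def place_atom(a, b, c, bond_len, bond_angle, dihedral):
    """Place atom D given atoms A, B, C and the internal coordinates
    (C-D bond length, B-C-D bond angle, A-B-C-D dihedral)."""
    a, b, c = map(np.asarray, (a, b, c))
    bc = c - b
    bc /= np.linalg.norm(bc)
    ab = b - a
    ab /= np.linalg.norm(ab)
    # Orthonormal frame at C built from the two preceding bonds.
    n = np.cross(ab, bc)
    n /= np.linalg.norm(n)
    m = np.cross(n, bc)
    # Coordinates of D in that local frame.
    d_local = bond_len * np.array([
        -np.cos(bond_angle),
        np.sin(bond_angle) * np.cos(dihedral),
        np.sin(bond_angle) * np.sin(dihedral),
    ])
    # Rotate into the global frame and translate by C.
    return c + np.column_stack([bc, m, n]) @ d_local
```

Chaining such placements along a side chain, with angles read from a rotamer library, yields guessed coordinates essentially for free.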
Closed via #8
The current data preprocessing pipeline works as follows.
As an example, let's take a look at the variant `TNTY`. The variant gets converted to the following SMILES string:

```
N[C@@]([H])([C@]([H])(O2)C)C(=O)N[C@@]([H])(CC(=O)N)C(=O)N[C@@]([H])([C@]([H])(O)C)C(=O)N[C@@]([H])(Cc1ccc(O)cc1)C(=O)2
```

and has the following 3D coordinates (figure omitted). The respective `BasePreprocessor` instance converts these coordinates into 3 feature sets:

1. Molecular features
2. Adjacency matrix
3. Relative position matrix

NOTE: Usually, the relative positions are computed as separate XYZ offsets between each pair of atoms (shape = `(n_atoms, n_atoms, 3)`). However, for the sake of the visualization, the L2 norm was taken along the last axis; the result is essentially a distance matrix.
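The relationship between the relative position tensor and the visualized distance matrix can be sketched in a few lines of NumPy. The coordinates below are made up for illustration; they are not the TNTY conformer:

```python
import numpy as np

# Hypothetical coordinates for a 4-atom fragment, shape (n_atoms, 3).
coords = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [1.5, 1.5, 0.0],
    [0.0, 1.5, 0.0],
])

# Relative XYZ offsets between every pair of atoms: shape (n_atoms, n_atoms, 3).
rel_pos = coords[:, None, :] - coords[None, :, :]

# Collapsing the XYZ axis with the L2 norm gives a symmetric distance matrix.
dist = np.linalg.norm(rel_pos, axis=-1)
```

The adjacency matrix, by contrast, is binary and comes straight from the bond graph rather than from the coordinates.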
This pipeline (in its current form) seems invalid: a linear chemical representation of only the variant residues does not account for the stability of the full protein, and thus cannot accurately represent protein fitness.

Rather, since each variant has modifications at certain residue locations (within the larger PDB file), it would make more sense to substitute each mutated residue within the PDB file and then optimize the energy of the whole structure at once. This would give a more accurate representation of the variant(s) during training and, perhaps, lead to more accurate results.
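As a small building block in that direction, the substitutions themselves are easy to express at the sequence level before the mutated residues are rebuilt and minimized in the structure. The helper below is hypothetical (its name, the toy sequence, and the mutation sites are not from this repo):

```python
def apply_variant(parent_seq: str, positions: list[int], variant: str) -> str:
    """Return the parent sequence with the variant's residues substituted
    at the given 0-indexed positions (hypothetical helper)."""
    if len(positions) != len(variant):
        raise ValueError("one position per substituted residue is required")
    seq = list(parent_seq)
    for pos, aa in zip(positions, variant):
        seq[pos] = aa
    return "".join(seq)

# Illustrative only: a toy parent sequence and made-up mutation sites.
mutant = apply_variant("ACDEFGHIKL", [1, 3, 5, 7], "TNTY")
```

The resulting full-length mutant sequence, rather than the bare 4-residue string, would then drive the PDB-level residue replacement and energy minimization.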