Preprocessing Pipeline

ayushkarnawat commented 4 years ago

The current data preprocessing pipeline uses the following steps:

1. Converts each variant peptide string into its canonical SMILES format (as a linear chain)
2. Embeds and optimizes the 3D coordinates using RDKit's built-in ETKDG conformer generator
3. Computes features:
- Molecular (atom-level) features
- Adjacency matrix
- Relative position matrix

As an example, let's take a look at the variant TNTY. The variant gets converted to the following SMILES string: N[C@@]([H])([C@]([H])(O2)C)C(=O)N[C@@]([H])(CC(=O)N)C(=O)N[C@@]([H])([C@]([H])(O)C)C(=O)N[C@@]([H])(Cc1ccc(O)cc1)C(=O)2, and has the following 3D coordinates:

     RDKit          3D

 34 35  0  0  0  0  0  0  0  0999 V2000
   -2.2998   -4.2226   -0.6024 N   0  0  0  0  0  0  0  0  0  0  0  0
   -2.0202   -2.9076    0.0111 C   0  0  1  0  0  0  0  0  0  0  0  0
   -0.4985   -2.9587    0.2720 C   0  0  2  0  0  0  0  0  0  0  0  0
   -0.0619   -1.8793    0.9854 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.2750   -3.3734   -0.9304 C   0  0  0  0  0  0  0  0  0  0  0  0
   -2.4082   -1.9090   -0.9726 C   0  0  0  0  0  0  0  0  0  0  0  0
   -1.6246   -1.8739   -2.0016 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.4571   -0.9952   -1.0100 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.9259   -0.0313   -0.0767 C   0  0  1  0  0  0  0  0  0  0  0  0
   -3.4952   -0.2728    1.3616 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.0343    0.7726    2.2491 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.7476    1.6840    1.7989 O   0  0  0  0  0  0  0  0  0  0  0  0
   -3.7390    0.7407    3.6243 N   0  0  0  0  0  0  0  0  0  0  0  0
   -3.6982    1.3767   -0.4885 C   0  0  0  0  0  0  0  0  0  0  0  0
   -4.7027    1.9242   -1.0668 O   0  0  0  0  0  0  0  0  0  0  0  0
   -2.5628    2.1634   -0.3329 N   0  0  0  0  0  0  0  0  0  0  0  0
   -1.2072    1.8855   -0.7962 C   0  0  1  0  0  0  0  0  0  0  0  0
   -0.7634    2.9824   -1.7802 C   0  0  2  0  0  0  0  0  0  0  0  0
   -1.6733    2.9417   -2.8308 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.6425    2.8014   -2.2307 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.2625    1.9115    0.3227 C   0  0  0  0  0  0  0  0  0  0  0  0
   -0.5984    2.5764    1.3636 O   0  0  0  0  0  0  0  0  0  0  0  0
    1.0046    1.2752    0.3904 N   0  0  0  0  0  0  0  0  0  0  0  0
    1.2617   -0.0889   -0.0691 C   0  0  1  0  0  0  0  0  0  0  0  0
    2.5586   -0.2613   -0.7710 C   0  0  0  0  0  0  0  0  0  0  0  0
    3.7865    0.0447   -0.0569 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.2734    1.3325   -0.0775 C   0  0  0  0  0  0  0  0  0  0  0  0
    5.4405    1.6941    0.5547 C   0  0  0  0  0  0  0  0  0  0  0  0
    6.1241    0.7084    1.2244 C   0  0  0  0  0  0  0  0  0  0  0  0
    7.3271    0.9988    1.8969 O   0  0  0  0  0  0  0  0  0  0  0  0
    5.6773   -0.5906    1.2724 C   0  0  0  0  0  0  0  0  0  0  0  0
    4.5064   -0.9068    0.6264 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.9824   -1.0513    1.0248 C   0  0  0  0  0  0  0  0  0  0  0  0
    1.7969   -1.0264    1.9873 O   0  0  0  0  0  0  0  0  0  0  0  0
  2  1  1  6
  2  3  1  0
  3  4  1  0
  3  5  1  6
  2  6  1  0
  6  7  2  0
  6  8  1  0
  8  9  1  0
  9 10  1  1
 10 11  1  0
 11 12  2  0
 11 13  1  0
  9 14  1  0
 14 15  2  0
 14 16  1  0
 17 16  1  1
 17 18  1  0
 18 19  1  0
 18 20  1  6
 17 21  1  0
 21 22  2  0
 21 23  1  0
 23 24  1  0
 24 25  1  1
 25 26  1  0
 26 27  2  0
 27 28  1  0
 28 29  2  0
 29 30  1  0
 29 31  1  0
 31 32  2  0
 24 33  1  0
 33 34  2  0
 32 26  1  0
 33  4  1  0
M  END
>  <Fitness>  (1) 
0.0

$$$$

The respective BasePreprocessor instance converts these coordinates into three feature sets:

- Molecular features
- Adjacency matrix
- Relative position matrix

NOTE: Usually, the relative position matrix is computed as separate XYZ offsets between each pair of atoms (shape=(n_atoms, n_atoms, 3)). For the sake of the visualization, however, the L2 norm of each offset was taken, so what is shown is essentially a distance matrix.
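To make the three feature sets concrete, here is a minimal sketch in plain Python. It uses only the first five atoms of the TNTY record above and the bonds between them. The atom-level features shown (one-hot element plus degree) are an assumption for illustration, since the actual feature set computed by BasePreprocessor is not listed in this issue.

```python
import math

# First five atoms of the TNTY record above: (element, x, y, z).
atoms = [
    ("N", -2.2998, -4.2226, -0.6024),
    ("C", -2.0202, -2.9076, 0.0111),
    ("C", -0.4985, -2.9587, 0.2720),
    ("O", -0.0619, -1.8793, 0.9854),
    ("C", 0.2750, -3.3734, -0.9304),
]
# Bonds among those five atoms, 1-indexed as in the record.
bonds = [(2, 1), (2, 3), (3, 4), (3, 5)]

n = len(atoms)
coords = [atom[1:] for atom in atoms]

# Adjacency matrix: symmetric, 1 where two atoms share a bond.
adjacency = [[0] * n for _ in range(n)]
for a, b in bonds:
    adjacency[a - 1][b - 1] = 1
    adjacency[b - 1][a - 1] = 1

# Relative position matrix: rel_pos[i][j] = xyz_j - xyz_i, shape (n, n, 3).
rel_pos = [
    [tuple(cj - ci for ci, cj in zip(coords[i], coords[j])) for j in range(n)]
    for i in range(n)
]

# L2 norm over the last axis collapses it to a distance matrix, shape (n, n).
dist = [
    [math.sqrt(sum(d * d for d in rel_pos[i][j])) for j in range(n)]
    for i in range(n)
]

# Hypothetical atom-level features: one-hot element followed by degree.
elements = ["C", "N", "O"]
features = [
    [1 if atom[0] == e else 0 for e in elements] + [sum(adjacency[i])]
    for i, atom in enumerate(atoms)
]
```

The degrees here only count bonds inside the five-atom subset. On the full molecule the same loops apply unchanged, with n = 34.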

This pipeline (in its current form) seems invalid: a linear chemical representation of the variant residues does not account for actual protein stability and thus cannot accurately represent protein fitness.

Rather, since each variant has modifications at certain residue locations (within the bigger PDB file), it would make more sense to replace and modify each residue within the PDB file and optimize the energy of the whole structure together. This would give a more accurate representation of the variant(s) during training and, perhaps, lead to more accurate results.

ayushkarnawat commented 4 years ago

Alternatively, instead of optimizing coordinates (which is quite a slow process), an educated guess of the rotamer positions (i.e. bond lengths, bond angles, and dihedral angles) would likely be good enough for our purposes. Related to #4.
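As a rough sketch of what placing atoms from lengths, angles, and dihedrals could look like, the snippet below converts one set of internal coordinates into a Cartesian position given three already-placed atoms. The function name place_atom, the reference frame, and the numeric values (1.53 Å, 109.5°, and the three staggered dihedrals) are illustrative assumptions, not values taken from this repository or from a rotamer library.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def cross(u, v):
    return (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )


def unit(u):
    norm = math.sqrt(sum(a * a for a in u))
    return tuple(a / norm for a in u)


def place_atom(a, b, c, length, angle_deg, dihedral_deg):
    """Place atom d so that |c-d| = length, angle(b, c, d) = angle_deg,
    and dihedral(a, b, c, d) = dihedral_deg."""
    theta = math.radians(angle_deg)
    phi = math.radians(dihedral_deg)
    bc = unit(sub(c, b))            # axis along the b -> c bond
    n = unit(cross(sub(b, a), bc))  # normal to the a-b-c plane
    m = cross(n, bc)                # completes the local right-handed frame
    local = (
        -length * math.cos(theta),
        length * math.sin(theta) * math.cos(phi),
        length * math.sin(theta) * math.sin(phi),
    )
    return tuple(
        c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
        for i in range(3)
    )


# Three already-placed atoms in a toy frame (assumed, for illustration only).
a, b, c = (0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)

# One candidate position per staggered dihedral.
rotamers = {chi: place_atom(a, b, c, 1.53, 109.5, chi) for chi in (-60.0, 60.0, 180.0)}
```

Each placement is a handful of multiplications, so enumerating rotamers this way is far cheaper than running a coordinate optimization per variant. The trade-off is that nothing here checks for steric clashes.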

ayushkarnawat commented 4 years ago

Closed via #8

ayushkarnawat / profit

Preprocessing Pipeline #2