DeepRank / 3D-Vac

Personalized cancer vaccine design through 3D modelling boosted geometric learning.
Apache License 2.0

pMHCI data exploration - BA quantitative and = only #113

Closed: gcroci2 closed this issue 3 months ago

gcroci2 commented 1 year ago

It's useful to have a detailed overview of the new data before starting to train the models. The data's main metafeatures are described in /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_only_eq.csv (updated 29/03/2023), and their 3D models generated with PANDORA are in /projects/0/einf2380/data/pMHCI/features_input_folder/HLA_quantitative, together with their PSSMs. The notebook that I used to explore the data is src/3_build_db4/GNN/0_metafeatures_exploration.ipynb.
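For reference, this is a minimal sketch of the kind of exploration done in the notebook, assuming pandas; the path is the one above, everything else is illustrative.

```python
import pandas as pd

# CSV with the main metafeatures (path taken from this issue)
csv_path = (
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_only_eq.csv"
)
df = pd.read_csv(csv_path)

print(df.shape)                    # number of data points and metafeatures
print(df.isna().sum())             # missing values per column
print(df.dtypes)                   # column types
print(df.describe(include="all"))  # quick summary of every metafeature
```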

Important: please answer by referring to the numbered bullet point and quoting only the sentence/question you're commenting on.

1. Missing data points

[figure: overview of missing data points]

From now on the information refers to the actual data in the hdf5 files, i.e. only the data points for which we have PDB models (100178 in total).

2. Measurement type, kind and inequalities

We are using only data with quantitative measurement type, and affinity measurement kind, with measurement_inequality only equal to =.

In the bigger dataset we have different measurement inequalities (>, <, =); for now we decided to include only the ones mentioned above. An inequality means that the measure is either > or < than the reported value, which can make the data very noisy, since such values are not well determined.
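A sketch of this filtering applied to the larger dataset, assuming MHCFlurry-style column names (measurement_inequality is named explicitly in this issue; measurement_type and measurement_kind are assumptions):

```python
import pandas as pd

df = pd.read_csv("mhcflurry_ba_data.csv")  # hypothetical local copy of the larger dataset

filtered = df[
    (df["measurement_type"] == "quantitative")
    & (df["measurement_kind"] == "affinity")
    & (df["measurement_inequality"] == "=")
]
print(f"{len(filtered)} / {len(df)} rows kept")
```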

3. Allele types and peptides' lengths

[figure: allele types and peptide length distributions]

In the CSV, the peptides' lengths span from 8 to 21, even though peptides should have a maximum length of 15. I checked the original CSV file taken from MHCFlurry 2.0 (S3, you can find it here) and there are no peptides longer than 15 there. There are 88 data points with peptide length > 15:

[figure: the 88 data points with peptide length > 15]

Heleen used a different CSV to filter those data (from the MHCFlurry GitHub repo), which probably does not filter out peptides longer than 15. Anyway, we decided not to filter them out, because they are not a significant number of data points and should have essentially no influence on the training. Additionally, since we are trying to develop a model that generalizes according to the complexes' structure, keeping them may actually be useful, as it increases the variability of peptide lengths (if they have any influence at all).
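This is how the long peptides could be spotted, assuming the CSV has a peptide column (a plausible name, not confirmed here):

```python
import pandas as pd

df = pd.read_csv(
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_only_eq.csv"
)
lengths = df["peptide"].str.len()

print(lengths.value_counts().sort_index())            # distribution of peptide lengths
print((lengths > 15).sum(), "data points with peptide length > 15")
```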

4. Peptides' clusters

The file /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv contains the peptide cluster sets created with GibbsCluster.

cluster_set_10 (containing clusters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) has been chosen for the peptide clustering experiments. The cluster that turned out to be the most distant from the others is cluster 3.

The clusters not included in the test set (including NaN ones) should be shuffled between the validation and training sets.

If we get bad performance this way (or maybe we can try this a priori at a later stage), we can cluster the validation and training sets as well. That would give us a more indicative idea during the training phase of how well the network is generalizing, and it may help the network itself. On the other hand, if we have few data for the test cluster, it may be tricky to do so and still have enough data for testing (that's the main reason why training and validation sets are usually shuffled rather than clustered). A sketch of the split is below.
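A minimal sketch of the split described in this point, assuming the clusters CSV has a column named cluster_set_10 aligned with the modelled data points; column names and split proportions are assumptions:

```python
import numpy as np
import pandas as pd

clusters = pd.read_csv(
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv"
)

# Most distant cluster (3, from cluster_set_10) goes to the test set.
test_mask = clusters["cluster_set_10"] == 3
test_idx = clusters.index[test_mask].to_numpy()

# Remaining clusters (including NaN) are shuffled between training and validation.
rest_idx = clusters.index[~test_mask].to_numpy()
rng = np.random.default_rng(42)
rng.shuffle(rest_idx)

n_val = int(0.1 * len(rest_idx))              # illustrative 10% validation split
val_idx, train_idx = rest_idx[:n_val], rest_idx[n_val:]
print(len(train_idx), len(val_idx), len(test_idx))
```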

5. Training, validation, testing

General data overview

[figure: general data overview]

Clustering on peptides

[figure: split obtained by clustering on peptides]

Using cluster value 3 for the testing set (from cluster_set_10, as mentioned above). Training set: 92556 samples, 92%

Clustering on alleles

[figure: split obtained by clustering on alleles]

Using cluster value 1 for the testing set. Training set: 89779 samples, 90%

Experiments

I'll refer here to the validation set as the one used for evaluation during training and to the testing set as the one used after training.

Possible general phases for the first rounds:

  • 1.1: data standardization
  • 1.2: weight the loss function (only for classification; see the sketch below)
  • 1.3: increase the net size
  • 1.4: improve features standardization
  • 1.5: same flow with a regression task. Note that if we will need MS data as well, since they are only 0s and 1s, we may need to keep classification for adding those data. If results are already good with BA data only, we may want to switch to regression to match the state of the art. Discussion on this point is welcome.
  • 1.6: cross-validate the best experiment (at least 2 folds; we need to decide which cluster will be placed in the second fold - the second most distant one)
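As an illustration of point 1.2, this is one way to weight the loss for an unbalanced binary binder/non-binder classification, assuming PyTorch; the class counts are made up and this is not the project's actual training code:

```python
import torch
import torch.nn as nn

n_pos, n_neg = 30_000, 70_000               # hypothetical class counts
pos_weight = torch.tensor([n_neg / n_pos])  # up-weight the minority (positive) class

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                  # dummy model outputs
targets = torch.randint(0, 2, (8, 1)).float()
print(criterion(logits, targets).item())
```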

After this round of experiments we can evaluate and investigate feature importance.

In general, most of the experiments should be done on the shuffled data, and once we find the "best" general configuration we can try out the clustering experiments (2 and 3). This is of course not the only way of proceeding (it depends a lot on what you want to prove), but so far we decided to move in this direction.

We discussed these experiments at the last meetings, but please let me know what you think about this plan :)

DanLep97 commented 1 year ago

Do we want to perform cross-validation on training(+validation) only, then testing, or on testing as well? Meaning: do we want to change the testing set as well in each of the k folds, or only the training(+validation) set?

  • Regardless of the clustering criteria (alleles, sequence motifs... several experiments can be done), the validation set has to be a subset of the same criteria used to cluster the peptides. This way we actually make sure that the model is learning. For the first experiment on 9-mers, the validation and training sets were shuffled and stratified from the same clusters.
  • Metrics should be an average over the performances on the k-fold test sets. The way to achieve this is leave-one-cluster-out, which is in fact mandatory: if the test set is the same for each trained model, we lose the whole point of clustering the data. The test set has to be unique for each fold, so that the final average value reflects the actual average generalization performance (see the sketch below).
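A minimal leave-one-cluster-out loop, assuming per-sample cluster labels and scikit-learn; the data and the model are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

X = np.random.rand(200, 16)              # dummy features
y = np.random.randint(0, 2, 200)         # dummy binder / non-binder labels
groups = np.random.randint(0, 5, 200)    # dummy cluster label of each sample

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # train on all clusters but one, test on the held-out cluster
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))                   # averaged over the held-out clusters
```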

Do we want to perform cross-validation based on what? Clusters, alleles, or peptides' lengths?

Cross-validation for now is achieved on different kinds of clustering:

  • based on alleles.
  • based on sequence motifs for the same allele (my first experiment) and based on all alleles.

Please correct me if I'm wrong.

gcroci2 commented 1 year ago
  • Regardless of the clustering criteria (alleles, sequence motifs.. several experiments can be done) validation has to be a subset of the same criteria used to cluster peptides. This way we actually make sure that the model is learning. For the first experiment on 9-mers, the validation and training set were shuffled and stratified from the same clusters.

Actually, there are no absolute rules on this: you could either do random shuffling (as you did in the 9-mers experiment), or create the folds stratifying on the target (which is not equivalent to random splitting when the target class is unbalanced, as in our case; see the sketch below), or even create the folds according to another variable (e.g. clusters). It can vary a lot among different fields, and it heavily depends on the data. That's why I think we should carefully reflect on it and come to an agreed conclusion.
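A sketch of target-stratified folds, assuming a binarized target and scikit-learn; StratifiedKFold keeps the class ratio roughly constant in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(200, 16)       # dummy features
y = np.random.randint(0, 2, 200)  # dummy binder / non-binder labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # class balance is preserved in both parts of every fold
    print(fold, y[train_idx].mean(), y[val_idx].mean())
```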

Metrics should be an average over the performances on the k-fold test sets. The way to achieve this is leave-one-cluster-out, which is in fact mandatory: if the test set is the same for each trained model, we lose the whole point of clustering the data. The test set has to be unique for each fold, so that the final average value reflects the actual average generalization performance.

It depends on whether you're performing cross-validation on training(+validation) and testing, or only on training(+validation). Also in this case, some researchers do the former and others the latter. We should reflect on which one suits our case best.

gcroci2 commented 1 year ago

Cross-validation for now is achieved on different kinds of clustering:

  • based on alleles.
  • based on sequence motifs for the same allele (my first experiment) and based on all alleles.

To which of the two bullet points does the column cluster in the CSV refer?

DanLep97 commented 1 year ago

To which of the two bullet points does the column cluster in the CSV refer?

Second bullet point. The cluster is always referring to the sequence motif cluster.

sonjageorgievska commented 1 year ago

Ok, I read it. Finally :). There is indeed no single rule for any of these questions; we can best discuss this.

If you use the test sets in the cross-validation, then the final result is too optimistic, because data from the test sets has been used in the training, and you will then need yet another independent test set (cross-validation with a circulating test set is mainly there to inspect the variance of the performances). Alternatively, you could set aside a test set that you never use and perform cross-validation on the rest of the data (with circulating train/val/test); then you can report all results (see the sketch below).

For splitting the data, you can indeed perform experiments with all data shuffled (to show average-case performance), but you can also train where clusters do not overlap between train/val/test (to show, I guess, worst-case performance). Unless the clusters were meant to de-correlate the data (I forgot what they were meant for :)), in which case it is indeed the second option: split per cluster.

Since the data is imbalanced you can use a weighted loss function or balancing. Then, there are multiple ways to make the validation set; it depends on the metric - accuracy, or Matthews correlation coefficient? Etc, etc
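One way to implement "set aside a test set that you never use and perform cross-validation on the rest", with Matthews correlation coefficient as the metric, assuming scikit-learn; data and model are placeholders, not the project's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.rand(300, 16)       # dummy features
y = np.random.randint(0, 2, 300)  # dummy binder / non-binder labels

# Held-out test set, untouched during cross-validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

cv_scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_rest, y_rest):
    model = LogisticRegression(max_iter=1000).fit(X_rest[train_idx], y_rest[train_idx])
    cv_scores.append(matthews_corrcoef(y_rest[val_idx], model.predict(X_rest[val_idx])))

print("CV MCC:", np.mean(cv_scores))

# Final model trained on all non-test data, reported once on the held-out test set.
final = LogisticRegression(max_iter=1000).fit(X_rest, y_rest)
print("Test MCC:", matthews_corrcoef(y_test, final.predict(X_test)))
```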