DeepRank / 3D-Vac

Personalized cancer vaccine design through 3D modelling boosted geometric learning.
Apache License 2.0

pMHCI data exploration - BA quantitative and = only #113

Closed: gcroci2 closed this issue 3 months ago

gcroci2 commented 1 year ago

It's useful to have a detailed overview of the new data before starting to train the models. The data's main metafeatures are described in /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_only_eq.csv (updated 29/03/2023), and their 3D models generated with PANDORA are in /projects/0/einf2380/data/pMHCI/features_input_folder/HLA_quantitative, together with their PSSMs. The notebook that I used to explore the data is src/3_build_db4/GNN/0_metafeatures_exploration.ipynb.
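For reference, this is a minimal sketch of the kind of exploration done in the notebook, assuming pandas; the path is the one above, everything else is illustrative.

```python
import pandas as pd

# CSV with the main metafeatures (path taken from this issue)
csv_path = (
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_only_eq.csv"
)
df = pd.read_csv(csv_path)

print(df.shape)                    # number of data points and metafeatures
print(df.isna().sum())             # missing values per column
print(df.dtypes)                   # column types
print(df.describe(include="all"))  # quick summary of every metafeature
```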

Important: please answer by referring to the numbered bullet point and quoting only the sentence/question you're commenting on.

1. Missing data points

[figure: overview of missing data points]

From now on the information refers to the actual data in the hdf5 files, i.e. only the data points for which we have PDB models (100178 in total).

2. Measurement type, kind and inequalities

We are using only data with quantitative measurement type, and affinity measurement kind, with measurement_inequality only equal to =.

In the bigger dataset we have different measurement inequalities (>, <, =); for now we decided to include only the ones mentioned above. An inequality means that the measure is either > or < than the reported value, which can make the data very noisy, since such values are not well determined.
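A sketch of this filtering applied to the larger dataset, assuming MHCFlurry-style column names (measurement_inequality is named explicitly in this issue; measurement_type and measurement_kind are assumptions):

```python
import pandas as pd

df = pd.read_csv("mhcflurry_ba_data.csv")  # hypothetical local copy of the larger dataset

filtered = df[
    (df["measurement_type"] == "quantitative")
    & (df["measurement_kind"] == "affinity")
    & (df["measurement_inequality"] == "=")
]
print(f"{len(filtered)} / {len(df)} rows kept")
```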

3. Allele types and peptides' lengths

[figure: allele types and peptide length distributions]

In the CSV, the peptides' lengths span from 8 to 21, even though peptides should have a maximum length of 15. I checked the original CSV file taken from MHCFlurry 2.0 (S3, you can find it here) and there are no peptides longer than 15 there. There are 88 data points with peptide length > 15:

[figure: the 88 data points with peptide length > 15]

Heleen used a different CSV to filter those data (from the MHCFlurry GitHub repo), which probably does not filter out peptides longer than 15. Anyway, we decided not to filter them out, because they are not a significant number of data points and should have essentially no influence on the training. Additionally, since we are trying to develop a model that generalizes according to the complexes' structure, keeping them may actually be useful, as it increases the variability of peptide lengths (if they have any influence at all).
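This is how the long peptides could be spotted, assuming the CSV has a peptide column (a plausible name, not confirmed here):

```python
import pandas as pd

df = pd.read_csv(
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_only_eq.csv"
)
lengths = df["peptide"].str.len()

print(lengths.value_counts().sort_index())            # distribution of peptide lengths
print((lengths > 15).sum(), "data points with peptide length > 15")
```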

4. Peptides' clusters

The file /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv contains the peptide cluster sets created with GibbsCluster.

cluster_set_10 (containing clusters 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10) has been chosen for the peptide clustering experiments. The cluster that turned out to be the most distant from the others is cluster 3.

The clusters not included in the test set (including NaN ones) should be shuffled between the validation and training sets.

If we get bad performance this way (or maybe we can try this a priori at a later stage), we can cluster the validation and training sets as well. That would give us a more indicative idea during the training phase of how well the network is generalizing, and it may help the network itself. On the other hand, if we have few data for the test cluster, it may be tricky to do so and still have enough data for testing (that's the main reason why training and validation sets are usually shuffled rather than clustered). A sketch of the split is below.
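A minimal sketch of the split described in this point, assuming the clusters CSV has a column named cluster_set_10 aligned with the modelled data points; column names and split proportions are assumptions:

```python
import numpy as np
import pandas as pd

clusters = pd.read_csv(
    "/projects/0/einf2380/data/external/processed/I/"
    "BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv"
)

# Most distant cluster (3, from cluster_set_10) goes to the test set.
test_mask = clusters["cluster_set_10"] == 3
test_idx = clusters.index[test_mask].to_numpy()

# Remaining clusters (including NaN) are shuffled between training and validation.
rest_idx = clusters.index[~test_mask].to_numpy()
rng = np.random.default_rng(42)
rng.shuffle(rest_idx)

n_val = int(0.1 * len(rest_idx))              # illustrative 10% validation split
val_idx, train_idx = rest_idx[:n_val], rest_idx[n_val:]
print(len(train_idx), len(val_idx), len(test_idx))
```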

5. Training, validation, testing

General data overview

[figure: general data overview]

Clustering on peptides

[figure: split obtained by clustering on peptides]

Using cluster value 3 for the testing set (from cluster_set_10, as mentioned above). Training set: 92556 samples, 92%

Clustering on alleles

[figure: split obtained by clustering on alleles]

Using cluster value 1 for the testing set. Training set: 89779 samples, 90%

Experiments

I'll refer here to the validation set as the one used for evaluation during training and to the testing set as the one used after training.

Possible general phases for the first rounds:

  • 1.1: data standardization
  • 1.2: weight the loss function (only for classification; see the sketch below)
  • 1.3: increase the net size
  • 1.4: improve features standardization
  • 1.5: same flow with a regression task. Note that if we will need MS data as well, since they are only 0s and 1s, we may need to keep classification for adding those data. If results are already good with BA data only, we may want to switch to regression to match the state of the art. Discussion on this point is welcome.
  • 1.6: cross-validate the best experiment (at least 2 folds; we need to decide which cluster will be placed in the second fold - the second most distant one)
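As an illustration of point 1.2, this is one way to weight the loss for an unbalanced binary binder/non-binder classification, assuming PyTorch; the class counts are made up and this is not the project's actual training code:

```python
import torch
import torch.nn as nn

n_pos, n_neg = 30_000, 70_000               # hypothetical class counts
pos_weight = torch.tensor([n_neg / n_pos])  # up-weight the minority (positive) class

criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 1)                  # dummy model outputs
targets = torch.randint(0, 2, (8, 1)).float()
print(criterion(logits, targets).item())
```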

After this round of experiments we can evaluate and investigate feature importance.

In general, most of the experiments should be done on the shuffled data, and once we find the "best" general configuration we can try out the clustering experiments (2 and 3). This is of course not the only way of proceeding (it depends a lot on what you want to prove), but so far we decided to move in this direction.

We discussed these experiments at the last meetings, but please let me know what you think about this plan :)

DanLep97 commented 1 year ago

Do we want to perform cross-validation on training(+validation) only, then testing, or on testing as well? Meaning: do we want to change the testing set as well in each of the k folds, or only the training(+validation) set?

  • Regardless of the clustering criteria (alleles, sequence motifs... several experiments can be done), the validation set has to be a subset of the same criteria used to cluster the peptides. This way we actually make sure that the model is learning. For the first experiment on 9-mers, the validation and training sets were shuffled and stratified from the same clusters.
  • Metrics should be an average over the performances on the k-fold test sets. The way to achieve this is leave-one-cluster-out, which is in fact mandatory: if the test set is the same for each trained model, we lose the whole point of clustering the data. The test set has to be unique for each fold, so that the final average value reflects the actual average generalization performance (see the sketch below).
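A minimal leave-one-cluster-out loop, assuming per-sample cluster labels and scikit-learn; the data and the model are placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

X = np.random.rand(200, 16)              # dummy features
y = np.random.randint(0, 2, 200)         # dummy binder / non-binder labels
groups = np.random.randint(0, 5, 200)    # dummy cluster label of each sample

scores = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    # train on all clusters but one, test on the held-out cluster
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))

print(np.mean(scores))                   # averaged over the held-out clusters
```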

Do we want to perform cross-validation based on what? Clusters, alleles, or peptides' lengths?

Cross-validation for now is achieved on different kinds of clustering:

  • based on alleles.
  • based on sequence motifs for the same allele (my first experiment) and based on all alleles.

Please correct me if I'm wrong.

gcroci2 commented 1 year ago
  • Regardless of the clustering criteria (alleles, sequence motifs.. several experiments can be done) validation has to be a subset of the same criteria used to cluster peptides. This way we actually make sure that the model is learning. For the first experiment on 9-mers, the validation and training set were shuffled and stratified from the same clusters.

Actually, there are no absolute rules on this: you could either do random shuffling (as you did in the 9-mers experiment), or create the folds stratifying on the target (which is not equivalent to random splitting when the target class is unbalanced, as in our case; see the sketch below), or even create the folds according to another variable (e.g. clusters). It can vary a lot among different fields, and it heavily depends on the data. That's why I think we should carefully reflect on it and come to an agreed conclusion.
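A sketch of target-stratified folds, assuming a binarized target and scikit-learn; StratifiedKFold keeps the class ratio roughly constant in every fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.random.rand(200, 16)       # dummy features
y = np.random.randint(0, 2, 200)  # dummy binder / non-binder labels

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # class balance is preserved in both parts of every fold
    print(fold, y[train_idx].mean(), y[val_idx].mean())
```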

Metrics should be an average over the performances on the k-fold test sets. The way to achieve this is leave-one-cluster-out, which is in fact mandatory: if the test set is the same for each trained model, we lose the whole point of clustering the data. The test set has to be unique for each fold, so that the final average value reflects the actual average generalization performance.

It depends on whether you're performing cross-validation on training(+validation) and testing, or only on training(+validation). Also in this case, some researchers do the former and others the latter. We should reflect on which one suits our case best.

gcroci2 commented 1 year ago

Cross-validation for now is achieved on different kinds of clustering:

  • based on alleles.
  • based on sequence motifs for the same allele (my first experiment) and based on all alleles.

To which of the two bullet points does the column cluster in the CSV refer?

DanLep97 commented 1 year ago

To which of the two bullet points does the column cluster in the CSV refer?

Second bullet point. The cluster is always referring to the sequence motif cluster.

sonjageorgievska commented 1 year ago

Ok, I read it. Finally :). There is indeed no single rule for any of these questions; we can best discuss this.

If you use the test sets in the cross-validation, then the final result is too optimistic, because data from the test sets has been used in the training, and you will then need yet another independent test set (cross-validation with a circulating test set is mainly there to inspect the variance of the performances). Alternatively, you could set aside a test set that you never use and perform cross-validation on the rest of the data (with circulating train/val/test); then you can report all results (see the sketch below).

For splitting the data, you can indeed perform experiments with all data shuffled (to show average-case performance), but you can also train where clusters do not overlap between train/val/test (to show, I guess, worst-case performance). Unless the clusters were meant to de-correlate the data (I forgot what they were meant for :)), in which case it is indeed the second option: split per cluster.

Since the data is imbalanced you can use a weighted loss function or balancing. Then, there are multiple ways to make the validation set; it depends on the metric - accuracy, or Matthews correlation coefficient? Etc, etc
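One way to implement "set aside a test set that you never use and perform cross-validation on the rest", with Matthews correlation coefficient as the metric, assuming scikit-learn; data and model are placeholders, not the project's code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.random.rand(300, 16)       # dummy features
y = np.random.randint(0, 2, 300)  # dummy binder / non-binder labels

# Held-out test set, untouched during cross-validation.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0
)

cv_scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(X_rest, y_rest):
    model = LogisticRegression(max_iter=1000).fit(X_rest[train_idx], y_rest[train_idx])
    cv_scores.append(matthews_corrcoef(y_rest[val_idx], model.predict(X_rest[val_idx])))

print("CV MCC:", np.mean(cv_scores))

# Final model trained on all non-test data, reported once on the held-out test set.
final = LogisticRegression(max_iter=1000).fit(X_rest, y_rest)
print("Test MCC:", matthews_corrcoef(y_test, final.predict(X_test)))
```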