Evaluate features importance

DeepRank / 3D-Vac

Personalized cancer vaccine design through 3D modelling boosted geometric learning.

Apache License 2.0


Evaluate features importance #90

Closed gcroci2 closed 1 year ago

gcroci2 commented 2 years ago

We should evaluate the importance of the features currently present in deeprankcore. Captum may be useful for this, see its Getting Started guide. An alternative is to remove one feature at a time and evaluate how the performance changes.
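The "remove one feature at a time" option could look roughly like the sketch below. It is not deeprankcore code: `train_and_score` is a hypothetical callable standing in for a full training run that returns a validation metric (higher is better), and the toy contributions at the bottom are made up only so the result is easy to check.

```python
from typing import Callable, Dict, Sequence


def ablation_importance(
    features: Sequence[str],
    train_and_score: Callable[[Sequence[str]], float],
) -> Dict[str, float]:
    """Leave-one-feature-out importance.

    For each feature, retrain without it and report how much the score
    drops relative to the baseline trained on all features. A larger
    positive value means the model relied more on that feature.
    """
    baseline = train_and_score(list(features))
    importance = {}
    for name in features:
        reduced = [f for f in features if f != name]
        importance[name] = baseline - train_and_score(reduced)
    return importance


# Toy stand-in for a training run (hypothetical): the "score" is just the
# sum of fixed, made-up per-feature contributions.
_toy_contribution = {"distance": 0.20, "res_type": 0.10, "bsa": 0.0}


def toy_train_and_score(selected: Sequence[str]) -> float:
    return 0.5 + sum(_toy_contribution[f] for f in selected)


scores = ablation_importance(list(_toy_contribution), toy_train_and_score)
ranked = sorted(scores, key=scores.get, reverse=True)
```

One caveat with this approach: when two features are strongly correlated, dropping either one alone may barely change the score, so each can look unimportant even if the pair matters.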

gcroci2 commented 1 year ago

After finishing the experiments in #134, where we concluded that our structure-based models are not learning the physics behind the complexes' interactions but are instead learning (and thus overfitting) the data, we decided to spend some time evaluating feature importance.

It may be that using all the features we have (~30 features, see issue #141 for more details about them and their distributions) gives the network too much noise and prevents it from using the geometrical features that we provide.
Many features are very likely correlated with each other.

Reduce features to only geometric ones

Common features among the following experiments:

- Only distance (edge feature) and res_type (node feature, one-hot encoded) as features
- Data used are in data/pMHCI/features_output_folder/GNN/residue/230329/ (generated in #140)
  - Residue-level queries
- Naive GNN
- Standardization applied to all features
- Batch size 16
- Cross entropy loss
- Adam optimizer
- 70 epochs, min_epoch 45, earlystop_patience 20, earlystop_maxgap 0.06
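
How the three early-stopping settings above interact can be sketched as follows. This is an interpretation based only on the parameter names, not the deeprankcore implementation: it assumes that no stop happens before `min_epoch`, that `earlystop_patience` counts epochs without validation improvement, and that `earlystop_maxgap` bounds the gap between validation and training loss. The class name `EarlyStopper` and the simulated losses are hypothetical.

```python
class EarlyStopper:
    """Assumed semantics for min_epoch / patience / maxgap (see lead-in)."""

    def __init__(self, patience=20, maxgap=0.06, min_epoch=45):
        self.patience = patience
        self.maxgap = maxgap
        self.min_epoch = min_epoch
        self.best_val = float("inf")
        self.stale_epochs = 0

    def step(self, epoch, train_loss, val_loss):
        """Return True if training should stop after this epoch."""
        # Track how long the validation loss has gone without improving.
        if val_loss < self.best_val:
            self.best_val = val_loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1

        # Never stop before the minimum number of epochs.
        if epoch < self.min_epoch:
            return False

        overfitting = (val_loss - train_loss) > self.maxgap
        stalled = self.stale_epochs >= self.patience
        return overfitting or stalled


# Simulated run with made-up losses: both decrease steadily, then the
# validation loss jumps at epoch 50, opening a gap larger than maxgap.
stopper = EarlyStopper(patience=20, maxgap=0.06, min_epoch=45)
stopped_at = None
for epoch in range(70):
    train_loss = 0.58 - 0.001 * epoch
    val_loss = 0.60 - 0.001 * epoch if epoch < 50 else 0.70
    if stopper.step(epoch, train_loss, val_loss):
        stopped_at = epoch
        break
```

Under these assumptions a run of at most 70 epochs always trains for at least 45 of them, and the gap criterion is what catches the overfitting pattern described in #134.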

Shuffled data

Stratification on target (0: 56%, 1: 44%)
exp_name exp_100k_dist_res_type_std_bs16_0
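
A stratified shuffle of this kind can be sketched in plain Python as below. The function name `stratified_split`, the 20% held-out fraction, and the seed are assumptions for illustration; the issue does not state the split sizes. The toy labels only mimic the 56% / 44% target balance.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Shuffle indices and split them so each class keeps its proportion."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])

    # Shuffle again so the classes are interleaved within each split.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)
    return train_idx, test_idx


# Toy labels with the same class balance as the target (0: 56%, 1: 44%).
labels = [0] * 56 + [1] * 44
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
test_counts = {c: sum(labels[i] == c for i in test_idx) for c in (0, 1)}
```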

Clustering on peptides

clustered data on cl_peptide Dataset (cluster_set_10 of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv)
- clusters with value 3 are assigned to the test set [%]
- the rest was shuffled and split between training and validation, stratifying on target
exp_name exp_100k_dist_res_type_std_bs16_cl_peptide_0

Clustering on (stratified) alleles

clustered data on cl_allele Dataset (allele_clustering of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_only_eq_alleleclusters_pseudoseq.csv)
- clusters with value 1 are assigned to the test set [%]
- the rest was shuffled and split between training and validation, stratifying on target
exp_name exp_100k_dist_res_type_std_bs16_cl_allele_0

Clustering on alleles

clustered data on allele_type Dataset (A, B, C, E)
- clusters with value C are assigned to the test set
- the rest was shuffled and split between training and validation, stratifying on target
exp_name exp_100k_dist_res_type_std_bs16_cl_allele_C_0
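
The three clustered experiments above share the same hold-out pattern: every sample whose cluster label equals a chosen value goes to the test set, and the remainder is later divided into training and validation. A minimal sketch, with a hypothetical helper name `split_by_cluster` and made-up toy labels:

```python
def split_by_cluster(cluster_ids, test_cluster):
    """Hold out one whole cluster as the test set.

    Returns (rest_idx, test_idx). The "rest" indices are the ones that
    would afterwards be shuffled into training and validation sets.
    """
    test_idx = [i for i, c in enumerate(cluster_ids) if c == test_cluster]
    rest_idx = [i for i, c in enumerate(cluster_ids) if c != test_cluster]
    return rest_idx, test_idx


# Toy example in the style of the allele_type experiment (labels made up).
allele_type = ["A", "B", "C", "A", "C", "E", "B"]
rest_idx, test_idx = split_by_cluster(allele_type, test_cluster="C")
```

Because a whole cluster is held out, no test sample shares its cluster with a training sample, which makes this a stricter generalization test than the shuffled split.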

gcroci2 commented 1 year ago

Next steps:

[x] Evaluate feature correlation (Pearson).

Considerations:
- Highly correlated features (>= 0.95): (irc_total, bsa), (res_pI, res_charge), (res_size, res_mass)
- Moderately correlated features (0.7 - 0.95): (irc_nonpolar_polar, bsa), (irc_total, irc_nonpolar_polar), (polarity_2, hb_acceptors), (polarity_3, hb_donors), (res_charge, hb_acceptors), (res_charge, hb_donors), (res_charge, polarity_2), (res_charge, polarity_3), (res_pI, hb_donors), (res_pI, polarity_3), (res_type_3, polarity_2), (res_type_14, hb_donors), (res_type_14, polarity_3), (res_type_14, res_pI)
[ ] Try adding the same_chain feature
[ ] Try adding the features that correlate least with each other
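
The Pearson screening in the first checklist item can be sketched in plain Python as below. The helper names and the toy columns `f1`, `f2`, `f3` are hypothetical, and thresholding on the absolute value of the coefficient is an assumption; the issue does not say whether negative correlations were counted.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlated_pairs(columns, threshold):
    """Return feature pairs whose |r| is at least `threshold`."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Made-up toy columns: f2 is an exact multiple of f1, f3 is weakly related.
columns = {
    "f1": [1, 2, 3, 4, 5],
    "f2": [2, 4, 6, 8, 10],
    "f3": [5, 1, 4, 2, 3],
}
high = correlated_pairs(columns, threshold=0.95)
```

Lowering `threshold` to 0.7 would reproduce the "moderate" band, and the features that never appear in any returned pair are natural candidates for the last checklist item.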