Closed gcroci2 closed 1 year ago
After finishing the experiments in #134, where we concluded that our structure-based models are not learning the physics behind the complexes' interactions but are instead learning (and thus overfitting) the data, we decided to spend some time evaluating feature importance.
Common features among the following experiments:
distance (edge feature) and res_type (node feature, one-hot encoded) as features

exp_100k_dist_res_type_std_bs16_0
cl_peptide: dataset (cluster_set_10 of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_all_hla_gibbs_clusters.csv), exp_100k_dist_res_type_std_bs16_cl_peptide_0
cl_allele: dataset (allele_clustering of /projects/0/einf2380/data/external/processed/I/BA_pMHCI_human_quantitative_only_eq_alleleclusters_pseudoseq.csv), exp_100k_dist_res_type_std_bs16_cl_allele_0
allele_type: dataset (A, B, C, E), exp_100k_dist_res_type_std_bs16_cl_allele_C_0
Next steps:
[x] Evaluate feature correlation (Pearson). Click on the images to view them at the original scale.
Considerations:
(irc_total, bsa), (res_pI, res_charge), (res_size, res_mass)
(irc_nonpolar_polar, bsa), (irc_total, irc_nonpolar_polar), (polarity_2, hb_acceptors), (polarity_3, hb_donors), (res_charge, hb_acceptors), (res_charge, hb_donors), (res_charge, polarity_2), (res_charge, polarity_3), (res_pI, hb_donors), (res_pI, polarity_3), (res_type_3, polarity_2), (res_type_14, hb_donors), (res_type_14, polarity_3), (res_type_14, res_pI)
[ ] Try to add same_chain feature
[ ] Try to add the features that correlate less with each other
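The Pearson check in the list above could be reproduced along these lines. This is only a minimal sketch in plain Python: the feature names are borrowed from this issue, but the values are made-up toy numbers, the 0.9 threshold is an assumption, and `correlated_pairs` is a hypothetical helper, not something deeprankcore provides.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold.

    `columns` maps a feature name to its list of values (hypothetical layout).
    """
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if not math.isnan(r) and abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    # Strongest correlations first.
    return sorted(pairs, key=lambda p: -abs(p[2]))


# Toy values for illustration only, NOT taken from the real dataset.
toy = {
    "res_size": [1.0, 2.0, 3.0, 4.0, 5.0],
    "res_mass": [10.0, 21.0, 29.0, 41.0, 50.0],
    "res_charge": [0.0, 1.0, 0.0, -1.0, 0.0],
}

# Threshold of 0.9 is an assumed cut-off, not the one used in this issue.
pairs = correlated_pairs(toy, threshold=0.9)
for name_a, name_b, r in pairs:
    print(f"({name_a}, {name_b}): r = {r:.3f}")
```

Features that only appear in pairs below the threshold would then be the candidates for "the ones which correlate less with each other".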
We should evaluate the importance of the features currently present in deeprankcore. Captum may be useful for this; see its getting-started guide. An alternative is to remove one feature at a time and evaluate the performance.
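The "remove one feature at a time" alternative could be organized roughly as follows. This is a sketch under stated assumptions: `train_and_score` is a hypothetical callable that would train the model using only the given features and return a validation metric where higher is better; here it is replaced by a toy stand-in with invented numbers so the example runs on its own.

```python
def leave_one_feature_out(features, train_and_score):
    """Score the full feature set, then each set with one feature removed.

    Returns the baseline score and, per feature, the drop in score caused
    by removing it (larger drop = more important, assuming higher is better).
    """
    baseline = train_and_score(list(features))
    drops = {}
    for feat in features:
        reduced = [f for f in features if f != feat]
        drops[feat] = baseline - train_and_score(reduced)
    return baseline, drops


# Hypothetical stand-in for a real training run: each feature adds a fixed,
# invented amount to the metric. These numbers are NOT experimental results.
TOY_CONTRIBUTION = {"distance": 0.25, "res_type": 0.125, "bsa": 0.0}


def toy_train_and_score(features):
    return 0.5 + sum(TOY_CONTRIBUTION[f] for f in features)


baseline, drops = leave_one_feature_out(list(TOY_CONTRIBUTION), toy_train_and_score)

# Rank features from most to least important according to the score drop.
ranking = sorted(drops, key=drops.get, reverse=True)
for feat in ranking:
    print(f"{feat}: drop = {drops[feat]:.3f}")
```

With real training runs each call is expensive and noisy, so repeating every configuration with several seeds before comparing drops would probably be needed. Note also that removing one of two strongly correlated features may show little drop even when the shared information matters.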