abduskhazi / PL-Binding-Affinity-Prediction-using-ML

This repository is maintained for the documentation and coding of the MSc project @ Bioinformatics Lab Uni-Freiburg.
MIT License

Weighting datapoints using measured resolution #7

Open abduskhazi opened 2 years ago

abduskhazi commented 2 years ago

According to a PDB article, the lower the resolution value, the more accurate the measurement. The measured resolution is given in the PDB Data Bank INDEX file. I duplicated the data points inversely proportionally to their respective resolutions. Now the R^2 score seems to be improved by a small fraction.

We get ~62,000 data points to train on after the duplication.
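The duplication step described above could be sketched as follows. This is a hypothetical reconstruction, not the repository's actual code: the complex names, resolution values, and the `duplication_counts` helper are made up for illustration.

```python
import numpy as np

# Hypothetical resolutions (in Angstroms) keyed by complex name.
# A lower resolution value means a more accurate structure, so it
# should contribute more copies to the training set.
resolution = {"1abc": 1.5, "2xyz": 3.0, "3pqr": 2.0}

def duplication_counts(resolution, scale=None):
    """Number of copies per data point, inversely proportional to resolution."""
    if scale is None:
        # Normalise so the worst (largest) resolution gets exactly one copy.
        scale = max(resolution.values())
    return {name: max(1, round(scale / res)) for name, res in resolution.items()}

counts = duplication_counts(resolution)
# The 1.5 A structure gets twice as many copies as the 3.0 A one.
```

With a rule like this, the total training-set size grows from the original count to the sum of all copies, which is how a ~20k set can become ~62k.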

simonbray commented 2 years ago

Good idea!

simonbray commented 2 years ago

I am curious whether you are validating against the original dataset (20k points) or the oversampled dataset with 62k?

abduskhazi commented 2 years ago

Hi Simon, the values I reported above are for the oversampled data. The number of data points is

In the recent commits, however, I am no longer oversampling. Instead, I pass the weights directly via the `fit(X, y, sample_weight)` function, since it already supports per-sample weights.

The validation error improved when I repeated the data points. I realized, however, that if I repeat the data points, the trees may no longer be completely independent (uncorrelated), which will hurt generalization. Please check https://github.com/abduskhazi/MSc-Project/blob/eb9d7425802e21bacf885f9f0d54c89f745db920/model/random_forest_regresser.py#L53-L54 — you can uncomment this to get a higher validation accuracy.
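The weighting approach from the recent commits could look roughly like this. The data here is synthetic and the exact weight formula is an assumption; the only grounded detail is that sklearn's `RandomForestRegressor.fit` accepts a `sample_weight` array (note the parameter is named `sample_weight`, singular).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins for the real features, affinities, and resolutions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)
resolution = rng.uniform(1.0, 3.5, size=100)  # Angstroms, made up

# Lower resolution value -> more trustworthy structure -> larger weight.
sample_weight = resolution.max() / resolution

# Weighting replaces duplication: no row is repeated, so each bootstrap
# sample still draws from independent data points.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y, sample_weight=sample_weight)
```

Unlike duplication, this keeps the trees' bootstrap samples drawn from distinct points while still emphasising high-quality structures.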

I want to discuss the following question with Alireza tomorrow: if I just use the sample weights in my fit function, there is not much difference in the validation error. Why?

Regards,

abduskhazi commented 2 years ago

Execution ID = 2850170738

When weight = max_resolution/resolution[complex_name], we get ~39,000 data points to train after the duplication.

Using the hyperparameters n_estimators=400, max_features=0.2, min_samples_leaf=2, the random forest regressor gives (provided you duplicate the data)

The OOB score is erroneously high here because of the repeated data points: a point can be out of the bag while an identical copy of it is inside the bag, so the out-of-bag samples are not truly unseen. Hence, we should not use this measure if we are doing data-point duplication.
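The leakage described above can be demonstrated with a quick simulation (synthetic data, not the project's code): duplicate every point once, draw one bootstrap sample, and measure how often an out-of-bag row has an identical in-bag twin.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
ids = np.repeat(np.arange(n), 2)            # every original point duplicated once
boot = rng.choice(len(ids), size=len(ids))  # one bootstrap sample (with replacement)
in_bag_ids = set(ids[boot])                 # original ids represented in the bag

# Rows never drawn into the bootstrap sample: the "out-of-bag" rows.
oob_rows = np.setdiff1d(np.arange(len(ids)), boot)

# Fraction of OOB rows whose identical duplicate was trained on.
leaked = np.mean([ids[i] in in_bag_ids for i in oob_rows])
```

With one duplicate per point, roughly 1 - 1/e (about 63%) of the out-of-bag rows have an in-bag twin, so the OOB predictions are made by trees that effectively saw those points, and the OOB score is optimistic.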

abduskhazi commented 2 years ago

Weight distribution according to the function data_weights = max_resolution/resolution[complex_name]

[image: weight_distribution]

abduskhazi commented 2 years ago

I normalized the weights before training our random forest model: https://github.com/abduskhazi/MSc-Project/blob/24436059aff2143cfaa7f8f2bc3d89d4617d8ff8/model/random_forest_regresser.py#L90-L91 The R^2 score and OOB score did not show any improvement (Execution ID = 844526751).
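That the scores are unchanged is expected if the normalisation only rescales the weights by a constant: tree splitting in sklearn depends only on relative sample weights, so dividing every weight by the same sum is a no-op for the model. A minimal sketch (values are synthetic; the real normalisation is in the linked commit):

```python
import numpy as np

# Synthetic resolutions and the weight rule discussed in this thread.
resolution = np.array([1.5, 2.0, 2.5, 3.0])
weights = resolution.max() / resolution

# Normalise so the weights sum to 1 -- a uniform rescaling.
normalized = weights / weights.sum()

# Every pairwise ratio w_i / w_j is unchanged, so the forest sees the
# same relative importance per sample before and after normalisation.
```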