Open abduskhazi opened 3 years ago
Good idea!
I am curious whether you are validating against the original dataset (20k points) or the oversampled dataset (62k points)?
Hi Simon, The values I reported above are for the oversampled data. The number of data points is
weight = (max_resolution + 1) - resolution[complex_name]
weight = max_resolution/resolution[complex_name]
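For reference, a minimal sketch of how the two weighting formulas above behave. The resolution values and complex names below are made up for illustration, and I am assuming max_resolution is the largest resolution in the dataset:

```python
# Hypothetical resolutions keyed by complex name (illustrative values only).
resolution = {"complex_a": 1.0, "complex_b": 2.0, "complex_c": 4.0}

# Assumption: max_resolution is the largest (worst) resolution in the dataset.
max_resolution = max(resolution.values())

def linear_weight(complex_name):
    # Scheme 1: the weight drops linearly as the resolution value grows.
    return (max_resolution + 1) - resolution[complex_name]

def ratio_weight(complex_name):
    # Scheme 2: the weight is inversely proportional to the resolution value.
    return max_resolution / resolution[complex_name]

linear = {name: linear_weight(name) for name in resolution}
ratio = {name: ratio_weight(name) for name in resolution}

print(linear)  # {'complex_a': 4.0, 'complex_b': 3.0, 'complex_c': 1.0}
print(ratio)   # {'complex_a': 4.0, 'complex_b': 2.0, 'complex_c': 1.0}
```

Both schemes give the worst-resolved complex a weight of 1, but they spread the remaining weights differently, which would explain why they lead to different dataset sizes after duplication.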
In the recent commits, however, I am not doing oversampling. I use the fit(X, y, sample_weight) function instead, as it already accepts per-sample weights. The validation error improved when I repeated the data points. I realized, however, that if I repeat the data points the trees may not be completely independent (uncorrelated), hence it will affect generalization. Please check:
https://github.com/abduskhazi/MSc-Project/blob/eb9d7425802e21bacf885f9f0d54c89f745db920/model/random_forest_regresser.py#L53-L54
You can uncomment this to get a higher validation accuracy.
I want to discuss the following question with Alireza tomorrow: if I just use the sample weights in my fit function, there is not much difference in the validation error. Why?
Regards,
Execution ID = 2850170738
When weight = max_resolution/resolution[complex_name], we get ~39,000 data points to train on after the duplication.
Using the hyperparameters n_estimators=400; max_features=0.2; min_samples_leaf=2, the random forest regressor gives (provided you duplicate the data)
The OOB score is erroneously high here because of the repeated data points: many points have copies both inside the selected bag and outside it, so the out-of-bag samples are not truly unseen by the tree. Hence, we should not use this measure if we are duplicating data points.
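To make the leakage concrete, here is a small simulation sketch. It is a simplification and not how scikit-learn computes the OOB score: every point gets the same number of copies (in the real data the count depends on the weight), and only a single bootstrap sample is drawn. The function name and parameters are hypothetical.

```python
import random

def oob_leak_fraction(n_unique, copies, seed=0):
    """Fraction of out-of-bag rows whose underlying point also has a copy in the bag."""
    rng = random.Random(seed)
    # Row i is a copy of unique point i // copies.
    n_rows = n_unique * copies
    # One bootstrap sample: draw n_rows row indices with replacement.
    in_bag_rows = {rng.randrange(n_rows) for _ in range(n_rows)}
    in_bag_points = {row // copies for row in in_bag_rows}
    # Rows never drawn are out-of-bag for this tree.
    oob_rows = [row for row in range(n_rows) if row not in in_bag_rows]
    # An OOB row "leaks" if another copy of the same point was drawn into the bag.
    leaked = sum(1 for row in oob_rows if row // copies in in_bag_points)
    return leaked / len(oob_rows)

no_dup = oob_leak_fraction(n_unique=2000, copies=1)
with_dup = oob_leak_fraction(n_unique=2000, copies=3)
print(no_dup, with_dup)
```

Without duplication no OOB row can leak, since a row that was not drawn has no other copy. With three copies per point, most OOB rows have a twin inside the bag, so the tree has effectively already seen them.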
Weight distribution according to the function data_weights = max_resolution/resolution[complex_name]
I normalized the weights before training our random forest model: https://github.com/abduskhazi/MSc-Project/blob/24436059aff2143cfaa7f8f2bc3d89d4617d8ff8/model/random_forest_regresser.py#L90-L91 The R^2 score and OOB score did not show any improvement (Execution ID = 844526751):
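One thing worth keeping in mind here, shown as a sketch: rescaling all weights by a constant leaves any weighted average unchanged, and therefore a weighted R^2 as well. I am assuming the normalization divides by the sum of the weights (the linked code may do something different), and the targets, predictions, and weights below are invented for illustration.

```python
def normalize(weights):
    # Assumed normalization: scale the weights so they sum to 1.
    total = sum(weights)
    return [w / total for w in weights]

def weighted_r2(y_true, y_pred, weights):
    # Weighted R^2 = 1 - (weighted residual sum of squares) / (weighted total sum of squares).
    mean = sum(w * y for w, y in zip(weights, y_true)) / sum(weights)
    ss_res = sum(w * (y - p) ** 2 for w, y, p in zip(weights, y_true, y_pred))
    ss_tot = sum(w * (y - mean) ** 2 for w, y in zip(weights, y_true))
    return 1 - ss_res / ss_tot

# Hypothetical targets, predictions, and raw weights.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]
raw_weights = [4.0, 2.0, 1.0, 1.0]
norm_weights = normalize(raw_weights)

r2_raw = weighted_r2(y_true, y_pred, raw_weights)
r2_norm = weighted_r2(y_true, y_pred, norm_weights)
print(r2_raw, r2_norm)  # the two values agree up to floating-point error
```

This only illustrates a property of the metric. It does not by itself explain what the forest does with the weights during training.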
According to a PDB article, the smaller the resolution, the more accurate the measurement. The measured resolution is given in the PDB Databank INDEX file. I duplicated the data points in inverse proportion to their respective resolution. Now the R2 score seems to have improved by a small amount.
We get ~62,000 data points to train after the duplication
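A minimal sketch of duplication by weight, for anyone following along. The rounding rule (round the weight to the nearest integer, keep at least one copy) is my assumption for illustration and may differ from what the repository does. The names and resolution values are hypothetical.

```python
def duplicate_by_weight(rows, weights):
    """Repeat each row round(weight) times, keeping at least one copy."""
    out = []
    for row, w in zip(rows, weights):
        copies = max(1, round(w))  # assumed rounding rule
        out.extend([row] * copies)
    return out

# Hypothetical complexes with made-up resolutions.
names = ["complex_a", "complex_b", "complex_c"]
resolutions = [1.0, 2.0, 4.0]
max_res = max(resolutions)

# Weight inversely proportional to resolution.
dup_weights = [max_res / r for r in resolutions]  # [4.0, 2.0, 1.0]

oversampled = duplicate_by_weight(names, dup_weights)
print(len(oversampled))  # 7
```

The best-resolved complex ends up with four copies and the worst-resolved one with a single copy, so the training set grows from 3 to 7 rows in this toy case.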