Closed UnixJunkie closed 1 month ago
This issue could be assigned to me.
Apparently, training XGB models is roughly 33% faster than training RFR models with 50 trees, so I am unsure about this one.
The RFR models are slightly better though, and I did not even optimize mtry (the fraction of features considered at each split; max_features in sklearn).
# RFR 50 trees on 24 cores ----------------------
# FP   avgR2  stdR2  medR2
MAP4   0.58   0.11   0.58
ECFP   0.63   0.12   0.64
UCAP   0.64   0.11   0.64
# wallclock time: 29min10s
# XGB on 24 cores -------------------------------
# FP   avgR2  stdR2  medR2
MAP4   0.55   0.13   0.55
ECFP   0.61   0.12   0.61
UCAP   0.62   0.11   0.62
# wallclock time: 19min
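The comparison above can be sketched as follows. This is a minimal, hypothetical setup: it uses synthetic data in place of the MAP4/ECFP/UCAP fingerprint matrices, and only the RFR side is run (swapping in `xgboost.XGBRegressor` would give the XGB side).

```python
# Sketch of the RFR baseline timing above (hypothetical data; the real
# runs use molecular fingerprints and activity values).
import time

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Stand-in for one fingerprint matrix and its regression targets.
X, y = make_regression(n_samples=500, n_features=256, noise=5.0,
                       random_state=0)

# 50-tree forest, parallelized over all available cores.
rfr = RandomForestRegressor(n_estimators=50, n_jobs=-1, random_state=0)

t0 = time.time()
r2_scores = cross_val_score(rfr, X, y, cv=5, scoring="r2")
print("RFR avgR2=%.2f stdR2=%.2f medR2=%.2f (%.1fs)"
      % (r2_scores.mean(), r2_scores.std(), np.median(r2_scores),
         time.time() - t0))
# xgboost.XGBRegressor(n_estimators=50) would be the drop-in XGB comparison.
```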
Apparently, mtry=0.1 is optimal for this dataset. This might make training the RFR models significantly faster.
# FP   mtry   avgR2  stdR2  medR2
MAP4   0.100  0.59   0.11   0.59
ECFP   0.100  0.65   0.11   0.65
UCAP   0.100  0.65   0.10   0.66
Other mtry values might give very similar performance, though (mtry is called max_features in sklearn's RFR implementation).
Apparently, it takes 12 min for RFR with 50 trees, mtry=0.1, and 5x CV.
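The mtry scan can be sketched like this; sklearn's `max_features` accepts a float fraction directly. Data and mtry grid are hypothetical stand-ins for the real fingerprint runs.

```python
# Minimal sketch of an mtry (max_features) scan on synthetic data;
# the real experiments use fingerprint features and 5x CV per target.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=200, noise=5.0,
                       random_state=0)

for mtry in (0.05, 0.1, 0.25, 0.5, 1.0):
    # max_features as a float = fraction of features tried at each split.
    rfr = RandomForestRegressor(n_estimators=50, max_features=mtry,
                                n_jobs=-1, random_state=0)
    r2 = cross_val_score(rfr, X, y, cv=5, scoring="r2")
    print("mtry=%.2f avgR2=%.2f stdR2=%.2f" % (mtry, r2.mean(), r2.std()))
```

Smaller mtry values also train faster, since fewer features are inspected per split.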
To reach optimal parallel performance, the parallelization scheme should be completely different:
With the proper parallelization scheme, running the 5 folds per protein target in parallel, and having kept only 20 protein targets, I get a runtime of under one minute on a 12-core laptop. The model is a 50-tree RFR with mtry=0.1.
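A rough sketch of that scheme, using joblib (a sklearn dependency) instead of PAR: each (target, fold) pair becomes one independent single-threaded job, rather than parallelizing inside each forest. Target names and the data-loading helper are hypothetical.

```python
# Fold-level parallelization sketch: one job per (target, fold) pair.
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

def load_target(name, seed):
    # Placeholder for reading one protein target's fingerprints/activities.
    return make_regression(n_samples=200, n_features=100, noise=5.0,
                           random_state=seed)

def one_fold(X, y, train_idx, test_idx):
    # Single-threaded forest: parallelism lives at the job level instead.
    rfr = RandomForestRegressor(n_estimators=50, max_features=0.1,
                                n_jobs=1, random_state=0)
    rfr.fit(X[train_idx], y[train_idx])
    return r2_score(y[test_idx], rfr.predict(X[test_idx]))

targets = ["T%02d" % i for i in range(5)]  # e.g. 20 targets in the real run
jobs = []
for seed, name in enumerate(targets):
    X, y = load_target(name, seed)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for train_idx, test_idx in cv.split(X):
        jobs.append(delayed(one_fold)(X, y, train_idx, test_idx))

r2s = Parallel(n_jobs=-1)(jobs)  # all (target, fold) jobs run concurrently
print("jobs=%d medR2=%.2f" % (len(r2s), np.median(np.asarray(r2s))))
```

Since each job is independent and roughly equal in cost, this keeps all cores busy for the whole run, unlike per-forest parallelism which stalls between folds.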
I use this software to parallelize all experiments: https://academic.oup.com/bioinformatics/article/26/22/2918/227811
It has a git mirror here: https://github.com/UnixJunkie/PAR
XGBoost might give you a competition-winning model, but it's way too slow (at least for my taste) for a baseline regressor.
I might contribute a random forest regressor to replace it.