PatWalters / benchmark_map4

Benchmarking the MAP4 fingerprint in regression models

MIT License

add/replace baseline regressor by a RFR #4

Closed UnixJunkie closed 1 month ago

UnixJunkie commented 1 month ago

XGBoost might be good for building a competition-winning model, but it is way too slow (at least for my taste) to serve as a baseline regressor.

I might contribute a random forest regressor to replace it.

UnixJunkie commented 1 month ago

This issue could be assigned to me.

UnixJunkie commented 1 month ago

Apparently, training the XGB models is something like 33% faster than training RFR models with 50 trees, so I am unsure about this one.

UnixJunkie commented 1 month ago

The RFR models are a tiny bit better, though, and I did not even optimize mtry (the fraction of features considered at each split).


# RFR 50 trees on 24 cores ----------------------

#FP   avgR2  stdR2  medR2
MAP4  0.58   0.11   0.58
ECFP  0.63   0.12   0.64
UCAP  0.64   0.11   0.64

# wallclock time: 29min10s

# XGB on 24 cores -------------------------------

#FP   avgR2  stdR2  medR2
MAP4  0.55   0.13   0.55
ECFP  0.61   0.12   0.61
UCAP  0.62   0.11   0.62

# wallclock time: 19min
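
To put the two wall-clock times above on a common scale, here is a small sketch of the arithmetic. It uses only the two timings reported above; the variable names are illustrative.

```python
# Wall-clock times reported above, converted to seconds.
rfr_seconds = 29 * 60 + 10   # RFR, 50 trees: 29min10s
xgb_seconds = 19 * 60        # XGB: 19min

# Fraction of wall-clock time saved by XGB relative to RFR.
xgb_time_saved = 1 - xgb_seconds / rfr_seconds

# Extra wall-clock time needed by RFR relative to XGB.
rfr_extra_time = rfr_seconds / xgb_seconds - 1

print(f"XGB needs {xgb_time_saved:.0%} less time than RFR")
print(f"RFR needs {rfr_extra_time:.0%} more time than XGB")
```

So XGB needs roughly a third less wall-clock time, which is the same as saying RFR needs roughly half as much time again. Which figure to quote depends on which model is taken as the reference.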

UnixJunkie commented 1 month ago

Apparently, mtry=0.1 is optimal for this dataset. This might make training the RFR models significantly faster, since fewer features are evaluated at each split.

#FP   mtry   avgR2  stdR2  medR2
MAP4  0.100  0.59   0.11   0.59
ECFP  0.100  0.65   0.11   0.65
UCAP  0.100  0.65   0.10   0.66

UnixJunkie commented 1 month ago

Other mtry values might give very similar performance, though. (mtry is called max_features in sklearn's RFR implementation.)
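
For readers who want to try such a scan themselves, here is a minimal sketch of scanning mtry through sklearn's max_features parameter. It is not the code used for the numbers in this thread: the data is synthetic (make_regression stands in for a fingerprint matrix), and the function name scan_mtry and the three mtry values are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 samples, 50 features.
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=10.0, random_state=0)

def scan_mtry(X, y, mtry_values, n_trees=50, n_folds=5):
    """Return {mtry: (mean R2, std R2)} from k-fold cross-validation."""
    results = {}
    for mtry in mtry_values:
        model = RandomForestRegressor(
            n_estimators=n_trees,
            max_features=mtry,   # a float is read as a fraction of the features
            random_state=0,
        )
        scores = cross_val_score(model, X, y, cv=n_folds, scoring="r2")
        results[mtry] = (scores.mean(), scores.std())
    return results

results = scan_mtry(X, y, mtry_values=[0.1, 0.3, 1.0])
for mtry, (avg_r2, std_r2) in results.items():
    print(f"mtry={mtry:.3f} avgR2={avg_r2:.2f} stdR2={std_r2:.2f}")
```

On real fingerprints the ranking of mtry values may differ from what this toy data shows; the point is only that max_features accepts a fraction and can be scanned like any other hyperparameter.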

UnixJunkie commented 1 month ago

Apparently, 12 min for RFR with 50 trees, mtry=0.1 and 5-fold CV.

UnixJunkie commented 1 month ago

To reach optimal parallelization performance, the algorithm should be completely different:

- train each model on each fold of each protein target independently
- train all of those in parallel, using the maximum number of cores available

Currently, sklearn's parallelization is only over the trees of a single RFR model. This is quite inefficient, as shown by the htop CPU graph.
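
The scheme above can be sketched as follows. This is a minimal illustration under stated assumptions, not the setup used in this thread: it uses Python's concurrent.futures instead of the PAR tool mentioned later, load_target is a hypothetical loader that returns synthetic data, and the number of targets is arbitrary. Each task is one (target, fold) pair, and each forest is pinned to a single core with n_jobs=1 so that the pool, not sklearn, decides how cores are used.

```python
import os
from concurrent.futures import ProcessPoolExecutor

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

N_FOLDS = 5

def load_target(target_id):
    """Hypothetical loader: synthetic data stands in for one protein target."""
    return make_regression(n_samples=200, n_features=50, noise=10.0,
                           random_state=target_id)

def run_task(task):
    """Train and score one (target, fold) pair on a single core."""
    target_id, fold = task
    X, y = load_target(target_id)
    splits = list(KFold(n_splits=N_FOLDS, shuffle=True, random_state=0).split(X))
    train_idx, test_idx = splits[fold]
    model = RandomForestRegressor(n_estimators=50, max_features=0.1,
                                  n_jobs=1, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    return target_id, fold, r2_score(y[test_idx], model.predict(X[test_idx]))

def run_all(executor, target_ids):
    """Dispatch every (target, fold) task to the given executor."""
    tasks = [(t, f) for t in target_ids for f in range(N_FOLDS)]
    return list(executor.map(run_task, tasks))

if __name__ == "__main__":
    # One worker process per core; tasks are handed out as workers free up.
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        for target_id, fold, r2 in run_all(pool, target_ids=range(4)):
            print(f"target={target_id} fold={fold} R2={r2:.2f}")
```

Because every task is independent, the cores stay busy as long as there are more tasks than cores, whereas parallelizing over the 50 trees of one model at a time leaves cores idle between models.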

UnixJunkie commented 1 month ago

With the proper parallelization algorithm, running 5 folds per protein target and keeping only 20 protein targets, I get a runtime of less than a minute on a 12-core laptop. The model is a 50-tree RFR with mtry=0.1.

I use this software to parallelize all experiments: https://academic.oup.com/bioinformatics/article/26/22/2918/227811

It has a git mirror here: https://github.com/UnixJunkie/PAR