Currently has 28 configurable parameters (many of which are only used conditionally, depending on others)
To Do:
[x] Use ColumnTransformer to decide on preprocessing per-column (feature)
Each feature is marked as 'none', 'standardscaler', 'minmaxscaler' (plus 2x continuous bounds (?)), 'normalizer', or 'PCA'
Note: Normalizer and PCA should each use a single instance! I.e., in the ColumnTransformer, each should receive all column indices marked with PCA or Normalizer, rather than a new instance being created per feature.
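The grouping described above could be sketched roughly as follows. The `per_feature` mapping is a hypothetical stand-in for whatever the configuration space produces; the point is that indices are grouped per choice, so PCA and Normalizer each get one instance spanning all their columns:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical per-feature choices, e.g. sampled from the configuration space.
per_feature = {0: "standardscaler", 1: "pca", 2: "pca", 3: "none", 4: "minmaxscaler"}

# Group column indices by choice, so shared transformers (PCA, Normalizer)
# are instantiated once with all their columns instead of once per feature.
groups = {}
for idx, choice in per_feature.items():
    groups.setdefault(choice, []).append(idx)

factories = {
    "standardscaler": lambda cols: StandardScaler(),
    "minmaxscaler": lambda cols: MinMaxScaler(),
    "normalizer": lambda cols: Normalizer(),
    "pca": lambda cols: PCA(n_components=len(cols)),
}
transformers = [
    (choice, "passthrough" if choice == "none" else factories[choice](cols), cols)
    for choice, cols in groups.items()
]

ct = ColumnTransformer(transformers)
X = np.random.RandomState(0).rand(10, 5)
Xt = ct.fit_transform(X)
```

Here `n_components=len(cols)` keeps the output width equal to the input width; an actual configuration would likely expose that as another parameter.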
[x] Set a time limit on how long an evaluation is allowed to take.
Some configurations take more time to train and test, often in exchange for a better score; a time limit makes that trade-off less trivial.
[x] Figure out a good limit (e.g., 2x the default configuration's runtime)
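Deriving the limit from the default configuration, as suggested above, could look like the sketch below. `evaluation_budget` is a hypothetical helper, not part of the existing code; enforcing the resulting budget would still need a separate mechanism (e.g. running the evaluation in a subprocess with a timeout):

```python
import time

def evaluation_budget(evaluate_default, factor=2.0):
    """Time one evaluation of the default configuration and return
    factor * its runtime as the per-evaluation budget (in seconds)."""
    start = time.perf_counter()
    evaluate_default()
    return factor * (time.perf_counter() - start)
```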
[x] Check whether the bounds for the variables are reasonable
[x] Specifically: quite a few variables have a range of [0, ∞) but are currently bounded to [0, 10]. Should they use a log-scaled distribution (e.g., lognormal) instead?
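For comparison, a log-uniform prior spreads samples evenly across orders of magnitude, which tends to suit [0, ∞) hyperparameters better than a uniform [0, 10]. The bounds below are illustrative only, not taken from the actual configuration:

```python
from scipy.stats import loguniform

# Illustrative bounds: samples are spread evenly in log-space over
# six orders of magnitude rather than clustered near the upper bound.
dist = loguniform(1e-4, 1e2)
samples = dist.rvs(size=1000, random_state=0)
```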
[x] Add to run_experiment.py
[x] Test whether it runs properly without exceptions
[x] Make sure XGBoost can properly decide on the class when using a binary classification metric
Does it automatically apply one-vs-rest or one-vs-one? Or do we need to make use of the wrappers in sklearn.multiclass?
Note: the latter would again add more parameters / give existing parameters more values.
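If the wrappers turn out to be needed, usage is minimal; the sketch below uses LogisticRegression as a dependency-free stand-in for XGBClassifier. OneVsRestClassifier fits one binary problem per class, so any binary classification metric then applies per class:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# One binary estimator is fitted per class; swap in XGBClassifier here
# if the wrapper is required for binary metrics on multiclass data.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X, y)
```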