nabenabe0928 opened this issue 3 years ago
Interesting :) I think the analysis should also be extended to the following datasets in the future:
https://archive.ics.uci.edu/ml/datasets/covertype
https://archive.ics.uci.edu/ml/datasets/HIGGS
https://archive.ics.uci.edu/ml/datasets/Poker+Hand
They proved tricky.
FYI, when we use Optuna with a tiny model, it consumes only around 150 MB. The module is also thread safe:
```python
import optuna


def objective(trial):
    x0 = trial.suggest_uniform('x0', -10, 10)
    x1 = trial.suggest_uniform('x1', -10, 10)
    return x0 ** 2 + x1 ** 2


if __name__ == '__main__':
    study = optuna.create_study()
    # n_jobs=4 runs the trials in four threads sharing the same study.
    study.optimize(objective, n_trials=5000, n_jobs=4)
```
I tested the memory usage for the following datasets:

| Dataset name | # of features | # of instances | Approx. data size [MB] |
|---|---|---|---|
| Covertype | 55 | 581012 | 60 ~ 240 |
| Higgs | 29 | 98050 | 5 ~ 20 |
| Poker-hands | 11 | 1025009 | 22 ~ 90 |
The details of the memory usage are as follows:

| Source | Consumption in covertype [GB] | Consumption in higgs [GB] | Consumption in poker-hand [GB] |
|---|---|---|---|
| Import modules | 0.35 | 0.35 | 0.35 |
| Dask Client | 0.35 | 0.35 | 0.35 |
| Logger (thread safe) | 0.35 | 0.35 | 0.35 |
| Dataset itself | 0.1 | 0.05 | 0.1 |
| `self.categories` in `InputValidator` | 0 | 0 | 0.02 |
| Running `context.Process` in the multiprocessing module | 0.4 | 0.4 | 0.4 |
| LightGBM | 0.6 | 0.1 | 0.3 |
| CatBoost | 0.8 | 0.1 | 0.6 |
| Random Forest | 1.2 | 0.5 | 1.0 |
| Extra Trees | 1.2 | 0.2 | 1.1 |
| SVM | 0.9 | 0.2 | 0.6 |
| KNN | 0.8 | - | 0.4 |
| Total | 2.0 ~ | 1.5 ~ | 1.7 ~ |
Note that KNN failed on Higgs, and some training runs for each dataset were canceled because of out-of-memory errors. This time I set `memory_limit = 4096`, but I somehow got out-of-memory errors at lower values such as 2.5 ~ 3.0 GB. A sketch of how such a memory cap behaves is shown below.
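For context, I believe `memory_limit` is enforced by capping the address space of the worker process (pynisher does this in similar AutoML code, but that detail is my assumption here). A minimal, Unix-only sketch of how such a cap behaves, using the standard-library `resource` module:

```python
import resource

import numpy as np

# Cap this process's virtual address space at 4096 MB, mirroring the
# memory_limit = 4096 setting above (Unix only).
limit_bytes = 4096 * 1024 * 1024
resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))

try:
    # ~5 GB of float64s: allocating past the cap raises MemoryError
    # instead of taking the whole machine down.
    too_big = np.ones(5 * 1024 ** 3 // 8)
except MemoryError:
    print("hit the 4096 MB cap")
```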
It is probably better to check whether this also works well on the latest branch.
This is from #259 by @franchuterivera:

```python
self.Y_optimization[test_indices] = opt_pred
```

This way the predictions are sorted, so they can be used directly by ensemble selection without the need to save this array separately. A minimal sketch of the indexing pattern follows.
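The sketch below only illustrates the pattern; the data, model, and fold setup are invented for the example, and only the `Y_optimization[test_indices] = opt_pred` line matches the quoted code:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data; in the real code these come from the AutoML pipeline.
X = np.random.rand(100, 5)
y = np.random.randint(0, 2, size=100)

# Preallocate one slot per sample. Each fold writes its out-of-fold
# predictions at test_indices, so the finished array is ordered like y
# and can be handed to ensemble selection as-is.
Y_optimization = np.full(len(y), np.nan)
for train_indices, test_indices in KFold(n_splits=5).split(X):
    opt_pred = np.zeros(len(test_indices))  # stand-in for model predictions
    Y_optimization[test_indices] = opt_pred

assert not np.isnan(Y_optimization).any()  # every sample was filled exactly once
```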
`import torch` consumes 2 GB of peak virtual memory, and the majority of the time this happens only for mypy typing. We should encapsulate these calls under `typing.TYPE_CHECKING` and import only the strictly needed classes from PyTorch; a sketch of the pattern is shown below.

We should also check whether we can use a generator instead of `np.ndarray`; see the second sketch below.
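A minimal sketch of the `TYPE_CHECKING` pattern (the `Module` import and the `count_parameters` function are invented examples; the guard itself is the point):

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by mypy only; this branch never runs, so the ~2 GB peak
    # memory of importing torch is avoided at runtime.
    from torch.nn import Module


def count_parameters(model: "Module") -> int:
    # The annotation is a string, so torch is not needed to call this.
    return sum(p.numel() for p in model.parameters())
```

And a sketch of the generator idea: yielding fold predictions lazily instead of stacking them into one `np.ndarray` keeps only a single fold in memory at a time (the function names are invented for illustration):

```python
import numpy as np


def fold_predictions_eager(n_folds: int, n_samples: int) -> np.ndarray:
    # Materializes all folds at once: n_folds * n_samples floats in memory.
    return np.stack([np.zeros(n_samples) for _ in range(n_folds)])


def fold_predictions_lazy(n_folds: int, n_samples: int):
    # Yields one fold at a time: only n_samples floats are in memory
    # while the consumer iterates.
    for _ in range(n_folds):
        yield np.zeros(n_samples)


total = sum(pred.sum() for pred in fold_predictions_lazy(5, 1_000_000))
```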
I am writing down the current memory usage as a memo, in case we encounter memory leak issues in the future. This post is based on the current implementation.
All the information was obtained with the logger which I set up for debugging. Note that I also added `time.sleep(0.5)` before and after each line of interest, to eliminate possible influence from other elements, and checked each line in detail.
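The exact measurement code is not shown here, so as a stand-in, here is a minimal sketch of this kind of bracketed, line-by-line measurement, assuming `psutil` for reading the resident set size:

```python
import time

import psutil


def rss_gb() -> float:
    """Resident set size of the current process in GB."""
    return psutil.Process().memory_info().rss / 1024 ** 3

# Sleep before and after the line of interest, as described above, so
# neighbouring work does not blur the reading.
time.sleep(0.5)
before = rss_gb()

data = [0] * 10_000_000  # the line whose cost we want to isolate

time.sleep(0.5)
after = rss_gb()
print(f"delta: {after - before:.2f} GB")
```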