automl / Auto-PyTorch

Automatic architecture search and hyperparameter optimization for PyTorch
Apache License 2.0

[memo] High memory consumption and the places of doubts #180

Open nabenabe0928 opened 3 years ago

nabenabe0928 commented 3 years ago

I am writing down the current memory usage as a memo, in case we encounter memory-leak issues in the future. This post is based on the current implementation.

When we run a dataset with a size of 300 B, Auto-PyTorch consumes ~1.5 GB, and the following are the major sources of the memory consumption:

| Source | Consumption [GB] |
|---|---|
| Import modules | 0.35 |
| Dask Client | 0.35 |
| Logger (thread safe) | 0.4 |
| Running of `context.Process` in `multiprocessing` module | 0.4 |
| Model | 0 ~ inf |
| **Total** | 1.5 ~ inf |
When we run a dataset with a size of 300 MB (400,000 instances x 80 features), such as Albert, Auto-PyTorch consumes ~2.5 GB, and the following are the major sources of the memory consumption:

| Source | Consumption [GB] |
|---|---|
| Import modules | 0.35 |
| Dask Client | 0.35 |
| Logger (thread safe) | 0.4 |
| Dataset itself | 0.3 |
| `self.categories` in `InputValidator` | 0.3 |
| Running of `context.Process` in `multiprocessing` module | 0.4 |
| Model (e.g. LightGBM) | 0.4 ~ inf |
| **Total** | 2.5 ~ inf |

All the information was obtained by:

$ mprof run --include-children python -m examples.tabular.20_basics.example_tabular_classification

and the logger which I set up for debugging. Note that I also added `time.sleep(0.5)` before and after each line of interest to rule out influence from other elements, and checked each line in detail.
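The per-line isolation described above can be sketched with the standard library's `tracemalloc` (a hedged stand-in for the `mprof` + logger setup actually used; `run_step` below is a hypothetical placeholder for a single line of interest):

```python
import time
import tracemalloc

def run_step():
    # Hypothetical stand-in for one line of interest
    # (e.g. creating the Dask client or building a model).
    return [0] * 1_000_000

tracemalloc.start()

time.sleep(0.5)  # quiet period before the line, so nearby work does not bleed in
before, _ = tracemalloc.get_traced_memory()

data = run_step()  # the line being measured

after, peak = tracemalloc.get_traced_memory()
time.sleep(0.5)  # quiet period after, mirroring the procedure in the post

print(f"step allocated ~{(after - before) / 1e6:.1f} MB (peak {peak / 1e6:.1f} MB)")
tracemalloc.stop()
```

Unlike `mprof`, `tracemalloc` only sees Python-level allocations, so it misses memory held by C extensions, but it is enough to attribute an allocation to one specific line.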

ArlindKadra commented 3 years ago

Interesting :) I think the analysis should also be extended to the following datasets in the future:

https://archive.ics.uci.edu/ml/datasets/covertype
https://archive.ics.uci.edu/ml/datasets/HIGGS
https://archive.ics.uci.edu/ml/datasets/Poker+Hand

They proved tricky.

nabenabe0928 commented 3 years ago

FYI, when we use Optuna with a tiny model, it consumes only around 150 MB. This module is also thread safe.

```python
import optuna

def objective(trial):
    # Minimize a simple quadratic in two variables.
    # (suggest_uniform was the current API at the time of writing.)
    x0 = trial.suggest_uniform('x0', -10, 10)
    x1 = trial.suggest_uniform('x1', -10, 10)
    return x0 ** 2 + x1 ** 2

if __name__ == '__main__':
    study = optuna.create_study()
    study.optimize(objective, n_trials=5000, n_jobs=4)
```
nabenabe0928 commented 3 years ago
I tested the memory usage for the following datasets:

| Dataset name | # of features | # of instances | Approx. data size [MB] |
|---|---|---|---|
| Covertype | 55 | 581,012 | 60 ~ 240 |
| Higgs | 29 | 98,050 | 5 ~ 20 |
| Poker-hand | 11 | 1,025,009 | 22 ~ 90 |

The details of the memory usage are as follows:

| Source | Covertype [GB] | Higgs [GB] | Poker-hand [GB] |
|---|---|---|---|
| Import modules | 0.35 | 0.35 | 0.35 |
| Dask Client | 0.35 | 0.35 | 0.35 |
| Logger (thread safe) | 0.35 | 0.35 | 0.35 |
| Dataset itself | 0.1 | 0.05 | 0.1 |
| `self.categories` in `InputValidator` | 0 | 0 | 0.02 |
| Running of `context.Process` in `multiprocessing` module | 0.4 | 0.4 | 0.4 |
| LightGBM | 0.6 | 0.1 | 0.3 |
| CatBoost | 0.8 | 0.1 | 0.6 |
| Random Forest | 1.2 | 0.5 | 1.0 |
| Extra Trees | 1.2 | 0.2 | 1.1 |
| SVM | 0.9 | 0.2 | 0.6 |
| KNN | 0.8 | - | 0.4 |
| **Total** | 2.0 ~ | 1.5 ~ | 1.7 ~ |

Note that KNN failed on Higgs, and some trainings for each dataset were canceled because of out-of-memory errors. This time I set `memory_limit = 4096`, but I somehow got out-of-memory errors at lower usage, around 2.5 ~ 3.0 GB. It is probably better to check whether this also happens on the latest branch.
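For context, `memory_limit` in Auto-PyTorch is enforced per worker process via pynisher, which (to my understanding) ultimately relies on `resource.setrlimit`. A minimal sketch of that underlying mechanism, using the 4096 MB value from the experiment above (the helper name is mine, not an Auto-PyTorch API):

```python
import resource

MEMORY_LIMIT_MB = 4096  # value used in the experiment above

def apply_memory_limit(limit_mb: int) -> None:
    # Cap the process address space; allocations beyond the soft limit
    # raise MemoryError instead of exhausting the machine.
    limit_bytes = limit_mb * 1024 * 1024
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)  # cannot exceed the hard limit
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))

apply_memory_limit(MEMORY_LIMIT_MB)
soft, _ = resource.getrlimit(resource.RLIMIT_AS)
print(soft)
```

Because the limit covers the whole address space (including imports, the logger, and `multiprocessing` overhead tallied above), a model can hit the cap well before its own allocations reach 4 GB, which may explain failures at an apparent 2.5 ~ 3.0 GB.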

nabenabe0928 commented 3 years ago

This is from #259 by @franchuterivera.

nabenabe0928 commented 2 years ago

Check if we can use a generator instead of `np.ndarray`.
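The idea can be illustrated in pure Python (the real candidates in Auto-PyTorch are NumPy arrays, but the memory argument is the same): a materialized container holds every element at once, while a generator keeps a constant footprint and produces values on demand.

```python
import sys

N = 1_000_000

# Materialized container: all N elements resident simultaneously.
materialized = [float(i) for i in range(N)]

# Generator: yields one element at a time, constant footprint.
def stream(n):
    for i in range(n):
        yield float(i)

list_size = sys.getsizeof(materialized)  # ~8 MB of pointers alone
gen_size = sys.getsizeof(stream(N))      # a few hundred bytes

# Both produce the same values when consumed.
assert sum(stream(N)) == sum(materialized)

print(f"list: {list_size:,} B, generator: {gen_size:,} B")
```

The trade-off is that a generator is single-pass and unindexable, so it only helps in code paths that consume the data sequentially (e.g. batch-wise training), not ones that need random access.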