Closed shihgianlee closed 2 years ago
It turned out that I didn't assign enough memory to it; the ml_memory_limit does work. However, even after I increased the memory limit to more than 600 GB, I still could not get very far before hitting the error again.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    per_run_time_limit=1900,
    ml_memory_limit=600*1024,
    ensemble_size=1,
    ensemble_memory_limit=7*1024,
    initial_configurations_via_metalearning=0,
    include_preprocessors=["no_preprocessing"],
    tmp_folder='./tmp/',
    output_folder='./out/',
    delete_output_folder_after_terminate=False,
    delete_tmp_folder_after_terminate=False,
)
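For reference (this comes from the auto-sklearn documentation rather than from the thread itself), ml_memory_limit is specified in MB, so the value above works out to roughly 600 GiB:

# ml_memory_limit is given in MB, so 600*1024 MB is about 600 GiB.
ml_memory_limit = 600 * 1024   # 614400 MB
print(ml_memory_limit / 1024)  # 600.0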
I am surprised that auto-sklearn consumes so much memory for 400K rows of data. A single XGBoost instance can finish training quickly on a medium-sized machine. I can see the value of auto-sklearn, but it is discouraging that it requires so much memory for a dataset that is not that large.
I would like to give it another try if someone can point out how I can save memory or whether I am doing something wrong.
Hi @shihgianlee thanks a lot for reporting this issue. I'm really unsure why this happens as 6GB for 400k instances sounds sufficient.
Two steps to move forward:
Also, out of curiosity, how many attributes does your dataset have?
Hi @mfeurer If I remember correctly, I subsampled 5K rows of data and used 10 GB of memory. It didn't throw a memory error but was taking a long time to complete; I gave up waiting after an hour. I only have 5 attributes.
Hello @mfeurer. I am facing the same issue while tuning auto-sklearn on Kaggle. The dataset is only 2.2 GB, about 400k rows as well, but only 4 columns. Locally I have seen sklearn handle bigger datasets with less memory. I don't know if this is a cloud-related issue.
Hi @shihgianlee, @ach4l,
Sorry it's been a while, but to clarify: it seems these issues only happen on cloud-based infrastructure like GCP and Kaggle? Do they also happen locally?
While we don't test on cloud infrastructure beyond unit testing on GitHub Actions, it would be interesting to find out what the root cause of these memory issues is.
Hi @shihgianlee, @ach4l, I faced a very similar issue. Here are the details:
Dataset:
Size: 197.9 MB
Columns: 89
Rows: 501808
Init params:
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=3600,
    per_run_time_limit=360,
    memory_limit=27000
)
df_cv_results
mean_test_score mean_fit_time params rank_test_scores status budgets ... param_regressor:libsvm_svr:gamma param_regressor:mlp:validation_fraction param_regressor:sgd:epsilon param_regressor:sgd:eta0 param_regressor:sgd:l1_ratio param_regressor:sgd:power_t
1 0.001069 206.981234 {'data_preprocessing:categorical_transformer:c... 1 Success 0.0 ... NaN NaN NaN NaN NaN NaN
12 0.000141 21.207443 {'data_preprocessing:categorical_transformer:c... 2 Success 0.0 ... NaN NaN NaN NaN NaN NaN
7 0.000014 12.729235 {'data_preprocessing:categorical_transformer:c... 3 Success 0.0 ... NaN NaN NaN NaN NaN NaN
0 0.000000 360.100346 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
15 0.000000 9.028285 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
25 0.000000 4.793600 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
24 0.000000 6.720857 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
23 0.000000 360.019049 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
22 0.000000 31.379792 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
21 0.000000 16.599984 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN 0.1 NaN NaN NaN NaN
20 0.000000 360.116118 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
19 0.000000 9.361809 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
18 0.000000 5.814345 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
17 0.000000 360.115730 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... 0.032332 NaN NaN NaN NaN NaN
16 0.000000 6.615313 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
14 0.000000 360.080400 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
13 0.000000 360.043842 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
11 0.000000 6.372612 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
9 0.000000 19.444851 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
8 0.000000 8.804391 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
6 0.000000 360.117929 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
5 0.000000 8.347663 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
3 0.000000 5.497032 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
2 0.000000 5.126574 {'data_preprocessing:categorical_transformer:c... 4 Memout 0.0 ... 0.002623 NaN NaN NaN NaN NaN
27 0.000000 217.114209 {'data_preprocessing:categorical_transformer:c... 4 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
4 -0.002910 62.450895 {'data_preprocessing:categorical_transformer:c... 26 Success 0.0 ... NaN NaN NaN NaN NaN NaN
26 -0.007268 20.792192 {'data_preprocessing:categorical_transformer:c... 27 Success 0.0 ... NaN NaN 0.000047 NaN 0.018917 NaN
10 -456.128305 257.670972 {'data_preprocessing:categorical_transformer:c... 28 Success 0.0 ... NaN 0.1 NaN NaN NaN NaN
automl.leaderboard
rank ensemble_weight type cost duration config_id train_loss seed start_time end_time budget status data_preprocessors feature_preprocessors balancing_strategy config_origin
model_id
3 1 0.68 gradient_boosting 0.998931 206.981234 2 0.950685 0 1.631515e+09 1.631515e+09 0.0 StatusType.SUCCESS [one_hot_encoding, no_coalescense, none] [select_rates_regression] None Initial design
14 2 0.32 gradient_boosting 0.999859 21.207443 13 0.999518 0 1.631516e+09 1.631516e+09 0.0 StatusType.SUCCESS [one_hot_encoding, minority_coalescer, robust_... [no_preprocessing] None Initial design
9 3 0.00 gradient_boosting 0.999986 12.729235 8 0.999978 0 1.631516e+09 1.631516e+09 0.0 StatusType.SUCCESS [one_hot_encoding, minority_coalescer, minmax] [select_rates_regression] None Initial design
6 4 0.00 gradient_boosting 1.002910 62.450895 5 0.964131 0 1.631515e+09 1.631515e+09 0.0 StatusType.SUCCESS [no_encoding, no_coalescense, robust_scaler] [feature_agglomeration] None Initial design
28 5 0.00 sgd 1.007268 20.792192 27 1.007495 0 1.631518e+09 1.631518e+09 0.0 StatusType.SUCCESS [one_hot_encoding, no_coalescense, none] [select_rates_regression] None Random Search (sorted)
12 6 0.00 mlp 457.128305 257.670972 11 0.981957 0 1.631516e+09 1.631516e+09 0.0 StatusType.SUCCESS [one_hot_encoding, no_coalescense, standardize] [extra_trees_preproc_for_regression] None Initial design
My system and versions:
This machine runs in a VirtualBox. Host: Windows Guest: Linux
auto-sklearn = "==0.13.0"
python_version = "3.7"
$ uname -a
Linux i 5.4.0-58-generic #64-Ubuntu SMP Wed Dec 9 08:16:25 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 14
On-line CPU(s) list: 0-13
Thread(s) per core: 1
Core(s) per socket: 14
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 158
Model name: Intel(R) Core(TM) i9-9900K CPU @ 3.60GHz
Stepping: 13
CPU MHz: 3600.006
BogoMIPS: 7200.01
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 448 KiB
L1i cache: 448 KiB
L2 cache: 3,5 MiB
L3 cache: 224 MiB
I see lots of MEMOUTs even though memory_limit is 27000. Am I doing something wrong?
I have a Linux laptop as well, so I will try the same run on Linux without any virtualization and post my findings here.
Hi @eddiebergman, @mfeurer I tested this on the other physical machine I have. I ran into the same issue on that Linux machine with no virtualization at all. Could you please take a look and check what I'm doing wrong? I'm also happy to have a call and show the issue if needed.
Otherwise I won't be able to use this library and will have to switch to something else.
Regards, Stefan
Hi @f-istvan,
Sorry for the delay. I can't immediately see anything wrong with your setup, although one general recommendation is to utilize more of your available cores once the memout issues are fixed.
For some context, the fact that so many memouts occur indicates to me a few possible reasons:
Diagnosing those issues can be done if you post the output of df_cv_results['params'], as this essentially contains the high-level model definitions that were tried with SMAC (our underlying optimizer).
Do the same issues appear at smaller timescales? i.e. 600s total time and 60s per model?
If you could provide this extra information, hopefully that will be enough to diagnose the problem.
@f-istvan, did you try setting memory_limit=None?
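For illustration, a minimal sketch of that suggestion, assuming the same regressor settings used earlier in this thread (memory_limit=None is documented as disabling the per-model memory limit):

import autosklearn.regression

# Sketch: same run as above, but with the per-model memory limit disabled.
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=3600,
    per_run_time_limit=360,
    memory_limit=None,  # no memory limit enforced on the model processes
)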
Hi,
Sorry for the late response. First of all, here is a full example with generated training data and its results:
import numpy as np
import pandas as pd
import autosklearn.regression

# Generate a random training matrix: 501808 rows x 89 columns of values
# drawn from a small discrete set.
value_set = [0.0, 0.25, 0.5, 0.75, 1.0]
col = 89
row = 501808
training_data = np.random.choice(value_set, col * row).reshape(row, col)
df = pd.DataFrame(data=training_data)
print(df)

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=3600,
    per_run_time_limit=360,
    memory_limit=27000
)

# Random regression target with the same number of rows.
col = 1
row = 501808
target = np.random.choice(value_set, col * row).reshape(row, col)

print('start fit')
automl.fit(training_data, target, dataset_name='github_issue')
print('end fit')

# Report the results of the search.
df_cv_results = pd.DataFrame(automl.cv_results_).sort_values(by='mean_test_score', ascending=False)
print('df_cv_results')
print(df_cv_results)
print('automl.leaderboard')
print(automl.leaderboard(detailed=True, ensemble_only=False))
print('automl.get_models_with_weights')
print(automl.get_models_with_weights())
print('automl.sprint_statistics')
print(automl.sprint_statistics())
Output:
[print(df) output: 501808 rows x 89 columns of values drawn from {0.0, 0.25, 0.5, 0.75, 1.0}]
start fit
[WARNING] [2021-09-20 21:00:20,620:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 1. Number of dummy models: 1
[WARNING] [2021-09-20 21:06:22,006:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 1. Number of dummy models: 1
[WARNING] [2021-09-20 21:08:59,789:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 2. Number of dummy models: 1
[WARNING] [2021-09-20 21:09:03,401:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 2. Number of dummy models: 1
[WARNING] [2021-09-20 21:15:04,783:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 2. Number of dummy models: 1
[WARNING] [2021-09-20 21:21:06,217:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 2. Number of dummy models: 1
[WARNING] [2021-09-20 21:21:40,158:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 3. Number of dummy models: 1
[WARNING] [2021-09-20 21:27:41,444:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 3. Number of dummy models: 1
[WARNING] [2021-09-20 21:27:44,680:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 4. Number of dummy models: 1
[WARNING] [2021-09-20 21:27:48,724:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 4. Number of dummy models: 1
[WARNING] [2021-09-20 21:32:48,108:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 5. Number of dummy models: 1
[WARNING] [2021-09-20 21:32:55,098:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
[WARNING] [2021-09-20 21:33:17,063:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
[WARNING] [2021-09-20 21:39:18,457:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
[WARNING] [2021-09-20 21:39:22,160:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
[WARNING] [2021-09-20 21:45:23,584:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
[WARNING] [2021-09-20 21:51:25,029:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
[WARNING] [2021-09-20 21:51:28,140:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
[WARNING] [2021-09-20 21:53:47,552:Client-EnsembleBuilder] No models better than random - using Dummy loss!Number of models besides current dummy model: 6. Number of dummy models: 1
end fit
df_cv_results
mean_test_score mean_fit_time params rank_test_scores status budgets ... param_regressor:libsvm_svr:gamma param_regressor:mlp:validation_fraction param_regressor:sgd:epsilon param_regressor:sgd:eta0 param_regressor:sgd:l1_ratio param_regressor:sgd:power_t
0 0.000000 360.106616 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
8 0.000000 360.011791 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
18 0.000000 1.826367 {'data_preprocessing:categorical_transformer:c... 1 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
17 0.000000 360.108457 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... 0.032332 NaN NaN NaN NaN NaN
16 0.000000 360.104102 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
15 0.000000 2.424781 {'data_preprocessing:categorical_transformer:c... 1 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
14 0.000000 360.104379 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
13 0.000000 20.691383 {'data_preprocessing:categorical_transformer:c... 1 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
10 0.000000 2.746277 {'data_preprocessing:categorical_transformer:c... 1 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
6 0.000000 360.105433 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
5 0.000000 360.104351 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
4 0.000000 2.180909 {'data_preprocessing:categorical_transformer:c... 1 Memout 0.0 ... NaN NaN NaN NaN NaN NaN
2 0.000000 360.104890 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... 0.002623 NaN NaN NaN NaN NaN
19 0.000000 138.103932 {'data_preprocessing:categorical_transformer:c... 1 Timeout 0.0 ... NaN NaN NaN NaN NaN NaN
3 -0.000005 157.529032 {'data_preprocessing:categorical_transformer:c... 15 Success 0.0 ... NaN NaN NaN NaN NaN NaN
9 -0.000006 2.959799 {'data_preprocessing:categorical_transformer:c... 16 Success 0.0 ... NaN NaN NaN NaN NaN NaN
12 -0.000013 6.571774 {'data_preprocessing:categorical_transformer:c... 17 Success 0.0 ... NaN NaN NaN NaN NaN NaN
1 -0.000317 12.087915 {'data_preprocessing:categorical_transformer:c... 18 Success 0.0 ... NaN NaN NaN NaN NaN NaN
7 -0.001563 33.671417 {'data_preprocessing:categorical_transformer:c... 19 Success 0.0 ... NaN NaN NaN NaN NaN NaN
11 -0.002651 299.073780 {'data_preprocessing:categorical_transformer:c... 20 Success 0.0 ... NaN 0.1 NaN NaN NaN NaN
[20 rows x 161 columns]
automl.leaderboard
Traceback (most recent call last):
File "app.py", line 34, in <module>
print(automl.leaderboard(detailed = True, ensemble_only=False))
File "/home/i/dev/sources/mytest/.venv/lib/python3.7/site-packages/autosklearn/estimators.py", line 741, in leaderboard
model_runs[model_id]['ensemble_weight'] = weight
KeyError: 1
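As a possible workaround for the leaderboard() error above (my own suggestion, not something proposed in the thread), the remaining inspection calls from the script do not depend on leaderboard() and can still be run:

# Hypothetical workaround: skip leaderboard() and inspect the fitted models
# through the other reporting APIs used in the script above.
print(automl.sprint_statistics())
for weight, model in automl.get_models_with_weights():
    print(weight, model)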
@eddiebergman I tried setting n_jobs=2, 3, 4, up to 8. In all cases I got a Killed message on my console and the program just stopped. Based on this Stack Overflow question, I think this is the same kind of memory issue: https://stackoverflow.com/questions/19189522/what-does-killed-mean-when-a-processing-of-a-huge-csv-with-python-which-sudde
Do the same issues appear at smaller timescales? i.e. 600s total time and 60s per model? -> No, with smaller timescales it finishes successfully. I think a total of 120s was successful once when I was experimenting with this.
Did you try setting memory_limit=None? -> Not yet, I will try that and post the df_cv_results['params'] too.
Thank you so much!
Hmm so let me address this in a few points:
warnings: This is kind of expected, given that the mapping from inputs to outputs is random. This is okay, but perhaps we should just hide those in the log and not display them as big warnings; if the issue persists at the end, then we can give that warning.
memouts: What's interesting is that these memouts occur quite quickly, despite the dataset being quite small. A further note: the memory_limit of 27GB is split between n_jobs, and this should perhaps be made clearer.
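For a rough sense of scale (my own back-of-the-envelope estimate, not from the thread), the generated dataset itself is tiny compared to the 27 GB limit:

# Approximate in-memory size of the 501808 x 89 training array as dense float64.
rows, cols = 501808, 89
size_mib = rows * cols * 8 / 1024**2
print(f"~{size_mib:.0f} MiB")  # roughly 341 MiB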
Anyway, this could be indicative of two things:
We use the temporary directory by default. I had an issue recently where my system storage was fine (200GB+) but the partition that housed /tmp only had 1GB of free space, causing containers not to build properly. To diagnose this, you can use a graphical interface or the command df -H. If your /tmp dir doesn't have 27GB available, then this would explain it, and we should document this behavior more clearly or perform a check before running. If this is not the issue, then we have a memory issue somewhere and we would love to find it.
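A quick way to perform that check from Python (a small sketch, equivalent in spirit to running df -H on the temporary directory):

import shutil
import tempfile

# Report how much free space the default temporary directory has.
tmp_dir = tempfile.gettempdir()
usage = shutil.disk_usage(tmp_dir)
print(f"{tmp_dir}: {usage.free / 1024**3:.1f} GiB free of {usage.total / 1024**3:.1f} GiB")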
Four possible workarounds in the meantime (a sketch of workaround 2 follows the list):
1. Set the $TMPDIR environment variable when launching your script, e.g. TMPDIR=/path/to/custom/temp python myscript.py.
2. Use the tmp_folder and delete_tmp_folder_after_terminate parameters. If you go with this workaround and set delete_tmp_folder_after_terminate=False, then you should be able to inspect what is consuming the most memory. Note: we will likely change this to just a single parameter working_dir for version 0.15.0, as these parameters are often set together.
3. Tune max_models_on_disc, which defaults to 50. While I think 50 models should easily fit in 27GB, it's just another tunable parameter I can point you to that might help the issue.
4. Make sure /tmp has the 27GB of space expected.
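A minimal sketch of workaround 2, assuming the same regressor settings used earlier in this thread (the path is only an example):

import autosklearn.regression

automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=3600,
    per_run_time_limit=360,
    memory_limit=27000,
    tmp_folder='/path/on/a/large/partition/autosklearn_tmp',  # example path with enough free space
    delete_tmp_folder_after_terminate=False,                  # keep the folder so it can be inspected
)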
The second possible issue could be that some model configurations are eating up way more memory than expected. A typical memory-hungry model is a KNN, but as stated, the optimizer should move away from these models once a few failures occur.
To diagnose this, it would be helpful for me to see the csv output of df_cv_results
. In the meantime, if you can see one particular model is causing this issue (filter df_cv_results
by status == memout
) then that's indicative of something going wrong on our side and we would be glad to fix it.
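A short sketch of that filter, assuming df_cv_results was built as in the script above (the 'Memout' status string matches the output shown earlier):

# Keep only the runs that hit the memory limit and show their configurations.
memouts = df_cv_results[df_cv_results['status'] == 'Memout']
print(memouts[['mean_fit_time', 'params']])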
In general it's quite difficult to allocate resources to do runs as long as yours, but it seems like something we should try testing soon. We appreciate your time and effort, and hopefully we can figure this out.
Seeing as there has been no response, we're not sure if this has been solved, so we are closing the issue for now. Feel free to re-open it if anything reoccurs.
Hello,
I am running auto-sklearn on a Google Cloud machine in Jupyter. I keep getting the following out-of-memory error no matter how much memory I assign to ml_memory_limit. The following is the error message I am getting:
The following is my initialization code:
The X_train has 400K rows with 5 columns of data. The y_train is a vector with 400K rows of data. I am using auto-sklearn==0.10.0. I have been adjusting the ml_memory_limit beyond 5000 MB, but the program returns pretty quickly with the same error. The ml_memory_limit doesn't seem to be honored. I have tried the suggestions in issue #520 but to no avail.
I tried to run the following example in the Jupyter notebook to make sure I am using the library correctly:
It finished training successfully.
I would appreciate any help from the community!
Environment: