Talented-zack / AutoGLuon-V.S-XGBoost

A performance comparison of the newest AutoGluon vs. XGBoostClassifier and RandomForestClassifier

AutoGluon error: Exception in worker process: Can't pickle local object 'TaskScheduler._run_dist_job.<locals>._worker' #1

Open Talented-zack opened 4 years ago

Talented-zack commented 4 years ago

Hi, all! Very excited to see that Amazon has launched the AutoGluon library. I am working on a tabular data prediction project, and I have tried two different pieces of code. The first is without setting parameters:

`predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)`

This line of code runs well.

However, after reading the In-depth FIT Tutorial in the tabular prediction section, I wrote a second version:

`predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir, num_trials=num_trials, hyperparameter_tune=True, hyperparameters=hyperparameters)`

It raises the error:

`Exception in worker process: Can't pickle local object 'TaskScheduler._run_dist_job.<locals>._worker'`

The parameters inside task.fit() have all been set, so I suspect the error is caused by a package I am missing.

First version

AutoGluon without setting parameters

```python
label_column = 'default.payment.next.month'
print("Summary of class variable: \n", train_data[label_column].describe())
dir = 'agModels-predictClass'  # specifies folder where to store trained models
predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir)
```

Second version

AutoGluon with parameters set

```python
label_column = 'default.payment.next.month'
dir = 'agModels-predictClass'  # specifies folder where to store trained models
hp_tune = True  # whether or not to do hyperparameter optimization

nn_options = {  # specifies non-default hyperparameter values for neural network models
    'num_epochs': 10,  # number of training epochs (controls training time of NN models)
    'learning_rate': ag.space.Real(1e-4, 1e-2, default=5e-4, log=True),  # learning rate used in training (real-valued hyperparameter searched on log-scale)
    'activation': ag.space.Categorical('relu', 'softrelu', 'tanh'),  # activation function used in NN (categorical hyperparameter, default = first entry)
    'layers': ag.space.Categorical([100], [1000], [200, 100], [300, 200, 100]),
    # each choice for categorical hyperparameter 'layers' corresponds to a list of sizes for each NN layer to use
    'dropout_prob': ag.space.Real(0.0, 0.5, default=0.1),  # dropout probability (real-valued hyperparameter)
}

gbm_options = {  # specifies non-default hyperparameter values for lightGBM gradient boosted trees
    'num_boost_round': 100,  # number of boosting rounds (controls training time of GBM models)
    'num_leaves': ag.space.Int(lower=26, upper=66, default=36),  # number of leaves in trees (integer hyperparameter)
}

hyperparameters = {'NN': nn_options, 'GBM': gbm_options}  # hyperparameters of each model type
# If one of these keys is missing from the hyperparameters dict, then no models of that type are trained.

num_trials = 5  # try at most 5 different hyperparameter configurations for each type of model
search_strategy = 'skopt'  # to tune hyperparameters using SKopt Bayesian optimization routine

predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir,
                     num_trials=num_trials, hyperparameter_tune=True, hyperparameters=hyperparameters)
```

Innixma commented 4 years ago

Hi Yiyang,

Great to see you are trying out AutoGluon! We will be looking into this in the AutoGluon issue linked by Hang. If you plan to compare AutoGluon to other frameworks, I suggest you also try AutoGluon with the only parameter set being auto_stack=True; this should give significantly improved results on most problems.
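For reference, a minimal sketch of that call (assuming the same `task` API and the `train_data`, `label_column`, and `dir` variables from the snippets above):

```python
# Sketch of the auto_stack suggestion: with no other parameters set,
# auto_stack=True lets AutoGluon apply stack ensembling and bagging.
predictor = task.fit(train_data=train_data, label=label_column,
                     output_directory=dir, auto_stack=True)
```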

Best, Nick

jwmueller commented 4 years ago

@Talented-zack Do you get the same error if you don't specify hyperparameters, and instead just do:

`predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir, num_trials=num_trials, hyperparameter_tune=True)`

Also, can you post the entire output after calling task.fit(), not just the error message? Finally, can you post your OS info, Python version, and the output of pip freeze? Thanks!

Talented-zack commented 4 years ago

> Hi Yiyang,
>
> Great to see you are trying out AutoGluon! We will be looking into this in the AutoGluon issue linked by Hang. If you plan to compare AutoGluon to other frameworks, I suggest you also try AutoGluon with the only parameter set being auto_stack=True; this should give significantly improved results on most problems.
>
> Best, Nick

Hi Nick,

Thanks so much for replying! The article comparing their performance has been an unexpected success at my school.

I have tried your method, but setting auto_stack=True needs more disk space and memory than my Lenovo laptop can provide. I guess what I have done so far in the project is the best performance I can achieve on my laptop.

Sincerely,

Yiyang Zhang

Talented-zack commented 4 years ago

> @Talented-zack Do you get the same error if you don't specify hyperparameters, and instead just do:
>
> `predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir, num_trials=num_trials, hyperparameter_tune=True)`
>
> Also, can you post the entire output after calling task.fit(), not just the error message? Finally, can you post your OS info, Python version, and the output of pip freeze? Thanks!

Hi,

So glad to hear from you!

Yes, I get the same error with `predictor = task.fit(train_data=train_data, label=label_column, output_directory=dir, num_trials=num_trials, hyperparameter_tune=True)`. It shows the same message: `Exception in worker process: Can't pickle local object 'TaskScheduler._run_dist_job.<locals>._worker'`. I guess the error is caused by the `hyperparameter_tune=True` setting.

The full output when setting `hyperparameter_tune=True` is as follows:

```
Beginning AutoGluon training ...
AutoGluon will save models to agModels-predictClass/
Preprocessing data ...
Here are the first 10 unique label values in your data: [1 0]
AutoGluon infers your prediction problem is: binary (because only two unique label-values observed)
If this is wrong, please specify problem_type argument in fit() instead (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = True, class 0 = False
Data preprocessing and feature engineering runtime = 0.48s ...
AutoGluon will gauge predictive performance using evaluation metric: accuracy
To change this, specify the eval_metric argument of fit()
Starting Experiments
Num of Finished Tasks is 0
Num of Pending Tasks is 5
HBox(children=(FloatProgress(value=0.0, max=5.0), HTML(value='')))
Exception in worker process: Can't pickle local object 'TaskScheduler._run_dist_job.<locals>._worker'
```

Lastly, my Python version is 3.7.4 and my OS is Windows 7.

Thank you

Sincerely,

Yiyang Zhang

qu4n7 commented 4 years ago

Hi all! Thank you for the release of AutoGluon; I'm quite excited to test it out. While trying it on the CIFAR-10 dataset (Win10, Python 3.7.4), the code

```python
from autogluon import ImageClassification as task

train_data = task.Dataset('cifar10')
classifier = task.fit(train_data)
```

leads to

```
Starting Experiments
Num of Finished Tasks is 0
Num of Pending Tasks is 2
scheduler: FIFOScheduler(
DistributedResourceManager{
  (Remote: Remote REMOTE_ID: 0, <Remote: 'inproc://192.168.43.249/8916/1' processes=1 threads=8, memory=17.06 GB>, Resource: NodeResourceManager(8 CPUs, 0 GPUs))
})
Exception in worker process: Can't pickle local object 'TaskScheduler._run_dist_job.<locals>._worker'
```

zhanghang1989 commented 4 years ago

Could you provide the dask and distributed package versions you are using?

```
pip list | grep dask
pip list | grep distributed
```
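On Windows, where grep is unavailable, findstr should give the same information:

```
pip list | findstr dask
pip list | findstr distributed
```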
365gg commented 4 years ago

When I use the code from https://autogluon.mxnet.io/tutorials/torch/hpo.html#convert-the-training-function-to-be-searchable, I get the same error:

```
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "D:\Program Files\Anaconda3\envs\pytorch_yolo3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "D:\Program Files\Anaconda3\envs\pytorch_yolo3\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
Task exception was never retrieved
future: <Task finished coro=<InProcConnector.connect() done, defined at D:\Program Files\Anaconda3\envs\pytorch_yolo3\lib\site-packages\distributed\comm\inproc.py:285> exception=OSError("no endpoint for inproc address '192.168.1.101/21348/1'",)>
Traceback (most recent call last):
  File "D:\Program Files\Anaconda3\envs\pytorch_yolo3\lib\site-packages\distributed\comm\inproc.py", line 288, in connect
    raise IOError("no endpoint for inproc address %r" % (address,))
OSError: no endpoint for inproc address '192.168.1.101/21348/1'
```

My environment: OS: Win10, Python 3.6, PyTorch 1.3.1, AutoGluon 0.0.6.

I need your help.

Innixma commented 4 years ago

Hi 365gg,

Thanks for the error report! I suspect there is an issue related to Windows for ImageClassification in AutoGluon. Unfortunately, we do not officially support Windows at the current time, and functionality there is untested. In the future we will work to expand Windows support, but for now I would suggest running AutoGluon on Linux for ImageClassification.
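For context, here is a standalone sketch (standard library only; a reading of the likely root cause, not a confirmed diagnosis) of why a locally defined worker function fails to pickle, which is exactly what Windows' spawn-based multiprocessing requires of every worker callable:

```python
import pickle

def make_worker():
    def _worker():  # locally defined (nested) function, analogous to
        pass        # TaskScheduler._run_dist_job.<locals>._worker
    return _worker

try:
    pickle.dumps(make_worker())
except (AttributeError, pickle.PicklingError) as err:
    # Prints: Can't pickle local object 'make_worker.<locals>._worker'
    print(err)
```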

Best, Nick

jeanjerome commented 3 years ago

Hi Innixma, my two cents for other AutoGluon testers: I got the same error `Exception in worker process: Can't pickle local object 'TaskScheduler._run_dist_job.<locals>._worker'` on macOS 10.14 with Python 3.8.6, dask 2020.12.0, distributed 2020.12.0, and autogluon 0.0.15 when classifying images :(

Innixma commented 2 years ago

@jeanjerome AutoGluon v0.0.15 is over 1.5 years old.

I'd recommend using AutoGluon v0.4.0, which was released this month.
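For example, upgrading with pip (assuming a standard pip environment):

```
pip install -U "autogluon==0.4.0"
```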