automl / Auto-PyTorch

Automatic architecture search and hyperparameter optimization for PyTorch
Apache License 2.0

Predict() crashes on CPU #4

Closed KEggensperger closed 5 years ago

KEggensperger commented 5 years ago

When running examples/basics/basic_classification.py on a CPU, I receive the following output and error:

11:11:18 WORKER: start listening for jobs
11:11:18 [AutoNet] Start bohb
11:11:18 DISPATCHER: started the 'discover_worker' thread
11:11:18 DISPATCHER: started the 'job_runner' thread
11:11:18 DISPATCHER: Pyro daemon running on 10.5.150.146:37155
11:11:19 DISPATCHER: discovered new worker, hpbandster.run_0.worker.mllap06.2084.-1139912695220032
11:11:19 HBMASTER: starting run at 1548324679.0029259
11:11:19 HBMASTER: adjusted queue size to (0, 1)
11:11:19 WORKER: start processing job (0, 0, 0)
11:11:19 Fit optimization pipeline
11:11:22 Finished train with budget 1.0: Preprocessing took 0s, Training took 2s, Wrap up took 0s. Total time consumption in s: 3
11:11:22 Training ['resnet'] with budget 1.0 resulted in score: -39.92592692375183 took 3.185837745666504 seconds
11:11:22 WORKER: registered result for job (0, 0, 0) with dispatcher
11:11:22 WORKER: start processing job (0, 0, 1)
11:11:22 Fit optimization pipeline
11:11:23 Finished train with budget 1.0: Preprocessing took 0s, Training took 0s, Wrap up took 0s. Total time consumption in s: 1
11:11:23 Training ['shapedmlpnet'] with budget 1.0 resulted in score: -39.92592692375183 took 1.4267945289611816 seconds
11:11:23 WORKER: registered result for job (0, 0, 1) with dispatcher
11:11:23 WORKER: start processing job (0, 0, 2)
11:11:23 Fit optimization pipeline
11:11:30 Finished train with budget 1.0: Preprocessing took 0s, Training took 5s, Wrap up took 1s. Total time consumption in s: 7
11:11:31 Training ['resnet'] with budget 1.0 resulted in score: -27.77777910232544 took 7.370416879653931 seconds
11:11:31 WORKER: registered result for job (0, 0, 2) with dispatcher
11:11:31 WORKER: start processing job (0, 0, 3)
11:11:31 Fit optimization pipeline
11:11:33 Finished train with budget 1.0: Preprocessing took 1s, Training took 0s, Wrap up took 0s. Total time consumption in s: 2
11:11:33 Training ['resnet'] with budget 1.0 resulted in score: -27.77777910232544 took 2.184109687805176 seconds
11:11:33 WORKER: registered result for job (0, 0, 3) with dispatcher
11:11:33 WORKER: start processing job (0, 0, 4)
11:11:33 Fit optimization pipeline
11:11:36 Finished train with budget 1.0: Preprocessing took 0s, Training took 2s, Wrap up took 0s. Total time consumption in s: 3
11:11:36 Training ['resnet'] with budget 1.0 resulted in score: -27.77777910232544 took 3.546250581741333 seconds
11:11:36 WORKER: registered result for job (0, 0, 4) with dispatcher
11:11:36 WORKER: start processing job (0, 0, 5)
11:11:36 Fit optimization pipeline
/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/scikit_learn-0.20.2-py3.6-linux-x86_64.egg/sklearn/decomposition/fastica_.py:305: UserWarning: n_components is too large: it will be set to 21
  warnings.warn('n_components is too large: it will be set to %s' % n_components)
[...]
11:12:12 WORKER: registered result for job (0, 0, 6) with dispatcher
11:12:12 WORKER: start processing job (0, 0, 0)
11:12:12 Fit optimization pipeline
11:12:35 Finished train with budget 9.0: Preprocessing took 0s, Training took 21s, Wrap up took 0s. Total time consumption in s: 22
11:12:35 Training ['resnet'] with budget 9.0 resulted in score: -61.33333444595337 took 22.3313090801239 seconds
11:12:35 WORKER: registered result for job (0, 0, 0) with dispatcher
11:12:35 DISPATCHER: Dispatcher shutting down
11:12:35 DISPATCHER: shut down complete
11:12:35 Start autonet with config:
{'budget_type': 'epochs', 'min_budget': 1, 'max_budget': 9, 'num_iterations': 1, 'log_level': 'info', 'shuffle': True, 'hyperparameter_search_space_updates': None, 'run_id': '0', 'task_id': -1, 'algorithm': 'bohb', 'result_logger_dir': '.', 'eta': 3, 'min_workers': 1, 'working_dir': '.', 'network_interface_name': 'eth1', 'memory_limit_mb': 1000000, 'use_tensorboard_logger': False, 'validation_split': 0.0, 'cv_splits': 1, 'use_stratified_cv_split': True, 'min_budget_for_cv': 0, 'half_num_cv_splits_below_budget': 0, 'imputation_strategies': ['mean', 'median', 'most_frequent'], 'normalization_strategies': ['none', 'minmax', 'standardize', 'maxabs'], 'categorical_features': [], 'preprocessors': ['none', 'truncated_svd', 'fast_ica', 'kitchen_sinks', 'kernel_pca', 'nystroem'], 'over_sampling_methods': ['none', 'random', 'smote'], 'under_sampling_methods': ['none', 'random'], 'target_size_strategies': ['none', 'upsample', 'downsample', 'average', 'median'], 'embeddings': ['none', 'learned'], 'networks': ['mlpnet', 'shapedmlpnet', 'resnet', 'shapedresnet'], 'final_activation': 'softmax', 'optimizer': ['adam', 'sgd'], 'lr_scheduler': ['cosine_annealing', 'cyclic', 'exponential', 'step', 'plateau', 'none'], 'additional_logs': [], 'train_metric': 'accuracy', 'additional_metrics': [], 'loss_modules': ['cross_entropy', 'cross_entropy_weighted'], 'batch_loss_computation_techniques': ['standard', 'mixup'], 'training_techniques': ['early_stopping'], 'minimize': False, 'cuda': True, 'eval_on_training': False, 'full_eval_each_epoch': False, 'early_stopping_patience': inf, 'early_stopping_reset_parameters': False, 'random_seed': 647837117, 'max_runtime': inf}
11:12:56 Finished train with budget 9.0: Preprocessing took 0s, Training took 20s, Wrap up took 0s. Total time consumption in s: 21
({'Imputation:strategy': 'most_frequent', 'LearningrateSchedulerSelector:lr_scheduler': 'cyclic', 'LossModuleSelector:loss_module': 'cross_entropy_weighted', 'NetworkSelector:network': 'resnet', 'NormalizationStrategySelector:normalization_strategy': 'none', 'OptimizerSelector:optimizer': 'adam', 'PreprocessorSelector:preprocessor': 'nystroem', 'ResamplingStrategySelector:over_sampling_method': 'smote', 'ResamplingStrategySelector:target_size_strategy': 'downsample', 'ResamplingStrategySelector:under_sampling_method': 'random', 'TrainNode:batch_loss_computation_technique': 'standard', 'TrainNode:batch_size': 90, 'LearningrateSchedulerSelector:cyclic:cycle_length': 10, 'LearningrateSchedulerSelector:cyclic:max_factor': 1.192086838687192, 'LearningrateSchedulerSelector:cyclic:min_factor': 0.1739108186563212, 'NetworkSelector:resnet:activation': 'sigmoid', 'NetworkSelector:resnet:blocks_per_group': 4, 'NetworkSelector:resnet:num_groups': 5, 'NetworkSelector:resnet:num_units_0': 12, 'NetworkSelector:resnet:num_units_1': 134, 'NetworkSelector:resnet:use_dropout': False, 'NetworkSelector:resnet:use_shake_drop': False, 'NetworkSelector:resnet:use_shake_shake': False, 'OptimizerSelector:adam:learning_rate': 0.03881547471994297, 'OptimizerSelector:adam:weight_decay': 0.04619218671240021, 'PreprocessorSelector:nystroem:kernel': 'poly', 'PreprocessorSelector:nystroem:n_components': 449, 'ResamplingStrategySelector:smote:k_neighbors': 3, 'NetworkSelector:resnet:num_units_2': 374, 'NetworkSelector:resnet:num_units_3': 413, 'NetworkSelector:resnet:num_units_4': 244, 'NetworkSelector:resnet:num_units_5': 14, 'PreprocessorSelector:nystroem:coef0': -0.2635372875151014, 'PreprocessorSelector:nystroem:degree': 3, 'PreprocessorSelector:nystroem:gamma': 0.00023889215064720927}, -61.33333444595337)
Traceback (most recent call last):
  File "examples/basics/basic_classification.py", line 20, in <module>
    print("Score:", autonet.score(X_test=dm.X_train, Y_test=dm.Y_train))
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/core/api.py", line 184, in score
    self.pipeline.predict_pipeline(pipeline_config=self.autonet_config, X=X_test)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/base/pipeline.py", line 50, in predict_pipeline
    return self.root.predict_traverse(**kwargs)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/base/node.py", line 136, in predict_traverse
    node.predict_output = node.predict(**required_kwargs)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/nodes/optimization_algorithm.py", line 116, in predict
    result = self.sub_pipeline.predict_pipeline(pipeline_config=pipeline_config, X=X)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/base/pipeline.py", line 50, in predict_pipeline
    return self.root.predict_traverse(**kwargs)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/base/node.py", line 136, in predict_traverse
    node.predict_output = node.predict(**required_kwargs)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/nodes/cross_validation.py", line 121, in predict
    result = self.sub_pipeline.predict_pipeline(pipeline_config=pipeline_config, X=X)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/base/pipeline.py", line 50, in predict_pipeline
    return self.root.predict_traverse(**kwargs)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/base/node.py", line 136, in predict_traverse
    node.predict_output = node.predict(**required_kwargs)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/nodes/train_node.py", line 101, in predict
    Y = predict(network, X, 20, device)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/autoPyTorch-0.0.1-py3.6.egg/autoPyTorch/pipeline/nodes/train_node.py", line 276, in predict
    network = network.to(device)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 381, in to
    return self._apply(convert)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 187, in _apply
    module._apply(fn)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
    param.data = fn(param.data)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 379, in convert
    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 161, in _lazy_init
    _check_driver()
  File "/home/eggenspk/anaconda3/envs/Autopytorch_36/lib/python3.6/site-packages/torch/cuda/__init__.py", line 75, in _check_driver
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

It seems that a GPU is requested only when scoring the final model (the incumbent also has 'cuda': True), but there is none, since I am using pytorch-cpu. Also, python -c "import torch; print(torch.cuda.is_available())" prints False on my machine.

mlindauer commented 5 years ago

I can reproduce the issue on my machine with a freshly installed conda environment.

mlindauer commented 5 years ago

 conda create -n autopytorch python pip
 source activate autopytorch
 git clone https://github.com/automl/Auto-PyTorch.git
 cd Auto-PyTorch/
 conda install pytorch-cpu -c pytorch
 pip install -r requirements.txt
 python setup.py install
 cat README.md
 vi test.py
 python test.py

These are the steps I used; test.py runs the example from the README.

urbanmatthias commented 5 years ago

Hi,

We forgot to check whether CUDA is available in predict(). My last commit should fix this issue.

Cheers,

Matthias

KEggensperger commented 5 years ago

Yes, now it works! Still, the final incumbent has cuda=True, but feel free to close this issue for now.

urbanmatthias commented 5 years ago

The dictionary containing cuda=True is not the final incumbent; it holds the settings Auto-PyTorch was started with (the hyperparameters of Auto-PyTorch itself).

It was not possible to set the default value of cuda to torch.cuda.is_available(), because that caused problems with pynisher. It seems to be impossible to call CUDA methods from different processes.

So we chose to set the default of cuda to True and then disable it at run time if CUDA is not available.
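
As a rough illustration of that choice (not the actual Auto-PyTorch code), the sketch below keeps the default as a plain constant and only queries CUDA availability at the point where the device is needed. The function name resolve_device and the cuda_available parameter are hypothetical and exist only for this example; the only real library call assumed is torch.cuda.is_available().

```python
def resolve_device(config, cuda_available=None):
    """Return "cuda" or "cpu" for a config dict whose 'cuda' entry defaults to True.

    `cuda_available` is a hypothetical hook (a zero-argument callable) so the
    logic can be exercised without a GPU or even without torch installed.
    """
    if cuda_available is None:
        # Import lazily: the availability check then runs in the process that
        # actually uses the device, not when the defaults are defined.
        import torch
        cuda_available = torch.cuda.is_available

    wants_cuda = bool(config.get("cuda", True))  # default stays a constant True
    # Short-circuit: CUDA is only queried when the config asks for it.
    return "cuda" if wants_cuda and cuda_available() else "cpu"
```

With a helper like this, a call such as network.to(resolve_device(pipeline_config)) would fall back to the CPU on a pytorch-cpu install, even though the printed start-up settings still show 'cuda': True.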