automl / HpBandSter

a distributed Hyperband implementation on Steroids

Learning phase error with batchnormalization #58

Open AlexTreacher opened 5 years ago

AlexTreacher commented 5 years ago

Hello All,

I am using this framework to optimize deep learning architectures. I am running into issues when I add batch normalization and use the BOHB optimizer. To make it as simple as possible, I reproduced the problem by adding batch normalization to example 5.

When I run the models individually with randomly selected configurations, they always work. However, when I run example_5_mnist with a Keras worker that includes batch normalization under the BOHB optimizer, I frequently get this error in the results output for some models:

    c_api.TF_GetCode(self.status.status))
    tensorflow.python.framework.errors_impl.InvalidArgumentError: Tensor dropout_1/keras_learning_phase:0, specified in either feed_devices or fetch_devices was not found in the Graph

If I remove batch normalization from example_5_keras_worker, I do not get these errors.

I have attached the new worker file in case it helps (as a .txt file for compatibility). You can see the batch normalization additions in lines 86-87, 94-95, 102-103, 172-174, and 195-199. example_5_keras_worker_BN.txt
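Since the attached file is not reproduced here, the snippet below is only a rough sketch of the kind of change described: inserting BatchNormalization layers into the example 5 model. The function name build_model and the hyperparameter names are placeholders, not the exact code or layer sizes from the attachment.

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout, BatchNormalization

def build_model(num_filters, num_fc_units, dropout_rate,
                num_classes=10, input_shape=(28, 28, 1)):
    """Placeholder model builder mirroring the structure of example 5."""
    model = Sequential()
    model.add(Conv2D(num_filters, kernel_size=(3, 3), activation='relu',
                     input_shape=input_shape))
    model.add(BatchNormalization())   # added after the convolutional layer
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(num_fc_units, activation='relu'))
    model.add(BatchNormalization())   # added after the fully connected layer
    model.add(Dropout(dropout_rate))
    model.add(Dense(num_classes, activation='softmax'))
    return model
```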

I have yet to track down the problem, but I thought I'd post here in case someone else has run into the same issue. If no one posts an answer and I figure out what is going on, I'll be sure to post the solution.

Thanks all!

sfalkner commented 5 years ago

Hello, thanks for sharing. I don't use TensorFlow much myself, so I have not encountered an issue like that. Maybe as a pointer for your investigation: every call to your worker's compute method is executed in a separate thread. That used to cause major issues with TensorFlow, but was fixed at some point. Maybe that has something to do with it?
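If the threading is indeed the culprit, one mitigation worth trying (an assumption, not a confirmed fix for this issue) is to reset the Keras backend at the start of every compute() call, so each evaluation builds its model and its keras_learning_phase placeholder in a freshly created default graph rather than reusing state created by another call or thread:

```python
import keras.backend as K
from hpbandster.core.worker import Worker

class KerasWorkerBN(Worker):
    def compute(self, config, budget, working_directory, *args, **kwargs):
        # Drop any graph/session (and stale keras_learning_phase tensor)
        # left behind by a model built in a previous call or another thread.
        K.clear_session()

        # ... build, compile, and train the model exactly as in
        # example_5_keras_worker (including the BatchNormalization layers),
        # then return the usual result dictionary, e.g.:
        return {'loss': 1.0, 'info': {}}
```

If K.clear_session() alone is not enough, building and training the model inside an explicit per-call graph and session is another variant of the same idea, at the cost of a bit more boilerplate.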