autonomio / talos

Hyperparameter Experiments with TensorFlow and Keras
https://autonom.io
MIT License

Talos installation on a Windows system within Conda Environment #538

Closed MarkusMiller closed 3 years ago

MarkusMiller commented 3 years ago

Hello,

I've been trying to set up a new deep learning computer for over two days now, and I'm most likely failing because of the mess between conda and pip.

I upgraded the old system when conda offered tensorflow-gpu 2.3 a couple of months ago, but I don't remember the exact install order (I had to install tensorflow-gpu 2.1 first, because CUDA was missing from the 2.3 package). In the end I managed to install tensorflow-gpu=2.3 via conda and Talos 1.0 via pip without any errors, and training on that system worked without problems.

On the new system, however, recreating the old environment didn't work, either from an explicit package file or from a yml export, because one of the packages was always missing. So I skipped this and created the env from scratch. I tried basically every possible order of installing tensorflow-gpu and talos, but I always end up with one of two results: the kernel dies as soon as I import talos and tensorflow, or the optimization process outputs nan as the loss for some (but not all) hyperparameter combinations. That is not very surprising, as the install process keeps uninstalling and reinstalling previously installed packages and finally ends with an error about tensorflow-gpu=2.2 (which was never actually installed).

Edit: on the old system, where tensorflow-gpu=2.3 and talos installed without errors or version changes during setup, the same unchanged code returns reliable optimization results for every combination.

These are the final logs of the talos install process. Before that, I had installed tensorflow-gpu=2.1 and updated it to tensorflow-gpu=2.3.

[image: final log output of the talos install]

I also tried the different talos builds listed in the docs, but they no longer seem to exist. Could you please explain how to install talos via pip alongside the latest conda tensorflow-gpu (2.3)?
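For anyone trying to reproduce this, a minimal sketch of how the package versions could be recorded before and after `pip install talos`, to see what the install changed. The helper name `installed_versions` is hypothetical, and this only reads the metadata visible to Python: packages that exist only as conda metapackages may be reported as not installed, so the output is best compared with `conda list`.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {distribution name: version string, or None if not found}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # No Python package metadata found for this name in the active env.
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Package names taken from this issue; extend the list as needed.
    names = ["tensorflow", "tensorflow-gpu", "talos"]
    for name, version in installed_versions(names).items():
        print(f"{name}: {version or 'not installed'}")
```

Running it once before and once after the talos install, in the same environment, should show which of these packages were replaced.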

mikkokotila commented 3 years ago

You have to install tensorflow, not tensorflow-gpu. Talos 1.0 will install TensorFlow.

MarkusMiller commented 3 years ago

Thanks for simply closing the issue. With tensorflow instead of tensorflow-gpu, my GPU is no longer detected, while pip install talos still overwrites, without being asked, a lot of what was installed in previous steps. And environment stability still depends heavily on install order (resulting in unrecoverable kernel crashes if done in an unlucky order).
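A minimal sketch of how the "GPU not detected" state could be checked from Python, assuming a TensorFlow 2.x install. The helper name `gpu_report` is hypothetical; `tf.config.list_physical_devices` and `tf.test.is_built_with_cuda` are existing TensorFlow 2.x calls. Note that it cannot catch a hard kernel crash during import, only an ordinary import error.

```python
import importlib


def gpu_report(module_name="tensorflow"):
    """Return a small dict describing GPU support, or None if the import fails."""
    try:
        tf = importlib.import_module(module_name)
    except ImportError:
        # Covers a missing package as well as DLL load failures on Windows.
        return None
    return {
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    report = gpu_report()
    if report is None:
        print("tensorflow could not be imported in this environment")
    else:
        for key, value in report.items():
            print(f"{key}: {value}")
```

An empty `gpus` list together with `built_with_cuda: False` would suggest that a CPU-only build ended up in the environment.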

You may leave this closed, though, as I'll be moving to a different solution because of the problems I've had...