getnamo / TensorFlow-Unreal

TensorFlow plugin for the Unreal Engine.

Dependency problem on WIN64 UE 4.18.3 #22

Closed Gaspa90 closed 6 years ago

Gaspa90 commented 6 years ago

Hi, I am trying to get the example project going. I followed all the instructions for installing GPU support with CUDA and setting the environment variables, then downloaded the example project and added the plugin folder. Running the Basic map seems to work fine: even though I get a ue_site not found error, I can send operations and get results, as you can see from the output log. basic_outputlog.txt

Running the Mnist map causes a simplejson not found error, and if I try sending inputs while playing in the editor viewport I also get an AttributeError: 'TensorFlowComponent' object has no attribute 'tfapi' error. mnist_outputlog.txt

Do you have any idea what could cause this issue? Thanks in advance for your support and for your effort in getting ML on UE4.

Gaspa90 commented 6 years ago

I went ahead and used upypip to install simplejson, then tried running the MNIST map again. This seems to solve the initial problem, since the level now opens the datasets, but a few seconds after it starts training the whole of UE4 crashes. Here is the output log I found afterwards inside the project folder: TensorFlowExamples.log. Any helpful insights?

getnamo commented 6 years ago

It looks like you pulled tensorflow-ue4-examples directly from the git master branch; apparently I left in a dangling dependency during a Keras optimization script commit earlier. You can safely remove the import simplejson line from mnistKerasCNNOpt as it's not needed, or you can add "simplejson":"latest" to the project upymodule.json, which ensures that dependency is always there. That said, I recommend you download a stable release from https://github.com/getnamo/tensorflow-ue4-examples/releases/; these are usually tested and shouldn't have broken commits. Also, the ue_site module error can be safely ignored.

As to why it's crashing, unfortunately your log doesn't show anything other than that it is crashing at the first epoch. Have you tried the CPU version of the plugin first? An easy way to do this would be to type

import upypip as pip
pip.uninstallAll()

Wait until that finishes (it may not work if you hit play and it loaded the tensorflow DLLs; in that case restart and try again), then change the dependency in the plugin upymodule.json to just tensorflow and relaunch the project.

Then see whether mnistSpawnSamples, for example, works. If it does, change the dependency back to tensorflow-gpu and see if it breaks. Then make sure you have the exact dependencies listed for your plugin release (https://github.com/getnamo/tensorflow-ue4/releases/); they are listed in the release notes. For the latest plugin release, 0.7.0, which uses tensorflow v1.6, that means CUDA v9.0 (not 9.1!) and cuDNN v7.
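The dependency swap described above can also be scripted. The following is only a rough sketch: the function name swap_dependency is hypothetical, the "pythonModules" key and the example manifest contents are assumptions about how upymodule.json is laid out, and the version string is illustrative. Check your own upymodule.json before relying on it.

```python
import json


def swap_dependency(manifest_text, old, new, version=None, section="pythonModules"):
    """Return manifest JSON text with dependency `old` replaced by `new`.

    `section` is an ASSUMED key name for the dependency table; adjust it
    to whatever your upymodule.json actually uses.
    """
    manifest = json.loads(manifest_text)
    modules = manifest.get(section, {})
    if old not in modules:
        raise KeyError("dependency %r not found in section %r" % (old, section))
    # Keep the pinned version unless the caller supplies a new one.
    old_version = modules.pop(old)
    modules[new] = old_version if version is None else version
    manifest[section] = modules
    return json.dumps(manifest, indent=4)


# Illustrative manifest only; not copied from the plugin.
example = '{"name": "example-plugin", "pythonModules": {"tensorflow-gpu": "1.6.0"}}'
print(swap_dependency(example, "tensorflow-gpu", "tensorflow"))
```

Calling it again with the two names reversed switches back to the GPU build once the CPU build has been confirmed to work.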

NB: You can also change the script that you use for mnist by clicking on the ConnectedTFMnistActor and changing the python TFModuleLine, as specified here: https://github.com/getnamo/tensorflow-ue4-examples#other-classifiers-eg-cnn-keras-model

I recommend trying mnistSpawnSamples first. Generally any mnist scripts found in https://github.com/getnamo/tensorflow-ue4-examples/tree/master/Content/Scripts should be compatible. Perhaps some of those will work better.

Gaspa90 commented 6 years ago

Thanks for your speed-of-light reply. I had indeed installed from master; my bad. I then used the links you provided for both the example project and the plugin (for the plugin I was already using the right release), and that solves the simplejson error; ue_site is now shown as a warning.

Regarding "For the latest plugin release 0.7.0 which uses tensorflow v1.6 it's CUDA v9.0 (not 9.1!) and cuDNN v7": I have the right versions. By adding a few lines to mnistSpawnSamples I see my GPU correctly listed by device_lib.list_local_devices(), and I get correct outputs using with tf.device('/gpu:0').

At the moment mnistSpawnSamples and mnistSimple both work, but they were working before as well. UE4 seems to crash only on mnistKerasCNN, exactly on the model.fit line. It seems suspicious to me that mnistKerasCNN is the only script using Keras. Coincidence?! Another thing worth mentioning is that I see my GPU memory usage climb to the maximum when I run mnistKerasCNN, before UE4 crashes. To check this I added a few lines (config = tf.ConfigProto() and a value for config.gpu_options.per_process_gpu_memory_fraction) to limit GPU memory usage to 30%, and reduced the batch_size parameter to 32 for good measure. Now when I run the script GPU memory usage rises by only 30%, but UE4 still crashes at the same point.

It is 1 AM here and I haven't tried your suggestion about using the CPU version yet. I can live with using only tensorflow without Keras (if that is the problem), but I cannot live with training on a CPU... Anyway, tomorrow I will try installing the CPU version and then switching to GPU to see what happens.

I am looking forward to having everything set up and ready to start some amazing projects about unbeatable AIs :-) but I want to be sure the environment is correctly configured!
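For reference, the memory cap described above looks roughly like this in TensorFlow 1.x. This is a sketch, not the exact lines added to the script: the helper name make_gpu_config is hypothetical, and the tensorflow module is passed in as an argument only so the snippet can be read and imported without TensorFlow installed. Whether mnistKerasCNN picks up the session through keras.backend.set_session is an assumption.

```python
def make_gpu_config(tf, memory_fraction=0.3):
    """Build a TF 1.x session config capped at `memory_fraction` of GPU memory.

    `tf` is the imported tensorflow module (TF 1.x API assumed).
    """
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")
    config = tf.ConfigProto()
    # Upper bound on the share of each visible GPU's memory this process may take.
    config.gpu_options.per_process_gpu_memory_fraction = memory_fraction
    return config


# Possible use inside a TF 1.x script (not executed here; assumes standalone Keras):
#   import tensorflow as tf
#   from keras import backend as K
#   K.set_session(tf.Session(config=make_gpu_config(tf, 0.3)))
```

Note that this only limits how much memory TensorFlow reserves. It does not change which CUDA or cuDNN libraries are loaded, so a crash that persists with the cap in place points away from memory pressure.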

Gaspa90 commented 6 years ago

OK, I tried creating a new project using just the tensorflow CPU version and it seems to work fine for all scripts, but KerasCNN takes 450 s to complete training and brings CPU usage to 100%. Since the GPU version is crashing only on KerasCNN, I wonder whether that is the only script using the cuDNN 7 module. Is that right? Could there be a problem linking that module? The NVIDIA installation guide (http://docs.nvidia.com/deeplearning/sdk/cudnn-install/index.html) says:

3.3. Installing cuDNN on Windows

The following steps describe how to build a cuDNN dependent program. In the following sections:

your CUDA directory path is referred to as C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0
your cuDNN directory path is referred to as <installpath>

Navigate to your <installpath> directory containing cuDNN.
Unzip the cuDNN package.

cudnn-9.0-windows7-x64-v7.zip

or

cudnn-9.0-windows10-x64-v7.zip

Copy the following files into the CUDA Toolkit directory.
    Copy <installpath>\cuda\bin\cudnn64_7.dll to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin.
    Copy <installpath>\cuda\include\cudnn.h to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\include.
    Copy <installpath>\cuda\lib\x64\cudnn.lib to C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64.
Set the following environment variables to point to where cuDNN is located. To access the value of the $(CUDA_PATH) environment variable, perform the following steps:
    Open a command prompt from the Start menu.
    Type Run and hit Enter.
    Issue the control sysdm.cpl command.
    Select the Advanced tab at the top of the window.
    Click Environment Variables at the bottom of the window.
    Ensure the following values are set:

    Variable Name: CUDA_PATH 
    Variable Value: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0

All these steps are clear to me, but then it says:

Include cudnn.lib in your Visual Studio project.
    Open the Visual Studio project and right-click on the project name.
    Click Linker > Input > Additional Dependencies.
    Add cudnn.lib and click OK.

How should I take this last step into account in a UE4 tensorflow project? Thanks for the support. For the record, switching the project to tensorflow-gpu puts me back in the same spot: crashing on mnistKerasCNN and working fine for the other example scripts.

getnamo commented 6 years ago

Last part is not needed. You should only need to copy the cudnn64_7.dll into your C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin as that's the only thing it is looking for from cuDNN. But if you had that part missing you would get an error in the console and not a crash. Do you have other CUDA SDKs installed? What about your PATH? Consider uninstalling the other CUDA versions, e.g. 8.0, to try to eliminate possible conflicts.

What GPU are you trying to train on, btw? In theory there could be memory pressure; try lowering the batch size to, say, 40 instead of 400 in https://github.com/getnamo/tensorflow-ue4-examples/blob/master/Content/Scripts/mnistKerasCNNOpt.py#L126. Does https://github.com/getnamo/tensorflow-ue4-examples/blob/master/Content/Scripts/mnistKerasCNN.py crash as well? Does the crash window give any additional information (maybe take a screenshot)?
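One quick way to check for PATH conflicts is to list every PATH directory that contains a given DLL, in search order. The sketch below uses only the standard library; the helper name find_on_path is hypothetical, and cudart64_90.dll as the CUDA 9.0 runtime DLL name is an assumption to verify against your install.

```python
import os


def find_on_path(filename, path_value=None):
    """Return every PATH directory that contains `filename`, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        # PATH entries are sometimes quoted or padded with spaces on Windows.
        entry = entry.strip().strip('"')
        if entry and os.path.isfile(os.path.join(entry, filename)):
            hits.append(entry)
    return hits


if __name__ == "__main__":
    # DLL names are assumptions for a CUDA 9.0 / cuDNN 7 setup.
    for dll in ("cudnn64_7.dll", "cudart64_90.dll"):
        print(dll, "->", find_on_path(dll) or "not found on PATH")
```

If a DLL shows up in more than one directory, the first entry is the one most likely to be loaded, which helps spot a stale copy from another CUDA version shadowing the intended one.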

Also note that you can exit early from training e.g. after just a single batch in CPU by hitting 'g' in the mnist examples. Even 1-2 epochs of the keras model is generally sufficient to beat linear regression models.

Gaspa90 commented 6 years ago

> Last part is not needed. You should only need to copy the cudnn64_7.dll into your C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\bin

It is there.

> What GPU are you trying to train on, btw?

GTX 980. This is the output I get from running device_query: CUDA device query

> In theory there could be memory pressure

Yesterday I tried limiting GPU memory usage in the scripts to 30% and the batch size to 32, and it correctly does not allocate more than that, so it can't be that.

> Do you have other CUDA SDKs installed?

I do indeed: I am using 8.0 and cuDNN 6 in a separate environment, and NVIDIA says those can coexist. It doesn't seem to me that there is a conflict in the path variables: path1. The next one shows the PATH entries: path2

Unfortunately I don't have more information in the logs than what I initially posted. The crash is simply that UE4 stops working: crash

getnamo commented 6 years ago

I'm able to replicate this issue on my laptop, will let you know what I find out when I have a solution.

Gaspa90 commented 6 years ago

> I'm able to replicate this issue on my laptop

I have found the root of my problem, though I don't know how you replicated it on your laptop!

I changed mnistKerasCNN and mnistSimple to run in Jupyter notebooks in my Anaconda environment (where I have CUDA 8 and cuDNN 6) and they work fine. Moreover, if I insert convolutional layers into mnistSimple, it causes UE4 to crash as well.

Performing a clean installation of the latest graphics driver solved my issue; apparently the version I had worked with CUDA 8 / cuDNN 6 but not with CUDA 9 / cuDNN 7.

getnamo commented 6 years ago

For some reason my laptop is crashing in a very similar manner; I'm going to try cleaning up the dependencies and drivers to resolve it like you did.