NervanaSystems / ngraph-tf

Bridge to connect nGraph with TensorFlow

Problems getting ngraph-tf to run under manjaro #535

Open SleepProgger opened 5 years ago

SleepProgger commented 5 years ago

I have been trying for several days to get ngraph-tf to run under Manjaro and have run into multiple problems. The goal is to use ngraph-tf with the PlaidML backend.

I am testing with the following code:

import tensorflow as tf
import os
import sys
if os.environ.get("USE_TF_KERAS", "1") == "1":
    import tensorflow.keras as keras
    print("Using tensorflow keras version")
else:
    import keras
    print("Using keras with backend %s" % keras.backend.backend())

if len(sys.argv) < 2:
    backend = "CPU"
else:
    backend = sys.argv[1]
if backend == "NONE":
    print("NOT using ngraph")
else:
    import ngraph_bridge
    print("Supported ngraph backend:\n  %s" % "\n  ".join(ngraph_bridge.list_backends()))
    ngraph_bridge.set_backend(backend)
    print("Using ngraph backend %s" % ngraph_bridge.get_currently_set_backend_name())

mnist = keras.datasets.mnist
(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
model = keras.models.Sequential([
  keras.layers.Flatten(input_shape=(28, 28)),
  keras.layers.Dense(512, activation="relu"),
  keras.layers.Dropout(0.2),
  keras.layers.Dense(10, activation="softmax")
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("Predict:", model.predict(x_train[:1]))

model.fit(x_train, y_train, epochs=5)
model.evaluate(x_test, y_test)

When I run it with tensorflow.keras and the ngraph backend set to PLAIDML (USE_TF_KERAS=1 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py PLAIDML), I get either a segfault or this stack trace (sometimes the one, sometimes the other):

Traceback (most recent call last):
  File "test_ngrapg_tf.py", line 39, in <module>
    model.fit(x_train, y_train, epochs=5)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 329, in model_iteration
    batch_outs = f(ins_batch)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Caught exception while compiling op_backend: get_shape() must be called on a node with exactly one output ()

     [[{{node ngraph_cluster_44}}]]

When I run it with keras and the keras backend set to tensorflow (USE_TF_KERAS=0 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py PLAIDML), I reliably get invalid OpenCL kernels generated by PlaidML (see https://github.com/plaidml/plaidml/issues/322).

Both versions can execute the prediction step just fine, although keras with the tensorflow backend seems to produce wrong values.

With only tensorflow or plaidml via keras (or, in the case of tf, also tf.keras) and without ngraph-tf, it runs without a problem (USE_TF_KERAS=1/0 KERAS_BACKEND="tensorflow" python test_ngrapg_tf.py NONE). Those tests were made with a self-built version of ngraph-tf, both with and without the --use_prebuilt_tensorflow parameter.
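To make the combinations above easier to reproduce, here is a minimal sketch of a driver that runs the test script once per configuration in a fresh process. The helper names (build_runs, run_all) are hypothetical and not part of ngraph-tf; the script name, environment variables and backend names are the ones used in this report.

```python
import itertools
import os
import subprocess
import sys

SCRIPT = "test_ngrapg_tf.py"  # the test script name used in this report


def build_runs(script=SCRIPT, backends=("NONE", "CPU", "PLAIDML")):
    """Return (label, env_overrides, argv) for every combination to test."""
    runs = []
    for use_tf_keras, backend in itertools.product(("1", "0"), backends):
        env = {"USE_TF_KERAS": use_tf_keras, "KERAS_BACKEND": "tensorflow"}
        argv = [sys.executable, script, backend]
        label = "USE_TF_KERAS=%s backend=%s" % (use_tf_keras, backend)
        runs.append((label, env, argv))
    return runs


def run_all(runs):
    """Run each configuration in its own process and collect the exit codes."""
    results = {}
    for label, env, argv in runs:
        proc = subprocess.run(argv, env=dict(os.environ, **env))
        # On POSIX a negative return code means the process was killed by a
        # signal, e.g. -11 for SIGSEGV or -4 for SIGILL on Linux.
        results[label] = proc.returncode
    return results


# Usage (run from the directory that contains the test script):
#   for label, code in run_all(build_runs()).items():
#       print("%-40s exit code %d" % (label, code))
```

Running each configuration in a separate process keeps a segfault in one run from taking down the others, and the exit code distinguishes a Python exception (positive code) from a crash caused by a signal (negative code).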

With the CPU ngraph backend it runs both with keras (tensorflow as the keras backend) and with tf.keras, although in both cases much slower than plain tensorflow-cpu without ngraph. Additionally, when using keras with the backend set to tensorflow, the results seem to be wrong.

When I run it with the ngraph CPU backend using the PyPI version of ngraph-tf installed via pip, I get an Illegal instruction crash with both keras->tensorflow and tf.keras.
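An Illegal instruction crash usually means a binary uses CPU instructions the processor does not support. Whether that is the cause here is only a guess. The sketch below lists which of a few instruction-set flags are missing from /proc/cpuinfo; the list of candidate flags is an assumption, since I do not know which flags the PyPI wheel was built with, and the helper names are hypothetical.

```python
# Assumption: instruction sets a prebuilt binary might have been compiled for.
CANDIDATE_FLAGS = ("sse4_1", "sse4_2", "avx", "avx2", "fma", "avx512f")


def parse_cpu_flags(cpuinfo_text):
    """Return the flag set from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, wanted=CANDIDATE_FLAGS):
    """Return the wanted flags that the CPU does not report, in order."""
    flags = parse_cpu_flags(cpuinfo_text)
    return [name for name in wanted if name not in flags]


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as handle:
        return handle.read()


# Usage on a Linux machine:
#   print("Missing:", missing_flags(read_cpuinfo()))
```

If the list comes back non-empty, building ngraph-tf from source on the target machine (as done for the other tests) avoids depending on how the wheel was compiled.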

Additional info

I am using Python 3.5.5 installed via pyenv.

# uname -a 
Linux seima-pc 5.0.15-1-MANJARO #1 SMP PREEMPT Fri May 10 19:51:04 UTC 2019 x86_64 GNU/Linux

GPU: Radeon RX 580

When compiling ngraph-tf I need to create a symlink named lib that points to lib64 in the artifact dir; otherwise the ngraph-tf build fails, because it expects the lib dir but creates the lib64 dir (not sure if this is relevant).
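The workaround can be sketched roughly as follows. The function name and the artifacts_dir argument are hypothetical; pass whatever artifact directory your build actually uses.

```python
import os


def link_lib_to_lib64(artifacts_dir):
    """Create artifacts_dir/lib as a symlink to lib64 if only lib64 exists.

    Returns True if a link was created, False if nothing was changed.
    """
    lib = os.path.join(artifacts_dir, "lib")
    lib64 = os.path.join(artifacts_dir, "lib64")
    if os.path.lexists(lib):
        return False  # lib already exists (directory or link); leave it alone
    if not os.path.isdir(lib64):
        return False  # nothing to link to
    # Relative target, so the link stays valid if the directory is moved.
    os.symlink("lib64", lib)
    return True
```

The check with os.path.lexists makes the helper safe to call repeatedly, for example once before every build.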

Sorry for the wall of text, but I really don't know where it goes wrong. Please let me know if additional information is required.

SleepProgger commented 5 years ago

After working around (although in a very crude way) the invalid OpenCL kernel generated by plaidml (https://github.com/plaidml/plaidml/issues/322), I now get the same error with both tensorflow.keras and keras with the keras backend set to tensorflow, i.e.:

Traceback (most recent call last):
  File "test_ngrapg_tf.py", line 39, in <module>
    model.fit(x_train, y_train, epochs=5)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training.py", line 880, in fit
    validation_steps=validation_steps)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/engine/training_arrays.py", line 329, in model_iteration
    batch_outs = f(ins_batch)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/keras/backend.py", line 3076, in __call__
    run_metadata=self.run_metadata)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1439, in __call__
    run_metadata_ptr)
  File "/run/media/nope/data/home/nope/workspace/test/fs/ngraph-tf_master/build_cmake/venv-tf-py3/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 528, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: Caught exception while compiling op_backend: get_shape() must be called on a node with exactly one output ()

     [[{{node ngraph_cluster_44}}]]

or a segfault (sometimes the one, sometimes the other).
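For the segfault case, Python's standard faulthandler module can at least show which Python call was active when the process crashed. A minimal sketch, assuming it is placed at the very top of the test script, before tensorflow or ngraph_bridge is imported; the wrapper name enable_crash_traces is hypothetical.

```python
import faulthandler
import sys


def enable_crash_traces(log_file=None):
    """Dump the Python traceback of all threads when a fatal signal arrives.

    faulthandler installs handlers for SIGSEGV, SIGFPE, SIGABRT, SIGBUS and
    SIGILL. The file must stay open for as long as the handler is enabled.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.enable(file=target, all_threads=True)
    return faulthandler.is_enabled()


# Usage at the top of the test script:
#   enable_crash_traces()
```

The same effect is available without editing the script via `python -X faulthandler test_ngrapg_tf.py PLAIDML` or by setting PYTHONFAULTHANDLER=1. Note that this only shows Python frames; the native stack inside the ngraph or plaidml libraries would still need a debugger such as gdb.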

I plan to try an Ubuntu-based distro tomorrow to see if it is indeed Manjaro-related.

SleepProgger commented 5 years ago

Sadly, basically the same behavior under Mint (Ubuntu LTS based).