IBM / tensorflow-large-model-support

Large Model Support in Tensorflow
Apache License 2.0

SystemError: <built-in function len> returned a result with an error set #39

Closed charlinergr closed 4 years ago

charlinergr commented 4 years ago

Hi!

I'm struggling to use tensorflow-gpu from the WML CE channel.

I followed all the steps described in: https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.htm

Then, I run:

>>> import tensorflow as tf
2020-05-27 16:59:08.988965: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.2
2020-05-27 16:59:34.412988: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.7
2020-05-27 16:59:34.730903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.7
>>> tf.config.experimental.set_lms_enabled(True)

which means the installation works, right?

But when I run my code I get this error, which I do not get in TF 2.1 without large model support: SystemError: <built-in function len> returned a result with an error set

Traceback (most recent call last):
  File "train_nx_graph_param.py", line 198, in <module>
    outputs_tr, loss_tr = compiled_update_step(input_graphs, target_graphs)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    *args, **kwds))
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 568, in __call__
    result = self._call(*args, **kwds)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 615, in _call
    self._initialize(args, kwds, add_initializers_to=initializers)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 497, in _initialize
    *args, **kwds))
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2389, in _get_concrete_function_internal_garbage_collected
    graph_function, _, _ = self._maybe_define_function(args, kwargs)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2703, in _maybe_define_function
    graph_function = self._create_graph_function(args, kwargs)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/function.py", line 2593, in _create_graph_function
    capture_by_value=self._capture_by_value),
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 978, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/def_function.py", line 439, in wrapped_fn
    return weak_wrapped_fn().__wrapped__(*args, **kwds)
  File "/sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/framework/func_graph.py", line 968, in wrapper
    raise e.ag_error_metadata.to_exception(e)
tensorflow.python.autograph.impl.api.StagingError: in converted code:

    train_nx_graph_param.py:127 update_step  *
        grad = tape.gradient(loss_tr, model.variables)#,unconnected_gradients='none')
    /sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/backprop.py:1029 gradient
        unconnected_gradients=unconnected_gradients)
    /sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/imperative_grad.py:77 imperative_grad
        compat.as_str(unconnected_gradients.value))
    /sps/l2it/crougier/conda/envs/tflm/lib/python3.7/site-packages/tensorflow_core/python/eager/backprop.py:609 _aggregate_grads
        if len(gradients) == 1:

    SystemError: <built-in function len> returned a result with an error set

Did I do something wrong? I suppose I have misunderstood something, since the same code works in another environment.

jayfurmanek commented 4 years ago

Hi. Can you share the code you ran to get this error?

smatzek commented 4 years ago

Does this error still happen with the WML CE tensorflow-gpu if you do not call tf.config.experimental.set_lms_enabled(True)?

charlinergr commented 4 years ago

Yes!

jayfurmanek commented 4 years ago

Thank you! This looks like a snippet of a larger program. It would help to have a self-contained, reproducible script. Is that possible?

Can you also post the output of conda list?

Thanks.
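
As a complement to the conda list output, a few version strings can also be collected from inside the Python interpreter that actually runs the training script. The following is only a sketch: the helper name describe_environment and the default package list are made up for illustration, and importlib.metadata requires Python 3.8 or newer (on Python 3.7 the importlib_metadata backport would be needed instead).

```python
import platform
import sys
from importlib import metadata


def describe_environment(packages=("tensorflow", "tensorflow-gpu", "numpy")):
    """Return a short, paste-able summary of the interpreter and package versions.

    The package names are illustrative defaults, not a required list.
    """
    lines = [f"python {platform.python_version()} ({sys.platform})"]
    for name in packages:
        try:
            version = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is not visible to this interpreter.
            version = "not installed"
        lines.append(f"{name} {version}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_environment())
```

Running it with the same interpreter as the failing script shows which distributions that interpreter really sees, which can differ from what the active conda environment lists.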

jayfurmanek commented 4 years ago

The above matches with the description of this issue:

https://github.com/tensorflow/tensorflow/issues/31962

Which was fixed in this commit:

https://github.com/tensorflow/tensorflow/commit/92af5842b003b8e223859bed4f0dabed400ede0f

Which didn't make it into 2.1.

You can apply that code change to your installation to see whether it helps. Please report back if that does indeed fix your issue. There are other workarounds posted in the issue linked above (including removing the @tf.function decorator) that are worth trying as well.

If this fixes your error, this is certainly a candidate for a back-port fix. Thanks for reporting.
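
To try the "remove @tf.function" workaround without editing the training function itself, the wrapping can be made optional. This is a minimal sketch under assumptions: the switch USE_TF_FUNCTION, the variable w and the toy loss are hypothetical, and update_step here only borrows its name from the traceback, not its body.

```python
import tensorflow as tf

# Hypothetical debugging switch: False runs the step eagerly,
# True compiles it with tf.function as in the original script.
USE_TF_FUNCTION = False

w = tf.Variable(2.0)  # stand-in for the model variables


def update_step(x):
    # Toy loss; the real script computes its loss from graph inputs.
    with tf.GradientTape() as tape:
        loss = tf.square(w * x)
    grad = tape.gradient(loss, w)
    return loss, grad


# Wrap only when requested, so the same function can be run both ways.
step = tf.function(update_step) if USE_TF_FUNCTION else update_step

loss, grad = step(tf.constant(3.0))
print(float(loss), float(grad))
```

If the eager run succeeds while the compiled run fails, that points at the tf.function tracing path rather than at the model code.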

charlinergr commented 4 years ago

The change in https://github.com/tensorflow/tensorflow/commit/92af5842b003b8e223859bed4f0dabed400ede0f seems to be already applied in the TensorFlow 2.1 from IBM.

I replaced

tf.GradientTape()

with

tf.gradients

and it works just fine now.

Thank you for the help!
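
For readers hitting the same error, the swap described above might look roughly like this. It is a sketch, not the code from train_nx_graph_param.py: the variable w, the input x and the loss are invented for illustration. Note that tf.gradients only works in graph mode, so in TF 2.x it has to be called inside a tf.function.

```python
import tensorflow as tf

w = tf.Variable([[1.0], [2.0]])   # stand-in for model.variables
x = tf.constant([[3.0, 4.0]])     # stand-in for the input batch


@tf.function
def grads_with_tape():
    # Original style: record the forward pass on a tape.
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tf.square(tf.matmul(x, w)))
    return tape.gradient(loss, [w])


@tf.function
def grads_with_tf_gradients():
    # Workaround style: build the loss in the graph and ask for
    # symbolic gradients; no tape is involved.
    loss = tf.reduce_sum(tf.square(tf.matmul(x, w)))
    return tf.gradients(loss, [w])


tape_grad = grads_with_tape()[0]
graph_grad = grads_with_tf_gradients()[0]
print(tape_grad.numpy())
print(graph_grad.numpy())
```

Both functions should return the same gradient for this toy loss, so the swap changes how the gradient is obtained rather than its value.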