aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

accessing Jupyter notebook remotely #734

Closed tahsintahsin closed 12 months ago

tahsintahsin commented 1 year ago

Hello, I am working on an inf1 instance running Ubuntu 22. I followed the instructions in the documentation and installed the drivers and SDK without any errors. The instructions create a venv and install Jupyter Notebook in it. I followed the rest of the tutorial and ran the jupyter notebook command, but no link appeared in the terminal output that I could connect to remotely from my local machine. Thinking it might be a venv issue, I repeated all the steps, installed everything globally, and ran Jupyter Notebook from the terminal again, this time outside a venv. The output was the same: no URL appears as shown in the documentation. I also tried the troubleshooting section for Jupyter notebook connections, but that did not help either. How can I proceed now? Thanks

jyang-aws commented 1 year ago

Hi @tahsintahsin,

It sounds like a Jupyter notebook setup issue prior to running a job on inf1. Could you double check the installation steps in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22 and settings?

pip install ipykernel
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels

Meanwhile, you can also run the python scripts directly without invoking a notebook: create a python script say test.py and run it with python3 test.py.

tahsintahsin commented 1 year ago

Hi @jyang-aws, I was able to access the URL by running Jupyter notebook with --allow-root. However, I am now stuck further along in the process. I am trying to convert a rasa model (TF based) into an inf1-compiled version, for which I first need to install the rasa Python package, but I am seeing a lot of package version incompatibilities. I tried to solve some of them, but there are many. Before I mark it as undoable, can you suggest any overall tips? Maybe someone has tried a similar task and you have some ideas. Many thanks
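For anyone landing here with the same notebook problem, here is a rough sketch of the two commands involved: starting Jupyter on the instance and forwarding its port over SSH. The host name, user name, and key path are placeholders, and the helper functions are made up for illustration; only the `--no-browser`, `--port`, and `--allow-root` flags of Jupyter and the `-N`, `-L`, and `-i` options of ssh are real.

```python
import shlex


def jupyter_server_command(port=8888, allow_root=False):
    """Build the command that starts Jupyter on the remote instance."""
    cmd = ["jupyter", "notebook", "--no-browser", f"--port={port}"]
    if allow_root:
        # Jupyter refuses to start for the root user unless this flag is given.
        cmd.append("--allow-root")
    return shlex.join(cmd)


def ssh_tunnel_command(host, user="ubuntu", port=8888, key_path=None):
    """Build the command that forwards the remote Jupyter port to localhost."""
    cmd = ["ssh", "-N", "-L", f"{port}:localhost:{port}"]
    if key_path:
        cmd += ["-i", key_path]
    cmd.append(f"{user}@{host}")
    return shlex.join(cmd)


if __name__ == "__main__":
    # Run the first command on the instance, the second on the local machine.
    print(jupyter_server_command(allow_root=True))
    # Placeholder host and key file; substitute your own.
    print(ssh_tunnel_command("ec2-host.example.com", key_path="my-key.pem"))
```

With the tunnel open, the URL that Jupyter prints (including its token) can be opened in a local browser at the forwarded port.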

tahsintahsin commented 1 year ago

Hi again @jyang-aws, rather than using rasa, I decided to use the bare LaBSE model and replace the rasa code with custom code of my own, because I really want to test the performance of Inferentia servers so that we might go to production with them. I pulled the model from TensorFlow Hub, loaded it, and ran the example given on the model page: https://tfhub.dev/google/LaBSE/2. It works fine. Then I tried to compile it for Inferentia, and CPU usage kept going over 100%. I started increasing the instance size and even got to inf1.24xlarge, but the CPU issue persisted and my compilation eventually got stuck with the output below. Can you please recommend an instance on which I can compile LaBSE? Thanks

2023-08-31 13:32:31.837863: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-31 13:32:31.838096: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-08-31 13:33:18.741540: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-31 13:33:18.741727: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-08-31 13:33:41.959279: I tensorflow/neuron/grappler/convert/segment.cc:456] There are 9 ops of 4 different types in the graph that are not compiled by neuron-cc: GatherV2, OneHot, Placeholder, NoOp, (For more information see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/neuron-cc-ops/neuron-cc-ops-tensorflow.html).
2023-08-31 13:33:54.209354: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-08-31 13:33:54.209505: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-08-31 13:34:02.299414: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-08-31 13:34:02.756860: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.

jyang-aws commented 1 year ago

Hi @tahsintahsin

Could you try inf2.24xlarge or inf2.48xlarge? They have more vCPUs and memory, and a newer version of the accelerator.

tahsintahsin commented 1 year ago

Hello @jyang-aws I just tried inf2 24xlarge, and got the issue below:

2023-09-05 12:51:27.988503: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-09-05 12:51:27.988658: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-09-05 12:51:55.036299: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-09-05 12:51:55.036443: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-09-05 12:52:06.698203: I tensorflow/neuron/grappler/convert/segment.cc:456] There are 8 ops of 3 different types in the graph that are not compiled by neuron-cc: OneHot, Placeholder, NoOp, (For more information see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/neuron-cc-ops/neuron-cc-ops-tensorflow.html).
2023-09-05 12:52:14.565598: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-09-05 12:52:14.565768: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-09-05 12:52:17.704945: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-05 12:52:17.822686: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var 'MLIR_CRASH_REPRODUCER_DIRECTORY' to enable.

Compiler status PASS

AttributeError Traceback (most recent call last)
Cell In[6], line 1
----> 1 model_neuron = tfnx.trace(encoder, preprocessor(english_sentences))

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuronx/_trace.py:8, in trace(func, example_inputs, subgraph_builder_function)
      6 def trace(func, example_inputs, subgraph_builder_function=None):
      7     with _neuronx_cc_context():
----> 8         return tfn_trace(func, example_inputs, subgraph_builder_function)

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:241, in trace(func, example_inputs, subgraph_builder_function)
    238     model = AwsNeuronModel(cfunc, func.structured_outputs, real_op_count, ordered_weights=ordered_weights)
    239 else:
    240     # wrap GraphDef as a WrappedFunction
--> 241     cfunc = _wrap_graph_def_as_concrete_function(graph_def, func)
    242     # wrap ConcreteFunction as a keras model
    243     model = AwsNeuronModel(cfunc, func.structured_outputs, real_op_count)

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:605, in _wrap_graph_def_as_concrete_function(graph_def, func_ref)
    601 def _wrap_graph_def_as_concrete_function(graph_def, func_ref):
    602     # Note: if input_names is a dictionary (such as '{ts.name: ts.name for ts in example_inputs}'),
    603     # then the WrappedFunction may occationally have feeding tensors going to the wrong inputs.
    604     input_names = _get_input_names(func_ref)
--> 605     output_names = _get_output_names(func_ref)
    606     cfunc = wrap_function.function_from_graph_def(graph_def, input_names, output_names)
    608     # TODO: remove this hack once https://github.com/tensorflow/tensorflow/blob/v2.3.1/tensorflow/python/eager/wrap_function.py#L377 is fixed

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:576, in _get_output_names(func)
    572 if isinstance(structured_outputs, dict):
    573     # return a map from output argument name to symbolic tensor name
    574     # in order to let the WrappedFunction's return dictionary have the correct keys
    575     tensor_specs = nest.flatten(structured_outputs, expand_composites=True)
--> 576     tensor_spec_name_map = {spec.name: name for name, spec in structured_outputs.items()}
    577     tensor_spec_names = [tensor_spec_name_map[spec.name] for spec in tensor_specs]
    578     return {name: ts.name for ts, name in zip(outputs, tensor_spec_names)}

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:576, in <dictcomp>(.0)
    572 if isinstance(structured_outputs, dict):
    573     # return a map from output argument name to symbolic tensor name
    574     # in order to let the WrappedFunction's return dictionary have the correct keys
    575     tensor_specs = nest.flatten(structured_outputs, expand_composites=True)
--> 576     tensor_spec_name_map = {spec.name: name for name, spec in structured_outputs.items()}
    577     tensor_spec_names = [tensor_spec_name_map[spec.name] for spec in tensor_specs]
    578     return {name: ts.name for ts, name in zip(outputs, tensor_spec_names)}

AttributeError: 'list' object has no attribute 'name'

Should I try the largest one, inf2.48xlarge? Does the error look related to the instance size?

Another question is, I am not using --neuroncore-pipeline-cores arg because I am running from Jupyter notebook. How does it behave in that case ? Does the compiler use all available cores by default ?

Thanks

jyang-aws commented 1 year ago

@tahsintahsin This is a separate issue, not related to instance type you used. Could you share with us a minimum test case to reproduce? As to your question about --neuroncore-pipeline-cores, you can try:

tfn.saved_model.compile(....
                        compiler_args = ['--neuroncore-pipeline-cores', '16'])

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuroncore-pipeline.html

tahsintahsin commented 1 year ago

@jyang-aws Thanks for the help again. After installing tensorflow_text with pip install tensorflow_text==2.10 (I think your TensorFlow version was 2.10; the matching version of tensorflow_text installs with no issues), what I am running is:

import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_text as text
import numpy as np
import tensorflow_neuronx as tfnx

def normalization(embeds):
    norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
    return embeds/norms

english_sentences = tf.constant(["dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog."])
italian_sentences = tf.constant(["cane", "I cuccioli sono carini.", "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane."])
japanese_sentences = tf.constant(["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"])

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2")

english_embeds = encoder(preprocessor(english_sentences))["default"]
japanese_embeds = encoder(preprocessor(japanese_sentences))["default"]
italian_embeds = encoder(preprocessor(italian_sentences))["default"]

english_embeds = normalization(english_embeds)
japanese_embeds = normalization(japanese_embeds)
italian_embeds = normalization(italian_embeds)

print (np.matmul(english_embeds, np.transpose(italian_embeds)))

print (np.matmul(english_embeds, np.transpose(japanese_embeds)))

print (np.matmul(italian_embeds, np.transpose(japanese_embeds)))

neuron_labse = tfnx.trace(encoder,preprocessor(english_sentences))
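As a side note on what the script prints: L2-normalizing each row and then multiplying one matrix by the transpose of the other gives a matrix of cosine similarities. A tiny NumPy-only sketch with made-up 2-D vectors (not real LaBSE embeddings) shows the same pattern:

```python
import numpy as np


def normalization(embeds):
    # Divide each row by its L2 norm so every row has unit length.
    norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
    return embeds / norms


# Toy "embeddings", chosen only so the arithmetic is easy to follow.
a = np.array([[3.0, 4.0], [1.0, 0.0]])
b = np.array([[6.0, 8.0], [0.0, 2.0]])

# Entry (i, j) is the cosine similarity between row i of a and row j of b.
sims = np.matmul(normalization(a), np.transpose(normalization(b)))
print(sims)
```

Rows that point in the same direction, such as `[3, 4]` and `[6, 8]`, score 1.0 regardless of their lengths, and orthogonal rows score 0.0.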

tahsintahsin commented 1 year ago

@jyang-aws I have tried the same code on inf2.48xlarge and the output is below. The compiler status is different, but it ends with the same error:

2023-09-06 12:10:15.670511: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-09-06 12:10:15.670740: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-09-06 12:10:43.120439: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-09-06 12:10:43.120734: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-09-06 12:10:54.844171: I tensorflow/neuron/grappler/convert/segment.cc:456] There are 9 ops of 4 different types in the graph that are not compiled by neuron-cc: GatherV2, OneHot, Placeholder, NoOp, (For more information see https://awsdocs-neuron.readthedocs-hosted.com/en/latest/release-notes/neuron-cc-ops/neuron-cc-ops-tensorflow.html).
2023-09-06 12:11:02.727643: I tensorflow/core/grappler/devices.cc:66] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0
2023-09-06 12:11:02.727974: I tensorflow/core/grappler/clusters/single_machine.cc:358] Starting new session
2023-09-06 12:11:05.606759: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-06 12:11:05.727959: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
....
Compiler status ERROR
WARNING:tensorflow:neuron-cc failed with: 2023-09-06 12:13:00.761933: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-09-06 12:13:21.109072: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2023-09-06 12:13:21.110063: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-06 12:13:43.159410: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: An Internal Compiler Error has occurred
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Error message: /usr/local/lib/python3.10/dist-packages/tensorflow-plugins/libaws_neuron_plugin.so: undefined symbol: _ZTIN10tensorflow9AllocatorE
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Error class: NotFoundError
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Error location: Unknown
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Command line: /usr/local/bin/neuron-cc compile /tmp/tmpsp95qbbl/hlo_module.pb --framework XLA --verbose=35 --pipeline compile SaveTemps --output /tmp/tmpsp95qbbl/hlo_module.neff --fast-math=none --fp32-cast=matmult-fp16 --enable-fast-context-switch
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Internal details:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/CommandDriver.py", line 224, in neuroncc.driver.CommandDriver.CommandDriver.run
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/commands/CompileCommand.py", line 580, in neuroncc.driver.commands.CompileCommand.CompileCommand.run
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/commands/CompileCommand.py", line 558, in neuroncc.driver.commands.CompileCommand.CompileCommand.runPipeline
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/commands/CompileCommand.py", line 562, in neuroncc.driver.commands.CompileCommand.CompileCommand.runPipeline
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/Job.py", line 289, in neuroncc.driver.Job.SingleInputJob.run
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/Pipeline.py", line 30, in neuroncc.driver.Pipeline.Pipeline.runSingleInput
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/Job.py", line 289, in neuroncc.driver.Job.SingleInputJob.run
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/Pipeline.py", line 30, in neuroncc.driver.Pipeline.Pipeline.runSingleInput
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/Job.py", line 289, in neuroncc.driver.Job.SingleInputJob.run
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/jobs/Frontend.py", line 431, in neuroncc.driver.jobs.Frontend.Frontend.runSingleInput
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/jobs/Frontend.py", line 210, in neuroncc.driver.jobs.Frontend.Frontend.runXLAFrontend
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "neuroncc/driver/jobs/support/Frameworks.py", line 1013, in neuroncc.driver.jobs.support.Frameworks.XLAInterface.loadModel
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "/usr/local/lib/python3.10/dist-packages/tensorflow/__init__.py", line 441, in <module>
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: _ll.load_library(_plugin_dir)
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/framework/load_library.py", line 151, in load_library
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: py_tf.TF_LoadLibrary(lib)
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Version information:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Neuron Compiler version 1.18.0.0+ba9e7fa5a
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: HWM version 1.15.0.0-ea99c2b8f
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: NEFF version Dynamic
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: TVM version 1.17.0.0+0
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: NumPy version 1.23.5
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: MXNet not available
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: TF not available
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]:
09/06/2023 12:13:59 PM ERROR 17181 [neuron-cc]: Artifacts stored in: /tmp/tmpsp95qbbl



AttributeError Traceback (most recent call last)
Cell In[11], line 1
----> 1 model_neuron = tfn.trace(encoder, preprocessor(english_sentences))

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:241, in trace(func, example_inputs, subgraph_builder_function)
    238     model = AwsNeuronModel(cfunc, func.structured_outputs, real_op_count, ordered_weights=ordered_weights)
    239 else:
    240     # wrap GraphDef as a WrappedFunction
--> 241     cfunc = _wrap_graph_def_as_concrete_function(graph_def, func)
    242     # wrap ConcreteFunction as a keras model
    243     model = AwsNeuronModel(cfunc, func.structured_outputs, real_op_count)

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:605, in _wrap_graph_def_as_concrete_function(graph_def, func_ref)
    601 def _wrap_graph_def_as_concrete_function(graph_def, func_ref):
    602     # Note: if input_names is a dictionary (such as {ts.name: ts.name for ts in example_inputs}),
    603     # then the WrappedFunction may occationally have feeding tensors going to the wrong inputs.
    604     input_names = _get_input_names(func_ref)
--> 605     output_names = _get_output_names(func_ref)
    606     cfunc = wrap_function.function_from_graph_def(graph_def, input_names, output_names)
    608     # TODO: remove this hack once https://github.com/tensorflow/tensorflow/blob/v2.3.1/tensorflow/python/eager/wrap_function.py#L377 is fixed

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:576, in _get_output_names(func)
    572 if isinstance(structured_outputs, dict):
    573     # return a map from output argument name to symbolic tensor name
    574     # in order to let the WrappedFunction's return dictionary have the correct keys
    575     tensor_specs = nest.flatten(structured_outputs, expand_composites=True)
--> 576     tensor_spec_name_map = {spec.name: name for name, spec in structured_outputs.items()}
    577     tensor_spec_names = [tensor_spec_name_map[spec.name] for spec in tensor_specs]
    578     return {name: ts.name for ts, name in zip(outputs, tensor_spec_names)}

File ~/aws_neuron_venv_tensorflow/lib/python3.10/site-packages/tensorflow_neuron/python/_trace.py:576, in <dictcomp>(.0)
    572 if isinstance(structured_outputs, dict):
    573     # return a map from output argument name to symbolic tensor name
    574     # in order to let the WrappedFunction's return dictionary have the correct keys
    575     tensor_specs = nest.flatten(structured_outputs, expand_composites=True)
--> 576     tensor_spec_name_map = {spec.name: name for name, spec in structured_outputs.items()}
    577     tensor_spec_names = [tensor_spec_name_map[spec.name] for spec in tensor_specs]
    578     return {name: ts.name for ts, name in zip(outputs, tensor_spec_names)}

AttributeError: 'list' object has no attribute 'name'

tahsintahsin commented 1 year ago

@jyang-aws any updates on the issue? thanks

jeffhataws commented 1 year ago

@tahsintahsin, we have reproduced your issue on tensorflow-neuronx and will investigate. Thanks!

tahsintahsin commented 1 year ago

@jeffhataws Hello again, any updates on the issue? We would like to use Inferentia in our production system if possible, so we are waiting for your update.

jeffhataws commented 1 year ago

Thanks for checking back @tahsintahsin.

It appears your model's output is a dictionary in which some of the values are lists, which we do not currently support. We may work on supporting this in the future. You can unblock yourself by wrapping the model so that it returns a list instead of a dictionary, like so:

class NeuronEncoderWrapper(tf.keras.Model):
    def __init__(self, model):
        super().__init__()
        self.model = model
    def __call__(self, example_inputs):
        intermediate = self.model(example_inputs)
        return [intermediate['encoder_outputs'], intermediate['default'], intermediate['pooled_output'], intermediate['sequence_output']]

...

wrapped_encoder = NeuronEncoderWrapper(encoder)
neuron_labse = tfnx.trace(wrapped_encoder, example_input)
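To make the failure mode concrete, here is a plain-Python sketch that reproduces the error without TensorFlow or Neuron installed. `FakeSpec` is a hypothetical stand-in for a tensor spec, and the output names are invented; the dict comprehension is the one shown at line 576 of `_trace.py` in the traceback above.

```python
from dataclasses import dataclass


@dataclass
class FakeSpec:
    """Hypothetical stand-in for a tensor spec; only .name matters here."""
    name: str


def name_map(structured_outputs):
    # Same comprehension as the traceback's _trace.py line 576: it assumes
    # every dictionary value is a single spec with a .name attribute.
    return {spec.name: name for name, spec in structured_outputs.items()}


# Every value is a single spec: the mapping is built without problems.
flat_outputs = {
    "default": FakeSpec("default:0"),
    "pooled_output": FakeSpec("pooled:0"),
}
print(name_map(flat_outputs))

# One value is a list of specs (like 'encoder_outputs'): .name is looked up
# on the list itself, which raises the AttributeError seen in the thread.
nested_outputs = {
    "default": FakeSpec("default:0"),
    "encoder_outputs": [FakeSpec("layer_0:0"), FakeSpec("layer_1:0")],
}
error_message = None
try:
    name_map(nested_outputs)
except AttributeError as err:
    error_message = str(err)
    print("AttributeError:", error_message)
```

Returning a list from the wrapper sidesteps this branch, since the traceback shows the comprehension sitting under `if isinstance(structured_outputs, dict)`.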

For your reference I have also made the modifications to your original script, which you can test.

import tensorflow_hub as hub
import tensorflow as tf
import tensorflow_text as text
import numpy as np
import tensorflow_neuronx as tfnx

def normalization(embeds):
    norms = np.linalg.norm(embeds, 2, axis=1, keepdims=True)
    return embeds/norms

english_sentences = tf.constant(["dog", "Puppies are nice.", "I enjoy taking long walks along the beach with my dog."])
italian_sentences = tf.constant(["cane", "I cuccioli sono carini.", "Mi piace fare lunghe passeggiate lungo la spiaggia con il mio cane."])
japanese_sentences = tf.constant(["犬", "子犬はいいです", "私は犬と一緒にビーチを散歩するのが好きです"])

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-preprocess/2")
encoder = hub.KerasLayer("https://tfhub.dev/google/LaBSE/2")

class NeuronEncoderWrapper(tf.keras.Model):
    def __init__(self, model):
        super().__init__()
        self.model = model
    def __call__(self, example_inputs):
        intermediate = self.model(example_inputs)
        return [intermediate['encoder_outputs'], intermediate['default'], intermediate['pooled_output'], intermediate['sequence_output']]

english_embeds = encoder(preprocessor(english_sentences))["default"]
japanese_embeds = encoder(preprocessor(japanese_sentences))["default"]
italian_embeds = encoder(preprocessor(italian_sentences))["default"]

english_embeds = normalization(english_embeds)
japanese_embeds = normalization(japanese_embeds)
italian_embeds = normalization(italian_embeds)

print (np.matmul(english_embeds, np.transpose(italian_embeds)))

print (np.matmul(english_embeds, np.transpose(japanese_embeds)))

print (np.matmul(italian_embeds, np.transpose(japanese_embeds)))

example_input = preprocessor(english_sentences)
wrapped_encoder = NeuronEncoderWrapper(encoder)
neuron_labse = tfnx.trace(wrapped_encoder, example_input)

for i in range(1000):
    print(neuron_labse(example_input))
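
Since the goal in this thread is to evaluate Inferentia performance, a timing loop may be more informative than printing 1000 outputs. The helper below is only a sketch using the standard library; `benchmark` and its parameters are made-up names, and the stand-in workload in the demo exists only so the snippet runs without a Neuron device.

```python
import statistics
import time


def benchmark(fn, example_input, warmup=10, iterations=100):
    """Time repeated calls to fn(example_input) and summarize latency in ms."""
    # Warm-up calls are excluded so one-time setup cost does not skew results.
    for _ in range(warmup):
        fn(example_input)

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn(example_input)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    # Nearest-rank style index for the 99th percentile, clamped to the list.
    p99_index = min(len(latencies_ms) - 1, int(0.99 * len(latencies_ms)))
    return {
        "iterations": iterations,
        "mean_ms": statistics.fmean(latencies_ms),
        "p50_ms": statistics.median(latencies_ms),
        "p99_ms": latencies_ms[p99_index],
    }


if __name__ == "__main__":
    # Stand-in workload; replace with the traced model and its example input.
    print(benchmark(sum, list(range(1000)), warmup=5, iterations=50))
```

With the script above, the call would be `benchmark(neuron_labse, example_input)`, and running the same helper on the untraced `encoder` gives a CPU baseline for comparison.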

tahsintahsin commented 12 months ago

Hello @jeffhataws, many thanks, this time it worked :)