cms-ml / cmsml

Python Package of the CMS Machine Learning Group
https://cmsml.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Exception: tf2xla_supported_ops command failed with exit code 127 #21

Open LinGeLin opened 3 days ago

LinGeLin commented 3 days ago

help!

```
cmsml_check_aot_compatibility model/ --serving-key predict
```

```
Traceback (most recent call last):
  File "/usr/local/bin/cmsml_check_aot_compatibility", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/cmsml/scripts/check_aot_compatibility.py", line 151, in main
    check_aot_compatibility(
  File "/usr/local/lib/python3.9/site-packages/cmsml/scripts/check_aot_compatibility.py", line 38, in check_aot_compatibility
    devices, ops = print_op_table(devices, filter_ops=op_names, table_format=table_format)
  File "/usr/local/lib/python3.9/site-packages/cmsml/scripts/check_aot_compatibility.py", line 67, in print_op_table
    ops = OpsData(devices)
  File "/usr/local/lib/python3.9/site-packages/cmsml/tensorflow/aot.py", line 50, in __init__
    self._determine_ops(devices)
  File "/usr/local/lib/python3.9/site-packages/cmsml/tensorflow/aot.py", line 155, in _determine_ops
    all_op_dicts = [
  File "/usr/local/lib/python3.9/site-packages/cmsml/tensorflow/aot.py", line 156, in <listcomp>
    self.parse_ops_table(device=device)
  File "/usr/local/lib/python3.9/site-packages/cmsml/tensorflow/aot.py", line 94, in parse_ops_table
    table = cls.read_ops_table(device)
  File "/usr/local/lib/python3.9/site-packages/cmsml/tensorflow/aot.py", line 74, in read_ops_table
    raise Exception(f"tf2xla_supported_ops command failed with exit code {code}")
Exception: tf2xla_supported_ops command failed with exit code 127
```
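For reference, exit code 127 is the shell's "command not found" status, so the failure above typically means the `tf2xla_supported_ops` binary is simply not on `PATH`. A minimal, TensorFlow-independent way to check (a sketch; the tool name is taken from the traceback above):

```python
import shutil

# Name of the binary that cmsml invokes (taken from the traceback above).
tool = "tf2xla_supported_ops"

# shutil.which mirrors the shell's PATH lookup; None means a shell would
# fail with exit code 127 ("command not found"), matching the error above.
path = shutil.which(tool)
if path is None:
    print(f"{tool} not found on PATH, which explains exit code 127")
else:
    print(f"found {tool} at {path}")
```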

riga commented 3 days ago

Can you provide some more details on what you are doing, which (if any) CMSSW version you are using, etc.? @Bogdan-Wiederspan

LinGeLin commented 3 days ago

> Can you provide some more details on what you are doing, which (if any) CMSSW version you are using, etc.? @Bogdan-Wiederspan

I am referring to this page to check whether a TensorFlow SavedModel supports AOT (ahead-of-time compilation). The image I am using is the one found on this page.

I have tried both:

- cmsml/cmsml:3.10
- cmsml/cmsml:latest

LinGeLin commented 3 days ago

This command also does not work.

```
cmsml_compile_tf_graph model/ compiled_model -b 130 --input-serving-key predict
```

```
2024-10-23 13:14:38.594176: I tensorflow/core/util/port.cc:113] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-10-23 13:14:38.632246: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX512_FP16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-10-23 13:14:39.179730: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:absl:20036 is not a valid tf.function parameter name. Sanitizing to arg_20036.
WARNING:absl:14814 is not a valid tf.function parameter name. Sanitizing to arg_14814.
WARNING:absl:3797 is not a valid tf.function parameter name. Sanitizing to arg_3797.
WARNING:absl:20016 is not a valid tf.function parameter name. Sanitizing to arg_20016.
WARNING:absl:20055 is not a valid tf.function parameter name. Sanitizing to arg_20055.
Traceback (most recent call last):
  File "/usr/local/bin/cmsml_compile_tf_graph", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/cmsml/scripts/compile_tf_graph.py", line 255, in main
    compile_tf_graph(
  File "/usr/local/lib/python3.9/site-packages/cmsml/scripts/compile_tf_graph.py", line 74, in compile_tf_graph
    for key, spec in model.signatures["serving_default"].structured_input_signature[1].items():
  File "/usr/local/lib/python3.9/site-packages/tensorflow/python/saved_model/signature_serialization.py", line 302, in __getitem__
    return self._signatures[key]
KeyError: 'serving_default'
```

🤔

valsdav commented 3 days ago

Hi! Can you give us a bit more detail, please? Can you describe the model you are trying to convert and how you saved it? Did you define the model with the TF library installed alongside cmsml?

Bogdan-Wiederspan commented 3 days ago

Hi, thank you for bringing this up. After reviewing the AOT conversion code, we believe we have an idea of what might be causing the issue in your case. To help us confirm, would it be possible for you to share the saved model you're trying to convert? We'd like to run a few tests to ensure we're addressing the problem accurately.

Regarding the conversion table error: it points to an issue with invoking the TensorFlow tool responsible for generating the table, `tf2xla_supported_ops`. This tool needs to be compiled alongside TensorFlow, and I'm not sure that is done for the pip version of TensorFlow. Could you tell me whether you are running this from within CMSSW or from your own environment?

LinGeLin commented 2 days ago

> Hi, thank you for bringing this up. After reviewing the AOT conversion code, we believe we have an idea of what might be causing the issue in your case. To help us confirm, would it be possible for you to share the saved model you're trying to convert? We'd like to run a few tests to ensure we're addressing the problem accurately.
>
> Regarding the conversion table error: it points to an issue with invoking the TensorFlow tool responsible for generating the table, `tf2xla_supported_ops`. This tool needs to be compiled alongside TensorFlow, and I'm not sure that is done for the pip version of TensorFlow. Could you tell me whether you are running this from within CMSSW or from your own environment?

Thank you for your response. This is my first attempt to use cmsml for AOT conversion. My model is a PS recommendation model, and I apologize, but after consulting with my leader, I cannot share it as the project is still under confidentiality. The model's training and saving were both done in our own TensorFlow environment. Now, I am trying to optimize the performance during the inference stage and want to give AOT a try. tfcompile seems to be quite complex to operate, while cmsml appears to be simpler. The conversion environment uses the two images mentioned above. No additional installations were made specifically for this purpose.

@valsdav @Bogdan-Wiederspan

Bogdan-Wiederspan commented 1 day ago

> Hi, thank you for bringing this up. After reviewing the AOT conversion code, we believe we have an idea of what might be causing the issue in your case. To help us confirm, would it be possible for you to share the saved model you're trying to convert? We'd like to run a few tests to ensure we're addressing the problem accurately.
>
> Regarding the conversion table error: it points to an issue with invoking the TensorFlow tool responsible for generating the table, `tf2xla_supported_ops`. This tool needs to be compiled alongside TensorFlow, and I'm not sure that is done for the pip version of TensorFlow. Could you tell me whether you are running this from within CMSSW or from your own environment?

> Thank you for your response. This is my first attempt to use cmsml for AOT conversion. My model is a PS recommendation model, and I apologize, but after consulting with my leader, I cannot share it as the project is still under confidentiality. The model's training and saving were both done in our own TensorFlow environment. Now, I am trying to optimize the performance during the inference stage and want to give AOT a try. tfcompile seems to be quite complex to operate, while cmsml appears to be simpler. The conversion environment uses the two images mentioned above. No additional installations were made specifically for this purpose.
>
> @valsdav @Bogdan-Wiederspan

Hi again,

1) As I thought, the root of this is a minor bug in our compilation code: the input_serving_key isn't passed through as intended, so it defaults to serving_default. This means the error only appears when your model uses a custom signature, like predict. I'll open a PR shortly to address this and include a test to catch this edge case in the future. Thank you for pointing it out!

@LinGeLin, if you'd prefer not to wait for the PR, you can make a quick local fix. Just navigate to: https://github.com/cms-ml/cmsml/blob/master/cmsml/scripts/compile_tf_graph.py#L74

Then, replace `model.signatures["serving_default"]` with `model.signatures[input_serving_key]`. This should do the trick.
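The essence of the fix can be sketched without TensorFlow by modelling `model.signatures` as a plain mapping (the helper name and error message below are illustrative, not cmsml's actual code):

```python
def get_signature(signatures, input_serving_key="serving_default"):
    """Look up a serving signature, honouring the user-supplied key
    instead of hard-coding "serving_default"."""
    if input_serving_key not in signatures:
        raise KeyError(
            f"signature {input_serving_key!r} not found; "
            f"available keys: {sorted(signatures)}"
        )
    return signatures[input_serving_key]


# Stand-in for model.signatures of a model exported with a custom key:
signatures = {"predict": "<ConcreteFunction>"}

# Hard-coding "serving_default" reproduces the KeyError from the traceback;
# passing the actual key through succeeds:
print(get_signature(signatures, "predict"))
```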

2) As for the compatibility check, it's working as expected; however, the TensorFlow installed via pip doesn't ship the `tf2xla_supported_ops` tool, which has to be compiled alongside TensorFlow. My suggestion would be to run the check inside a TensorFlow Docker image that includes the necessary tools, and then pip-install cmsml within that environment.
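A defensive wrapper for the workflow above might look like this (a sketch: the command names come from this thread, and the suggestion to switch images is the one made above):

```shell
# Only run the cmsml check when the XLA helper tool is actually available;
# a bare invocation would otherwise die with exit code 127 ("command not found").
if command -v tf2xla_supported_ops >/dev/null 2>&1; then
    cmsml_check_aot_compatibility model/ --serving-key predict
else
    echo "tf2xla_supported_ops not on PATH (pip TensorFlow does not ship it)" >&2
    echo "try a TensorFlow image that bundles the XLA tools, then: pip install cmsml" >&2
fi
```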