FedCampus / FedKit

Mobile Federated Learning development kit for FedCampus
MIT License

convert_model.mnist_eg.tf errors, possibly due to different TF versions #35

Open theta-lin opened 7 months ago

theta-lin commented 7 months ago
> python3 --version
Python 3.11.8
> pip list | grep tensorflow
tensorflow                   2.16.1

I encountered errors when running python3 -m convert_model.mnist_eg.tf. The tensorflow version is not pinned in convert_model/requirements.txt, so TensorFlow may behave differently in my version. (The coremltools version is also unpinned, which might cause similar problems.)
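
To make version mismatches like this easier to spot, here is a minimal sketch that reports the installed versions of the relevant packages and renders them as pinned requirement lines. The helper names installed_versions and pinned_lines are hypothetical and not part of FedKit; only the standard-library importlib.metadata is used.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


def pinned_lines(versions):
    """Render `name==version` lines for the packages that are installed."""
    return [
        f"{name}=={version}"
        for name, version in versions.items()
        if version is not None
    ]
```

Calling pinned_lines(installed_versions(["tensorflow", "coremltools"])) in a working environment would give candidate lines for convert_model/requirements.txt, assuming exact pins are what the project wants.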

2024-04-17 15:25:44.036395: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-17 15:25:44.040311: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-17 15:25:44.091677: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-17 15:25:46.018733: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-04-17 15:25:47.542628: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-17 15:25:47.544149: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/data/dku/fedcampus/FedKit/convert_model/mnist_eg/tf.py", line 32, in <module>
    tflite()
  File "/data/dku/fedcampus/FedKit/convert_model/mnist_eg/tf.py", line 25, in tflite
    save_model(model, SAVED_MODEL_DIR)
  File "/data/dku/fedcampus/FedKit/convert_model/tflite.py", line 66, in save_model
    parameters = model.parameters.get_concrete_function()
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1251, in get_concrete_function
    concrete = self._get_concrete_function_garbage_collected(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1221, in _get_concrete_function_garbage_collected
    self._initialize(args, kwargs, add_initializers_to=initializers)
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 696, in _initialize
    self._concrete_variable_creation_fn = tracing_compilation.trace_function(
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 178, in trace_function
    concrete_function = _maybe_define_function(
                        ^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 283, in _maybe_define_function
    concrete_function = _create_concrete_function(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/tracing_compilation.py", line 310, in _create_concrete_function
    traced_func_graph = func_graph_module.func_graph_from_py_func(
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/framework/func_graph.py", line 1059, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 599, in wrapped_fn
    out = weak_wrapped_fn().__wrapped__(*args, **kwds)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/polymorphic_function.py", line 1719, in bound_method_wrapper
    return wrapped_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 52, in autograph_handler
    raise e.ag_error_metadata.to_exception(e)
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/autograph_util.py", line 41, in autograph_handler
    return api.converted_call(
           ^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/autograph/impl/api.py", line 439, in converted_call
    result = converted_f(*effective_args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/__autograph_generated_file__ud6qml.py", line 12, in tf__parameters
    retval_ = {f'a{ag__.ld(index)}': ag__.converted_call(ag__.ld(weight).read_value, (), None, fscope) for index, weight in ag__.converted_call(ag__.ld(enumerate), (ag__.ld(self).model.weights,), None, fscope)}
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/__autograph_generated_file__ud6qml.py", line 12, in <dictcomp>
    retval_ = {f'a{ag__.ld(index)}': ag__.converted_call(ag__.ld(weight).read_value, (), None, fscope) for index, weight in ag__.converted_call(ag__.ld(enumerate), (ag__.ld(self).model.weights,), None, fscope)}
                                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: in user code:

    File "/data/dku/fedcampus/FedKit/convert_model/tflite.py", line 31, in parameters  *
        f"a{index}": weight.read_value()

    AttributeError: 'Variable' object has no attribute 'read_value'
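
The AttributeError suggests that, in this TensorFlow/Keras combination, model weights are no longer objects that expose read_value() directly. A hedged sketch of a reader that tolerates both cases follows. It assumes that newer Keras weights expose the underlying variable through a value attribute; read_weight and parameters_dict are hypothetical helpers, not FedKit or TensorFlow APIs.

```python
def read_weight(weight):
    """Return the current value of a model weight across Keras versions.

    Assumption: older weights are tf.Variable objects with read_value(),
    while newer Keras weights wrap the backend variable in `.value`.
    """
    if hasattr(weight, "read_value"):
        # Older behaviour: the weight itself is readable.
        return weight.read_value()
    # Newer behaviour (assumed): read through the wrapped variable.
    return weight.value.read_value()


def parameters_dict(weights):
    """Build the {"a0": ..., "a1": ...} mapping used in the traceback above."""
    return {f"a{index}": read_weight(weight) for index, weight in enumerate(weights)}
```

Checking for read_value first keeps the old behaviour unchanged on TensorFlow versions where the original code already worked.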

After changing https://github.com/FedCampus/FedKit/blob/e203312add2c9fc1ebb5511bae8a52eb384814c4/convert_model/tflite.py#L31 to

f"a{index}": weight.value.read_value()

this error is resolved, but I then encountered another error:

2024-04-17 15:54:33.916521: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-17 15:54:33.920357: I external/local_tsl/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-04-17 15:54:33.969224: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-04-17 15:54:35.695073: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-04-17 15:54:37.321051: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-04-17 15:54:37.322647: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/data/dku/fedcampus/FedKit/convert_model/mnist_eg/tf.py", line 32, in <module>
    tflite()
  File "/data/dku/fedcampus/FedKit/convert_model/mnist_eg/tf.py", line 25, in tflite
    save_model(model, SAVED_MODEL_DIR)
  File "/data/dku/fedcampus/FedKit/convert_model/tflite.py", line 71, in save_model
    tf.saved_model.save(
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/saved_model/save.py", line 1392, in save
    save_and_return_nodes(obj, export_dir, signatures, options)
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/saved_model/save.py", line 1427, in save_and_return_nodes
    _build_meta_graph(obj, signatures, options, meta_graph_def))
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/saved_model/save.py", line 1642, in _build_meta_graph
    return _build_meta_graph_impl(obj, signatures, options, meta_graph_def)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/saved_model/save.py", line 1566, in _build_meta_graph_impl
    asset_info, exported_graph = _fill_meta_graph_def(
                                 ^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/saved_model/save.py", line 933, in _fill_meta_graph_def
    signatures = _generate_signatures(signature_functions, object_map, defaults)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/saved_model/save.py", line 655, in _generate_signatures
    outputs = object_map[function](**{
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/saved_model_exported_concrete.py", line 45, in __call__
    export_captures = _map_captures_to_created_tensors(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/saved_model_exported_concrete.py", line 74, in _map_captures_to_created_tensors
    _raise_untracked_capture_error(function.name, exterior, interior)
  File "/data/dku/fedcampus/FedKit/backend/.venv/lib/python3.11/site-packages/tensorflow/python/eager/polymorphic_function/saved_model_exported_concrete.py", line 98, in _raise_untracked_capture_error
    raise AssertionError(msg)
AssertionError: Tried to export a function which references an 'untracked' resource. TensorFlow objects (e.g. tf.Variable) captured by functions must be 'tracked' by assigning them to an attribute of a tracked object or assigned to an attribute of the main object directly. See the information below:
    Function name = b'__inference_signature_wrapper_1685'
    Captured Tensor = <ResourceHandle(name="loss/total/10", device="/job:localhost/replica:0/task:0/device:CPU:0", container="Anonymous", type="tensorflow::Var", dtype and shapes : "[ DType enum: 1, Shape: [] ]")>
    Trackable referencing this tensor = <tf.Variable 'loss/total:0' shape=() dtype=float32>
    Internal Tensor = Tensor("1637:0", shape=(), dtype=resource)
8 restore test results.

According to answers such as https://stackoverflow.com/questions/73416907/model-save-tried-to-export-a-function-which-references-untracked-resource-eve, the use of static members might cause this problem, but I don't know how to fix it in this project's case.
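
To narrow down which object is untracked, the assertion message itself can be parsed for the exported function and the variable it captured. The sketch below uses only the standard library; untracked_capture_info is a hypothetical helper, and the regular expressions assume the message layout shown in the log above. The variable name loss/total looks like it could belong to a Keras loss metric created at compile time, but that is a guess and not confirmed here.

```python
import re

# Patterns assume the message layout printed by TensorFlow in the log above.
FUNCTION_RE = re.compile(r"Function name = b'([^']+)'")
VARIABLE_RE = re.compile(
    r"Trackable referencing this tensor = <tf\.Variable '([^']+)'"
)


def untracked_capture_info(message):
    """Extract the function and variable names from an untracked-capture error."""
    function = FUNCTION_RE.search(message)
    variable = VARIABLE_RE.search(message)
    return {
        "function": function.group(1) if function else None,
        "variable": variable.group(1) if variable else None,
    }
```

Wrapping the save_model call in try/except AssertionError and passing str(error) to this helper would print just the two names, which may make it easier to see which attribute of the wrapper object needs to hold the variable.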

SichangHe commented 6 months ago

I reproduced this. Unfortunately, I don't know how to solve it.

Ideas:

  1. Try with the version here: https://github.com/adap/flower/blob/main/examples/android-kotlin/gen_tflite/pyproject.toml
  2. Check out TFLite's latest on-device training example and compare it to the 2022 one (which this converter is based on).