LeelaChessZero / lczero-training

For code etc relating to the network training process.

Implement Net-to-CoreML Conversion Script #222

Open ChinChangYang opened 8 months ago

ChinChangYang commented 8 months ago

This commit introduces the net_to_coreml.py script in the tf/ directory. The script converts a neural network file into a TensorFlow model and then converts that TensorFlow model into a CoreML model. The TensorFlow step mirrors the conversion already used in net_to_model.py.

Key features of the CoreML conversion include:

Test 1: 128x10 (PASS)

% python net_to_coreml.py --cfg 128x10.yaml-20210723-1032 weights_run2_744706.lc0          
TensorFlow version 2.15.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.
dataset:
  allow_less_chunks: true
  input_test: dev2/test/
  input_train: dev2/train/
  input_validation: dev2/validate/
  num_chunks: 1000000
  train_ratio: 0.9
gpu: 0
model:
  filters: 128
  residual_blocks: 10
  se_ratio: 4
name: 128x10-t74
training:
  batch_size: 1024
  lr_boundaries:
  - 120
  lr_values:
  - 4.0e-05
  - 4.0e-05
  mask_legal_moves: true
  max_grad_norm: 5.4
  moves_left_loss_weight: 1.0
  num_batch_splits: 1
  num_test_positions: 40000
  path: dev2/networks
  policy_loss_weight: 1.0
  q_ratio: 0
  renorm: true
  renorm_max_d: 0.0
  renorm_max_r: 1.0
  shuffle_size: 500000
  swa: true
  swa_max_n: 10
  swa_output: true
  swa_steps: 100
  test_steps: 500
  total_steps: 2000
  train_avg_report_steps: 200
  validation_steps: 500
  value_focus_min: 1.0
  value_focus_slope: 0.0
  value_loss_weight: 2.0
  warmup_steps: 1000

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
Wrote model to dev2/networks/128x10-t74/128x10-t74-0
Running TensorFlow Graph Passes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 20.80 passes/s]
Converting TF Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████████████████| 413/413 [00:00<00:00, 11943.65 ops/s]
Running MIL frontend_tensorflow2 pipeline: 100%|█████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 1628.68 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:00<00:00, 86.30 passes/s]
Running MIL backend_mlprogram pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 1404.81 passes/s]
Input names: ['input_planes']
Output names: ['output_policy', 'output_value', 'output_moves_left']
Rebuilding model with updated spec ...
Saving model ...
CoreML model saved at dev2/networks/128x10-t74/weights_run2_744706.lc0.mlpackage
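
Judging only from the two logs in this thread, the saved .mlpackage appears to land under the config's training path and network name, with ".mlpackage" appended to the input file name. The sketch below reproduces that pattern; `coreml_output_path` is a hypothetical helper written for illustration, not code taken from net_to_coreml.py, and stripping any directory part from the input file is an assumption.

```python
from pathlib import Path


def coreml_output_path(training_path: str, net_name: str, net_file: str) -> Path:
    """Hypothetical helper: rebuild the output location seen in the logs.

    training_path is the `training.path` value from the YAML config,
    net_name is the top-level `name` value, and net_file is the network
    file given on the command line.
    """
    # Assumption: only the file name of the input is kept, and the full
    # name (including its original extension) gets ".mlpackage" appended.
    return Path(training_path) / net_name / (Path(net_file).name + ".mlpackage")


# Values taken from the Test 1 log above.
print(coreml_output_path("dev2/networks", "128x10-t74",
                         "weights_run2_744706.lc0").as_posix())
```

The original extension is kept in the result, so a .pb.gz input would end in .pb.gz.mlpackage.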

Test 2: 512x19 (FAILED)

% python net_to_coreml.py --cfg 512x19-t80.yaml-20230507-0216 512x19-t81-swa-10061000.pb.gz
TensorFlow version 2.15.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.
dataset:
  allow_less_chunks: true
  input_test:
  - dev1/test/
  input_train:
  - dev1/train/
  input_validation: dev1/validate/
  num_chunks: 3000000
  test_workers: 8
  train_ratio: 0.9
  train_workers: 32
gpu: 0
model:
  default_activation: mish
  filters: 512
  pol_encoder_layers: 0
  policy: attention
  residual_blocks: 19
  se_ratio: 16
name: 512x19-t80
training:
  batch_size: 1024
  checkpoint_steps: 4000
  diff_focus_min: 0.025
  diff_focus_slope: 3.0
  lookahead_optimizer: true
  lr_boundaries:
  - 100
  lr_values:
  - 0.0004
  - 0.0004
  mask_legal_moves: true
  max_grad_norm: 4.0
  moves_left_loss_weight: 1.0
  num_batch_splits: 2
  num_test_positions: 40000
  path: dev1/networks
  policy_loss_weight: 1.0
  q_ratio: 0.0
  reg_term_weight: 0.05
  renorm: true
  renorm_max_d: 0.0
  renorm_max_r: 1.0
  shuffle_size: 500000
  swa: true
  swa_max_n: 10
  swa_output: true
  swa_steps: 100
  test_steps: 500
  total_steps: 500
  train_avg_report_steps: 200
  validation_steps: 500
  value_loss_weight: 1.0
  warmup_steps: 1000

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
Wrote model to dev1/networks/512x19-t80/512x19-t80-0
Running TensorFlow Graph Passes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  7.15 passes/s]
Converting TF Frontend ==> MIL Ops:  97%|███████████████████████████████████████████████████████████████████████████████████▍  | 960/989 [00:00<00:00, 11277.54 ops/s]
Traceback (most recent call last):
  File "/Users/chinchangyang/Code/lczero-training-ccy/tf/net_to_coreml.py", line 50, in <module>
    coreml_model = ct.convert(
                   ^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/_converters_entry.py", line 574, in convert
    mlmodel = mil_convert(
              ^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 188, in mil_convert
    return _mil_convert(model, convert_from, convert_to, ConverterRegistry, MLModel, compute_units, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 212, in _mil_convert
    proto, mil_program = mil_convert_to_proto(
                         ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 286, in mil_convert_to_proto
    prog = frontend_converter(model, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/converter.py", line 98, in __call__
    return tf2_loader.load()
           ^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/frontend/tensorflow/load.py", line 82, in load
    program = self._program_from_tf_ssa()
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/frontend/tensorflow2/load.py", line 210, in _program_from_tf_ssa
    return converter.convert()
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/frontend/tensorflow/converter.py", line 522, in convert
    self.convert_main_graph(prog, graph)
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/frontend/tensorflow/converter.py", line 421, in convert_main_graph
    outputs = convert_graph(self.context, graph, self.output_names)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/frontend/tensorflow/convert_utils.py", line 191, in convert_graph
    add_op(context, node)
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/frontend/tensorflow/ops.py", line 1332, in RealDiv
    y = mb.cast(x=context[node.inputs[1]], dtype="fp32")
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/registry.py", line 182, in add_op
    return cls._add_op(op_cls_to_add, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/mil/builder.py", line 184, in _add_op
    new_op.type_value_inference()
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/mil/operation.py", line 260, in type_value_inference
    output_vals = self._auto_val(output_types)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/mil/operation.py", line 377, in _auto_val
    vals = self.value_inference()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/mil/operation.py", line 111, in wrapper
    return func(self)
           ^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py", line 868, in value_inference
    return self.get_cast_value(self.x, self.dtype.val)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py", line 894, in get_cast_value
    return input_var.val.astype(dtype=string_to_nptype(dtype_val))
           ^^^^^^^^^^^^^^^^^^^^
AttributeError: 'float' object has no attribute 'astype'

The error message is similar to the one reported in this issue: https://github.com/apple/coremltools/issues/1768
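
The last frame of the traceback calls `.astype` on `input_var.val`, and the message shows that the value was a plain Python float, which has no such method. The following minimal reproduction uses only the standard library; `FakeVar` and `failing_cast` are hypothetical stand-ins written for illustration, not coremltools classes or functions.

```python
class FakeVar:
    """Hypothetical stand-in for a variable that carries a constant value."""

    def __init__(self, val):
        self.val = val


def failing_cast(input_var):
    # Same attribute access as the last traceback frame: this works for
    # NumPy values but not for a built-in Python float.
    return input_var.val.astype("float32")


error_message = None
try:
    failing_cast(FakeVar(8.0))
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
```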

ChinChangYang commented 8 months ago

The AttributeError can be fixed with the following change to the coremltools source code:

% git diff --cached coremltools
diff --git a/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py b/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py
index c5ebc40..fb6902f 100644
--- a/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py
+++ b/coremltools/converters/mil/mil/ops/defs/iOS15/elementwise_unary.py
@@ -890,7 +890,7 @@ class cast(Operation):
                 return np.array(result)
             return None

-        if not types.is_tensor(input_var.sym_type):
-            return input_var.val.astype(dtype=string_to_nptype(dtype_val))
-        else:
+        if isinstance(input_var.val, float) or types.is_tensor(input_var.sym_type):
             return np.array(input_var.val).astype(dtype=string_to_nptype(dtype_val))
+        else:
+            return input_var.val.astype(dtype=string_to_nptype(dtype_val))

I am running the coremltools test suites and will open a pull request in the coremltools GitHub repository. If the pull request is accepted, a future coremltools release should include this fix.
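
The branch logic of the diff can be tried in isolation with NumPy alone. This is only a sketch of the patched condition: `cast_value` is a hypothetical function, and its `is_tensor` argument stands in for `types.is_tensor(input_var.sym_type)`, which is not reproduced here.

```python
import numpy as np


def cast_value(val, np_dtype, is_tensor):
    """Sketch of the patched branch from the diff (not the coremltools code)."""
    if isinstance(val, float) or is_tensor:
        # Wrapping in np.array first gives plain floats (and list-like
        # tensor values) an astype method.
        return np.array(val).astype(np_dtype)
    # Non-float scalars are assumed to be NumPy scalars that already
    # provide astype.
    return val.astype(np_dtype)


from_float = cast_value(8.0, np.float32, is_tensor=False)
from_scalar = cast_value(np.int32(8), np.float32, is_tensor=False)
from_tensor = cast_value([1, 2, 3], np.float32, is_tensor=True)

print(from_float.dtype, from_scalar.dtype, from_tensor.dtype)
```

Since `np.float64` subclasses the built-in `float`, such values also take the first branch, which yields the same numeric result.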

ChinChangYang commented 7 months ago

Test 2: 512x19 (PASSED)

% python net_to_coreml.py --cfg 512x19-t80.yaml-20230507-0216 512x19-t81-swa-10061000.pb.gz
TensorFlow version 2.15.0 has not been tested with coremltools. You may run into unexpected errors. TensorFlow 2.12.0 is the most recent version that has been tested.
dataset:
  allow_less_chunks: true
  input_test:
  - dev1/test/
  input_train:
  - dev1/train/
  input_validation: dev1/validate/
  num_chunks: 3000000
  test_workers: 8
  train_ratio: 0.9
  train_workers: 32
gpu: 0
model:
  default_activation: mish
  filters: 512
  pol_encoder_layers: 0
  policy: attention
  residual_blocks: 19
  se_ratio: 16
name: 512x19-t80
training:
  batch_size: 1024
  checkpoint_steps: 4000
  diff_focus_min: 0.025
  diff_focus_slope: 3.0
  lookahead_optimizer: true
  lr_boundaries:
  - 100
  lr_values:
  - 0.0004
  - 0.0004
  mask_legal_moves: true
  max_grad_norm: 4.0
  moves_left_loss_weight: 1.0
  num_batch_splits: 2
  num_test_positions: 40000
  path: dev1/networks
  policy_loss_weight: 1.0
  q_ratio: 0.0
  reg_term_weight: 0.05
  renorm: true
  renorm_max_d: 0.0
  renorm_max_r: 1.0
  shuffle_size: 500000
  swa: true
  swa_max_n: 10
  swa_output: true
  swa_steps: 100
  test_steps: 500
  total_steps: 500
  train_avg_report_steps: 200
  validation_steps: 500
  value_loss_weight: 1.0
  warmup_steps: 1000

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
/Users/chinchangyang/miniconda3/envs/lczero-training-py3.11/lib/python3.11/site-packages/tensorflow_addons/utils/tfa_eol_msg.py:23: UserWarning: 

TensorFlow Addons (TFA) has ended development and introduction of new features.
TFA has entered a minimal maintenance and release mode until a planned end of life in May 2024.
Please modify downstream libraries to take dependencies from other repositories in our TensorFlow community (e.g. Keras, Keras-CV, and Keras-NLP). 

For more information see: https://github.com/tensorflow/addons/issues/2807 

  warnings.warn(
Wrote model to dev1/networks/512x19-t80/512x19-t80-0
Running TensorFlow Graph Passes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00,  9.77 passes/s]
Converting TF Frontend ==> MIL Ops: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 987/987 [00:00<00:00, 13041.75 ops/s]
Running MIL frontend_tensorflow2 pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 788.02 passes/s]
Running MIL default pipeline: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 71/71 [00:03<00:00, 20.14 passes/s]
Running MIL backend_mlprogram pipeline: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 805.74 passes/s]
Input names: ['input_planes']
Output names: ['output_policy', 'output_value', 'output_moves_left']
Rebuilding model with updated spec ...
Saving model ...
CoreML model saved at dev1/networks/512x19-t80/512x19-t81-swa-10061000.pb.gz.mlpackage

The AttributeError has been resolved in https://github.com/apple/coremltools/pull/2087.

ChinChangYang commented 6 months ago

net_to_model.py is unable to convert the 11248.pb.gz net into a model. The issue is described in https://github.com/LeelaChessZero/lczero-training/issues/224.