Unity-Technologies / ml-agents

The Unity Machine Learning Agents Toolkit (ML-Agents) is an open-source project that enables games and simulations to serve as environments for training intelligent agents using deep reinforcement learning and imitation learning.
https://unity.com/products/machine-learning-agents

Issues converting the model to TFLite #2845

Closed dav1d-wright closed 4 years ago

dav1d-wright commented 4 years ago

Hi

The goal of my current project is to run a model on an embedded system with limited computational resources. Therefore I am currently investigating the possibility of using TFLite for Microcontrollers. According to the command line help, the supported formats for the converter are SavedModel and keras models:

$ tflite_convert --help
usage: tflite_convert [-h] --output_file OUTPUT_FILE
                      (--saved_model_dir SAVED_MODEL_DIR | --keras_model_file KERAS_MODEL_FILE)

Command line tool to run TensorFlow Lite Converter.

optional arguments:
  -h, --help            show this help message and exit
  --output_file OUTPUT_FILE
                        Full filepath of the output file.
  --saved_model_dir SAVED_MODEL_DIR
                        Full path of the directory containing the SavedModel.
  --keras_model_file KERAS_MODEL_FILE
                        Full filepath of HDF5 file containing tf.Keras model.

Is it possible to convert the .nn model into these formats, e.g. with Barracuda?

I am certainly open to other suggestions on how to get the model to run on a bare metal system!

Thanks in advance!

harperj commented 4 years ago

Hi @wrd90 -- I can't speak to whether this will work the way you expect, but the file output as frozen_graph_def.pb in the models folder should be the TensorFlow graph before conversion to the Barracuda .nn format.

dav1d-wright commented 4 years ago

@harperj Thank you. With an older version of TensorFlow it seems possible to convert the frozen graph file.

I have been trying to convert frozen_graph_def.pb but there seem to be some issues with this.

When I try the following: tflite_convert --output_file=model.tflite --graph_def_file=unity/ml-agents/models/ppo-0/RoverLearning/frozen_graph_def.pb --input_arrays=vector_observation,epsilon,recurrent_in,sequence_length --output_arrays=action

I get the output:

2019-11-05 13:59:21.589574: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-05 13:59:21.602134: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fe2bf9e7280 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-05 13:59:21.602151: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Traceback (most recent call last):
  File "/Users/dwright/anaconda3/envs/mlagents/bin/tflite_convert", line 8, in <module>
    sys.exit(main())
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/tflite_convert.py", line 515, in main
    app.run(main=run_main, argv=sys.argv[:1])
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/tflite_convert.py", line 511, in run_main
    _convert_tf1_model(tflite_flags)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/tflite_convert.py", line 199, in _convert_tf1_model
    output_data = converter.convert()
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/lite.py", line 898, in convert
    self._set_batch_size(batch_size=1)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/lite.py", line 1032, in _set_batch_size
    shape = tensor.shape.as_list()
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/python/framework/tensor_shape.py", line 1171, in as_list
    raise ValueError("as_list() is not defined on an unknown TensorShape.")
ValueError: as_list() is not defined on an unknown TensorShape.

Starting from this traceback, I added some debug messages to lite.py and found that the issue is with the variable sequence_length, which is defined in the class LearningModel as:

        self.sequence_length = tf.placeholder(
            shape=None, dtype=tf.int32, name="sequence_length"
        )

I have tried redefining shape=1 or shape=[], but nothing seems to work here.

Do you have an idea how to resolve this?

dav1d-wright commented 4 years ago

If I pass the shapes to the command like this:

tflite_convert --output_file=model.tflite --graph_def_file=unity/ml-agents/models/ppo-0/RoverLearning/frozen_graph_def.pb --input_arrays=vector_observation,epsilon,recurrent_in,sequence_length --output_arrays=action --input_shapes=1,7:1,2:7,40:1

I get a similar error:

2019-11-05 14:25:44.658772: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-11-05 14:25:44.677569: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fee79ed3960 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-11-05 14:25:44.677591: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2019-11-05 14:25:44.741840: I tensorflow/core/grappler/devices.cc:60] Number of eligible GPUs (core count >= 8, compute capability >= 0.0): 0 (Note: TensorFlow was not compiled with CUDA support)
2019-11-05 14:25:44.741948: I tensorflow/core/grappler/clusters/single_machine.cc:356] Starting new session
2019-11-05 14:25:44.819310: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:786] Optimization results for grappler item: graph_to_optimize
2019-11-05 14:25:44.819337: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788]   constant_folding: Graph size after: 304 nodes (-81), 371 edges (-83), time = 56.212ms.
2019-11-05 14:25:44.819342: I tensorflow/core/grappler/optimizers/meta_optimizer.cc:788]   constant_folding: Graph size after: 304 nodes (0), 371 edges (0), time = 8.713ms.
Traceback (most recent call last):
  File "/Users/dwright/anaconda3/envs/mlagents/bin/tflite_convert", line 8, in <module>
    sys.exit(main())
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/tflite_convert.py", line 515, in main
    app.run(main=run_main, argv=sys.argv[:1])
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/tflite_convert.py", line 511, in run_main
    _convert_tf1_model(tflite_flags)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/tflite_convert.py", line 199, in _convert_tf1_model
    output_data = converter.convert()
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/lite.py", line 983, in convert
    **converter_kwargs)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/convert.py", line 449, in toco_convert_impl
    enable_mlir_converter=enable_mlir_converter)
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/python/convert.py", line 200, in toco_convert_protos
    raise ConverterError("See console for info.\n%s\n%s\n" % (stdout, stderr))
tensorflow.lite.python.convert.ConverterError: See console for info.
2019-11-05 14:25:46.811578: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811623: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811639: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811649: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811675: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811685: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811695: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811705: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.811986: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812009: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812022: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812033: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812378: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayV3
2019-11-05 14:25:46.812411: I tensorflow/lite/toco/import_tensorflow.cc:193] Unsupported data type in placeholder op: 20
2019-11-05 14:25:46.812433: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayV3
2019-11-05 14:25:46.812448: I tensorflow/lite/toco/import_tensorflow.cc:193] Unsupported data type in placeholder op: 20
2019-11-05 14:25:46.812468: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812497: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayV3
2019-11-05 14:25:46.812507: I tensorflow/lite/toco/import_tensorflow.cc:193] Unsupported data type in placeholder op: 20
2019-11-05 14:25:46.812525: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812544: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812559: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812567: I tensorflow/lite/toco/import_tensorflow.cc:193] Unsupported data type in placeholder op: 20
2019-11-05 14:25:46.812580: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812588: I tensorflow/lite/toco/import_tensorflow.cc:193] Unsupported data type in placeholder op: 20
2019-11-05 14:25:46.812613: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayScatterV3
2019-11-05 14:25:46.812631: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812640: I tensorflow/lite/toco/import_tensorflow.cc:193] Unsupported data type in placeholder op: 20
2019-11-05 14:25:46.812661: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayScatterV3
2019-11-05 14:25:46.812680: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812694: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812708: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812724: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Enter
2019-11-05 14:25:46.812751: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: LoopCond
2019-11-05 14:25:46.812763: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: LoopCond
2019-11-05 14:25:46.812864: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: Exit
2019-11-05 14:25:46.812928: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayReadV3
2019-11-05 14:25:46.812951: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArraySizeV3
2019-11-05 14:25:46.813037: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayReadV3
2019-11-05 14:25:46.813111: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayGatherV3
2019-11-05 14:25:46.813301: I tensorflow/lite/toco/import_tensorflow.cc:659] Converting unsupported operation: TensorArrayWriteV3
2019-11-05 14:25:46.816417: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before Removing unused ops: 185 operators, 322 arrays (0 quantized)
2019-11-05 14:25:46.818204: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] After Removing unused ops pass 1: 103 operators, 180 arrays (0 quantized)
2019-11-05 14:25:46.819853: I tensorflow/lite/toco/graph_transformations/graph_transformations.cc:39] Before general graph transformations: 103 operators, 180 arrays (0 quantized)
2019-11-05 14:25:46.819987: F tensorflow/lite/toco/graph_transformations/propagate_fixed_sizes.cc:1616] Check failed: *packed_shape == shape All input arrays to Pack operators must have the same shape. Input "sequence_length" is different.
Fatal Python error: Aborted

Current thread 0x0000000111d5b5c0 (most recent call first):
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/toco/python/toco_from_protos.py", line 52 in execute
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/absl/app.py", line 250 in _run_main
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/absl/app.py", line 299 in run
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/python/platform/app.py", line 40 in run
  File "/Users/dwright/anaconda3/envs/mlagents/lib/python3.6/site-packages/tensorflow_core/lite/toco/python/toco_from_protos.py", line 89 in main
  File "/Users/dwright/anaconda3/envs/mlagents/bin/toco_from_protos", line 8 in <module>
dav1d-wright commented 4 years ago

Update: if I disable the RNN by setting use_recurrent:false the conversion works, but unfortunately I need a recurrent network.
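For anyone following along, the CLI flags above map onto the TF 1.x Python converter API as well; a minimal sketch (the graph path, tensor names, and shapes are from my setup and are assumptions for anyone else's model):

```python
import tensorflow as tf

def convert_frozen_graph(pb_path, out_path):
    # Mirrors the tflite_convert invocation above; the tensor names and
    # shapes below come from my RoverLearning model and will differ for
    # other models.
    converter = tf.compat.v1.lite.TFLiteConverter.from_frozen_graph(
        graph_def_file=pb_path,
        input_arrays=["vector_observation", "epsilon",
                      "recurrent_in", "sequence_length"],
        output_arrays=["action"],
        input_shapes={
            "vector_observation": [1, 7],
            "epsilon": [1, 2],
            "recurrent_in": [7, 40],
            "sequence_length": [1],
        },
    )
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
```

With use_recurrent:false this converts; with the RNN enabled it fails exactly as shown in the logs above.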

harperj commented 4 years ago

Hi @wrd90 -- I'm not sure why TFLite wouldn't convert this model. It sounds like you're right that it has something to do with the unspecified / variable shape for the sequence. You might try posting an issue for TFLite since they'd have more information about what is and isn't possible to convert.

dav1d-wright commented 4 years ago

Thank you @harperj, I posted an issue to the TensorFlow repo; I hope I'll find a solution :)

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 14 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

dav1d-wright commented 4 years ago

Thanks @stale bot for reminding me to update my findings here.

According to this site, RNNs are not yet supported, but support is planned "by the end of 2019". I expect this support to be added only to TF2, so I reckon ML-Agents will need to migrate to TF2 as well before this is possible... To my knowledge this is a significant piece of work, especially with regard to saving models and graphs. So far I have not been able to convert the frozen graph (which is no longer supported in TF2) to a SavedModel, because a fair bit of coding is necessary in the model generation. Due to time constraints on my work I had to move on to another solution for now.
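For reference, the wrapping I attempted follows the usual TF1-compat pattern of importing the frozen GraphDef into a session and exporting it with simple_save; a sketch (tensor names are hypothetical, and this does not solve the unsupported-RNN-op problem downstream):

```python
import tensorflow as tf

def frozen_graph_to_saved_model(pb_path, export_dir, input_names, output_names):
    """Wrap a frozen GraphDef in a SavedModel via the TF1 compat API.

    input_names / output_names must match the tensor op names in the
    frozen graph (assumed known here).
    """
    graph_def = tf.compat.v1.GraphDef()
    with tf.io.gfile.GFile(pb_path, "rb") as f:
        graph_def.ParseFromString(f.read())
    with tf.Graph().as_default() as graph:
        tf.import_graph_def(graph_def, name="")
        with tf.compat.v1.Session(graph=graph) as sess:
            inputs = {n: graph.get_tensor_by_name(n + ":0") for n in input_names}
            outputs = {n: graph.get_tensor_by_name(n + ":0") for n in output_names}
            tf.compat.v1.saved_model.simple_save(sess, export_dir, inputs, outputs)
```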

harperj commented 4 years ago

Thanks for the update @wrd90. We're keeping an eye on TF2 and evaluating when it would make sense to upgrade. Good luck with your project!

dav1d-wright commented 4 years ago

Thank you @harperj :) For now I'm pursuing a workaround where I manually buffer past observations and actions and feed them into the network. Maybe this will help other people in the meantime.
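The buffering idea can be sketched in plain Python; a minimal version (the sizes and history length are illustrative, and whether stacked history is an adequate substitute for the recurrent state depends on the task):

```python
from collections import deque

import numpy as np

class ObservationBuffer:
    """Keep the last `history` observation/action pairs so a feed-forward
    network can be fed a stacked history instead of recurrent state."""

    def __init__(self, obs_size, act_size, history=4):
        # pre-fill with zeros so stacked() has a fixed size from the start
        self.obs = deque([np.zeros(obs_size)] * history, maxlen=history)
        self.act = deque([np.zeros(act_size)] * history, maxlen=history)

    def push(self, observation, action):
        # deque with maxlen drops the oldest entry automatically
        self.obs.append(np.asarray(observation, dtype=np.float32))
        self.act.append(np.asarray(action, dtype=np.float32))

    def stacked(self):
        # flat vector: [obs_{t-h+1}, act_{t-h+1}, ..., obs_t, act_t]
        return np.concatenate(
            [np.concatenate([o, a]) for o, a in zip(self.obs, self.act)]
        )
```

The stacked vector then becomes the (fixed-shape) input of a non-recurrent network, which converts to TFLite without the RNN ops.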

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had activity in the last 14 days. It will be closed in the next 14 days if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue has been automatically closed because it has not had activity in the last 28 days. If this issue is still valid, please ping a maintainer. Thank you for your contributions.

github-actions[bot] commented 3 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.