aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Yolov4 neuron compilation with custom model raises: ValueError: Input 1 of node StatefulPartitionedCall was passed float from conv2d/kernel:0 incompatible with expected resource. #191

Closed ferranmartinezlleida closed 3 years ago

ferranmartinezlleida commented 4 years ago

Hello, thanks for all your nice tutorials on the SDK. I've followed several of them without any problems (the ones from the Docker part and ResNet50), but when I tried to follow the YOLOv4 Neuron compilation tutorial here: https://github.com/aws/aws-neuron-sdk/blob/master/src/examples/tensorflow/yolo_v4_demo/evaluate.ipynb, I encountered an error I'm not able to solve.

I've tried to use this script, with some modifications to the directories so that they point to my custom model:

import shutil
import tensorflow as tf
import tensorflow.neuron as tfn

def no_fuse_condition(op):
    return any(op.name.startswith(pat) for pat in ['reshape', 'lambda_1/Cast', 'lambda_2/Cast', 'lambda_3/Cast'])

with tf.Session(graph=tf.Graph()) as sess:
    tf.saved_model.loader.load(sess, ['serve'], './ws_yolov4/yolov4-416')
    no_fuse_ops = [op.name for op in sess.graph.get_operations() if no_fuse_condition(op)]
shutil.rmtree('./ws_yolov4/yolov4-416_neuron', ignore_errors=True)
result = tfn.saved_model.compile(
    './ws_yolov4/yolov4-416', './ws_yolov4/yolov4-416_neuron',
    # we partition the graph before casting from float16 to float32, to help reduce the output tensor size by 1/2
    no_fuse_ops=no_fuse_ops,
    # to enforce trivial compilable subgraphs to run on CPU
    minimum_segment_size=100,
    batch_size=1,
    dynamic_batch_size=True,
)
print(result)
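
For readers unfamiliar with the filter in the script above, here is a minimal sketch of what `no_fuse_condition` selects. It assumes a hypothetical `FakeOp` stand-in and made-up op names instead of a real TensorFlow graph, so it runs without TensorFlow; only the prefix list comes from the script.

```python
from collections import namedtuple

# Hypothetical stand-in for tf.Operation; only the .name attribute is used.
FakeOp = namedtuple("FakeOp", ["name"])

# Prefixes copied from the script above.
NO_FUSE_PREFIXES = ("reshape", "lambda_1/Cast", "lambda_2/Cast", "lambda_3/Cast")


def no_fuse_condition(op):
    # str.startswith accepts a tuple of prefixes, which is equivalent
    # to the any(...) expression used in the original script.
    return op.name.startswith(NO_FUSE_PREFIXES)


# Made-up op names, for illustration only.
ops = [
    FakeOp("conv2d/Conv2D"),
    FakeOp("reshape/Reshape"),
    FakeOp("lambda_1/Cast"),
    FakeOp("lambda_10/Mul"),
]
no_fuse_ops = [op.name for op in ops if no_fuse_condition(op)]
print(no_fuse_ops)
```

The match is a case-sensitive prefix test, so an op named `Reshape/Reshape` would not be selected. If a custom model names its layers differently from the tutorial model, the resulting list may be empty.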

When I execute the script I get the following error:

ValueError: Input 1 of node StatefulPartitionedCall was passed float from conv2d/kernel:0 incompatible with expected resource.

I also tried modifying the compile_resnet50 script to put my custom model through it, but I get the same error:

import os
import time
import shutil
import tensorflow as tf
import tensorflow.neuron as tfn
import tensorflow.compat.v1.keras as keras
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

WORKSPACE = './ws_yolov4'
os.makedirs(WORKSPACE, exist_ok=True)

model_dir = os.path.join(WORKSPACE, 'yolov4-416')
compiled_model_dir = os.path.join(WORKSPACE, 'yolov4-416_neuron')

keras.backend.set_learning_phase(0)
keras.backend.set_image_data_format('channels_last')

tfn.saved_model.compile(model_dir, compiled_model_dir)

shutil.make_archive('./resnet50_neuron', 'zip', WORKSPACE, 'resnet50_neuron')

What I can confirm is that the .pb model at yolov4-416 works; I've been able to run detections with it. The characteristics of the model are:

classes = 3
anchors = yolov4 default
width,height = 416
filters (in convolutional layer before yolo layers) = 24

As for the environment, I'm using Ubuntu 18.04 on an inf1.xlarge machine, with all the steps described here done and verified: https://github.com/aws/aws-neuron-sdk/blob/master/docs/neuron-install-guide.md

Maybe I'm doing something wrong when I load the model prior to compilation. Could you give me some directions? Maybe I have to perform some preliminary steps on my model before I start the compilation, or maybe I'm loading it incorrectly. Thank you!

The full error trace for the first piece of code:

Traceback (most recent call last):
  File "graph_yolov4.py", line 20, in <module>
    dynamic_batch_size=True,
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_neuron/python/saved_model.py", line 163, in convert_to_inference_model
    protected_op_names=saved_model_main_op, **kwargs)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_neuron/python/graph_util.py", line 243, in inference_graph_from_session
    graph = _graph_def_to_graph(graph_def)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_neuron/python/graph_util.py", line 460, in _graph_def_to_graph
    importer.import_graph_def(graph_def, name='')
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_core/python/framework/importer.py", line 505, in _import_graph_def_internal
    raise ValueError(str(e))
ValueError: Input 1 of node StatefulPartitionedCall was passed float from conv2d/kernel:0 incompatible with expected resource.

The error trace for the second piece of code:

Traceback (most recent call last):
  File "yolo_compile.py", line 23, in <module>
    tfn.saved_model.compile(model_dir, compiled_model_dir)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_neuron/python/saved_model.py", line 163, in convert_to_inference_model
    protected_op_names=saved_model_main_op, **kwargs)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_neuron/python/graph_util.py", line 243, in inference_graph_from_session
    graph = _graph_def_to_graph(graph_def)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_neuron/python/graph_util.py", line 460, in _graph_def_to_graph
    importer.import_graph_def(graph_def, name='')
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_core/python/framework/importer.py", line 405, in import_graph_def
    producer_op_list=producer_op_list)
  File "/home/ubuntu/test_venv/lib/python3.7/site-packages/tensorflow_core/python/framework/importer.py", line 505, in _import_graph_def_internal
    raise ValueError(str(e))
ValueError: Input 1 of node StatefulPartitionedCall was passed float from conv2d/kernel:0 incompatible with expected resource.

awsryg commented 4 years ago

Hi ferranmartinezlleida, thank you for raising this. We are looking into it.

ferranmartinezlleida commented 4 years ago

@awsryg Thank you, I appreciate that

awsryg commented 4 years ago

Hi ferranmartinezlleida, can you please try saving the model in TF 1.x format rather than 2.0?

ferranmartinezlleida commented 4 years ago

@awsryg Could be that, I saved it in TF 2.x. I'll try it and I'll let you know the result.
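
One rough way to check which format a SavedModel uses is to look at the op types stored in its saved_model.pb. The sketch below assumes that ops such as StatefulPartitionedCall and VarHandleOp hint at a TF 2.x-style export. This is only a heuristic, not an official check, since TF 1.x graphs can also use resource variables. The helper names `tf2_style_ops` and `saved_model_op_types` are hypothetical, and the second helper needs TensorFlow installed.

```python
# Op types that, as an assumption, suggest a TF 2.x-style export.
TF2_STYLE_OPS = frozenset(
    {"StatefulPartitionedCall", "PartitionedCall", "VarHandleOp", "ReadVariableOp"}
)


def tf2_style_ops(op_types):
    """Return the sorted op types that hint at a TF 2.x-style export."""
    return sorted(set(op_types) & TF2_STYLE_OPS)


def saved_model_op_types(export_dir):
    """Collect all op types found in <export_dir>/saved_model.pb."""
    import os
    from tensorflow.core.protobuf import saved_model_pb2

    saved_model = saved_model_pb2.SavedModel()
    with open(os.path.join(export_dir, "saved_model.pb"), "rb") as f:
        saved_model.ParseFromString(f.read())

    op_types = set()
    for meta_graph in saved_model.meta_graphs:
        graph_def = meta_graph.graph_def
        # Ops in the main graph.
        op_types.update(node.op for node in graph_def.node)
        # Ops inside the function library (where TF 2.x keeps most of the model).
        for function in graph_def.library.function:
            op_types.update(node.op for node in function.node_def)
    return op_types


# Illustration with made-up op lists (no TensorFlow needed):
print(tf2_style_ops(["Placeholder", "VarHandleOp", "StatefulPartitionedCall"]))
print(tf2_style_ops(["Placeholder", "VariableV2", "Conv2D"]))
```

On a real model, `tf2_style_ops(saved_model_op_types('./ws_yolov4/yolov4-416'))` would list any such ops. A non-empty result would be consistent with the "expected resource" error, where a node expects a resource handle but receives a float tensor.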

ferranmartinezlleida commented 4 years ago

Hi @awsryg, I tried compiling a model downloaded from TensorFlow Hub that was created with TF 1.x and still ran into problems, although a different one this time:

ValueError: batch_size is not sufficient to determine the shape of input tensor Tensor("hub_input/image_tensor:0", shape=(1, ?, ?, 3), dtype=float32)

This is the model used: https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1
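
The batch_size error above can be read as follows: a batch size can only fill in an unknown leading dimension, and here the unknown dimensions are height and width. The sketch below illustrates that reasoning with a hypothetical helper, `resolve_input_shape`. It is not the Neuron SDK's code, and `None` stands in for the `?` dimensions in the message.

```python
def resolve_input_shape(shape, batch_size):
    """Fill an unknown leading (batch) dimension; None marks an unknown dimension."""
    resolved = list(shape)
    if resolved and resolved[0] is None:
        resolved[0] = batch_size
    # Any dimension that is still unknown cannot be inferred from batch_size.
    unknown = [i for i, dim in enumerate(resolved) if dim is None]
    if unknown:
        raise ValueError(
            "batch_size is not sufficient to determine the shape; "
            "unknown dimensions at positions %s" % unknown
        )
    return tuple(resolved)


# Only the batch dimension is unknown, so batch_size resolves the shape.
print(resolve_input_shape((None, 224, 224, 3), batch_size=1))

# Height and width are unknown, as in shape=(1, ?, ?, 3) from the error.
try:
    resolve_input_shape((1, None, None, 3), batch_size=1)
except ValueError as err:
    print(err)
```

Under this reading, the model needs fixed spatial dimensions in its input signature before the compiler can determine the full input shape.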

But I was able to overcome it by modifying the compile script:

import os
import time
import shutil
import tensorflow as tf
import tensorflow.neuron as tfn
import tensorflow.compat.v1.keras as keras
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

WORKSPACE = './ws_yolov4'
os.makedirs(WORKSPACE, exist_ok=True)

model_dir = os.path.join(WORKSPACE, 'saved_model-t1')
compiled_model_dir = os.path.join(WORKSPACE, 'saved_model_neuron')

keras.backend.set_learning_phase(0)
keras.backend.set_image_data_format('channels_last')

model_saved_dir = "./ws_yolov4/d_saved_model-t1"

with tf.Session() as sess:

        _ = tf.compat.v1.saved_model.load(sess,tags=[],export_dir="./ws_yolov4/saved_model-t1")
        zeros_input = tf.compat.v1.keras.initializers.Zeros(dtype=tf.dtypes.float32)(shape=(1,224,224,3))
        zeros_output = tf.compat.v1.keras.initializers.Zeros(dtype=tf.dtypes.float32)(shape=(1,1000))

        tf.saved_model.simple_save(
                    session            = sess,
                    export_dir         = model_saved_dir,
                    inputs             = {'input': zeros_input},
                    outputs            = {'output': zeros_output})

tfn.saved_model.compile(model_saved_dir, compiled_model_dir)

shutil.make_archive('./saved_model-neuron', 'zip', WORKSPACE, 'saved_model_neuron')

Finally I managed to compile it! But the operators of the model are not yet supported by neuron-sdk:

WARNING:tensorflow:Converted ./ws_yolov4/d_saved_model-t1 to ./ws_yolov4/saved_model_neuron but no operator will be running on AWS machine learning accelerators. This is probably not what you want. Please refer to https://github.com/aws/aws-neuron-sdk for current limitations of the AWS Neuron SDK. We are actively improving (and hiring)!

I will try to save my custom model with this format and correct operators and see what happens.

One more thing. During the process I also tried to compile another model:

https://tfhub.dev/google/object_detection/mobile_object_localizer_v1/1

and got this other error:

ValueError: Node 'Postprocessor/BatchMultiClassNonMaxSuppression/map/TensorArrayUnstack_1/TensorArrayScatter/TensorArrayScatterV3' expects to be colocated with unknown node 'Postprocessor/raw_box_scores'

So I guess the way you save the model is very important for the SDK to be able to compile it.

aws-zejdaj commented 4 years ago

@ferranmartinezlleida The TensorArrayScatterV3 operator is not supported by neuron-sdk (it runs in the framework, on the CPU). You can see the list of supported operators by running neuron-cc list-operators --framework TENSORFLOW

However, other parts of the model should still have been accelerated. (They are not accelerated when compilation fails, because execution then falls back to the framework.) Can you share your saved model and compilation log file?
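
The operator check mentioned above can be sketched as a set difference between the op types in a graph and the supported list. The sketch assumes that the listing is plain text with one operator name per line, which may not match the real output of neuron-cc. The sample listing is made up and is not real command output, and `unsupported_op_types` is a hypothetical helper.

```python
def unsupported_op_types(graph_op_types, supported_listing):
    """Return sorted op types from the graph that are absent from the listing."""
    # Assumption: one operator name per line, blank lines ignored.
    supported = {
        line.strip() for line in supported_listing.splitlines() if line.strip()
    }
    return sorted(set(graph_op_types) - supported)


# Made-up listing, for illustration only. A real listing could be captured with:
#   subprocess.run(["neuron-cc", "list-operators", "--framework", "TENSORFLOW"],
#                  capture_output=True, text=True).stdout
sample_listing = """Conv2D
Relu
MatMul
"""

print(unsupported_op_types(["Conv2D", "TensorArrayScatterV3", "Relu"], sample_listing))
```

Any op type reported this way would be expected to run in the framework on the CPU rather than on the accelerator.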

jeffhataws commented 3 years ago

Hi ferranmartinezlleida, I have reproduced the issue with https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1 and https://tfhub.dev/google/object_detection/mobile_object_localizer_v1/1 and we are investigating.

ferranmartinezlleida commented 3 years ago

Thank you very much! @jeffhataws I'll keep an eye on this issue for your updates. @aws-zejdaj I'm sorry, unfortunately I can't share the model. Moreover, I've been assigned to investigate other things, so I'm not working on this issue full time at the moment.

mrnikwaws commented 3 years ago

Hi @ferranmartinezlleida - we are continuing to work on this issue and will provide an update when we have made more progress

mrnikwaws commented 3 years ago

Hi @ferranmartinezlleida,

As a follow-up, the team is working on object detection models, which you can find on our roadmap here: https://github.com/aws/aws-neuron-sdk/projects/2. In particular, our work on Faster RCNN may be of interest, since it is related to https://tfhub.dev/google/faster_rcnn/openimages_v4/inception_resnet_v2/1. This work is still in progress.

I’m assuming that our YOLOv4 example did not meet your needs?

ferranmartinezlleida commented 3 years ago

No, I couldn't get my custom model through it. I don't know if I did something wrong; I'm not an expert on TensorFlow, I was just exploring options for a project at my company. What I can assure you, though, is that the model worked: I was able to run correct inferences in Python.

aws-taylor commented 3 years ago

Hello @ferranmartinezlleida,

I understand. In that case, I'd suggest you 'Watch' our associated roadmap item for Faster RCNN - https://github.com/aws/aws-neuron-sdk/issues/153 as it progresses. If you are able to give some more specifics related to the error messages you saw related to YOLOv4 we would also be happy to assist with those.

Regards, Taylor

aws-zejdaj commented 3 years ago

@ferranmartinezlleida Please reopen this issue if you are still running into the problem, and provide us with the specific error and a test case.