aws-neuron / aws-neuron-sdk

Powering AWS purpose-built machine learning chips. Blazing fast and cost effective, natively integrated into PyTorch and TensorFlow and integrated with your favorite AWS services
https://aws.amazon.com/machine-learning/neuron/

Runtime Error: torch.neuron.trace model #470

Closed: di0002ya closed this issue 2 years ago

di0002ya commented 2 years ago

Hi team, I tried to compile a model with PyTorch Neuron so that I can deploy it to Inferentia.

I first ran: torch.neuron.analyze_model(det_model, inputs)

The output shows that all operations are supported.

INFO:Neuron:100.00% of all operations (including primitives) (1217 of 1217) are supported
INFO:Neuron:100.00% of arithmetic operations (160 of 160) are supported

However, when I ran model_neuron = torch.neuron.trace(det_model, inputs), I got a runtime error. How can I solve it? Thanks in advance!

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_2684/3971781750.py in <module>
      1 # that that is optimized for the Inferentia hardware
----> 2 model_neuron = torch.neuron.trace(det_model,  inputs )
      3 # The output of the compilation step will report the percentage of operators that
      4 # are compiled to Neuron, for example:
      5 #

~/.pyenv/versions/3.7.13/envs/neuron/lib/python3.7/site-packages/torch_neuron/convert.py in trace(func, example_inputs, fallback, op_whitelist, minimum_segment_size, subgraph_builder_function, subgraph_inputs_pruning, skip_compiler, debug_must_trace, allow_no_ops_on_neuron, compiler_workdir, dynamic_batch_size, compiler_timeout, _neuron_trace, compiler_args, optimizations, verbose, **kwargs)
    166     with skip_inference_context() as s:
    167         neuron_graph = cu.compile_fused_operators(neuron_graph, **compile_kwargs)
--> 168     cu.stats_post_compiler(neuron_graph)
    169 
    170     # Wrap the compiled version of the model in a script module. Note that this is

~/.pyenv/versions/3.7.13/envs/neuron/lib/python3.7/site-packages/torch_neuron/convert.py in stats_post_compiler(self, neuron_graph)
    498         if succesful_compilations == 0 and not self.allow_no_ops_on_neuron:
    499             raise RuntimeError(
--> 500                 "No operations were successfully partitioned and compiled to neuron for this model - aborting trace!")
    501 
    502         if percent_operations_compiled < 50.0:

RuntimeError: No operations were successfully partitioned and compiled to neuron for this model - aborting trace!
ERROR:Neuron:neuron-cc failed with the following command line call:
/workspace/.pyenv/versions/3.7.13/envs/neuron/bin/neuron-cc compile /tmp/tmpx7n7uwpf/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpx7n7uwpf/graph_def.neff --io-config '{"inputs": {}, "outputs": []}' --verbose 35
Traceback (most recent call last):
  File "/workspace/.pyenv/versions/3.7.13/envs/neuron/lib/python3.7/site-packages/torch_neuron/convert.py", line 389, in op_converter
    item, inputs, compiler_workdir=sg_workdir, **kwargs)
  File "/workspace/.pyenv/versions/3.7.13/envs/neuron/lib/python3.7/site-packages/torch_neuron/decorators.py", line 221, in trace
    'neuron-cc failed with the following command line call:\n{}'.format(command))
subprocess.SubprocessError: neuron-cc failed with the following command line call:
/workspace/.pyenv/versions/3.7.13/envs/neuron/bin/neuron-cc compile /tmp/tmpx7n7uwpf/graph_def.pb --framework TENSORFLOW --pipeline compile SaveTemps --output /tmp/tmpx7n7uwpf/graph_def.neff --io-config '{"inputs": {}, "outputs": []}' --verbose 35

aws-taylor commented 2 years ago

Hello @di0002ya,

The issue you are running into is related to the empty --io-config '{"inputs": {}, "outputs": []}'. Are you able to share the commands necessary to reproduce this issue?
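
One way to confirm this from a log is to pull the --io-config argument out of the neuron-cc command line and check whether it names any inputs or outputs. The snippet below is only a minimal sketch using the Python standard library; extract_io_config and io_config_is_empty are hypothetical helper names, not part of the Neuron SDK, and the command string is the one from the error log with the binary path shortened.

```python
import json
import shlex


def extract_io_config(command_line):
    """Return the parsed --io-config JSON from a neuron-cc command line, or None."""
    tokens = shlex.split(command_line)
    for index, token in enumerate(tokens):
        if token == "--io-config" and index + 1 < len(tokens):
            return json.loads(tokens[index + 1])
    return None


def io_config_is_empty(io_config):
    """True when the config names no inputs and no outputs."""
    return not io_config.get("inputs") and not io_config.get("outputs")


# Command line from the error log in this issue (binary path shortened).
command = (
    "neuron-cc compile /tmp/tmpx7n7uwpf/graph_def.pb --framework TENSORFLOW "
    "--pipeline compile SaveTemps --output /tmp/tmpx7n7uwpf/graph_def.neff "
    "--io-config '{\"inputs\": {}, \"outputs\": []}' --verbose 35"
)
config = extract_io_config(command)
print(config, io_config_is_empty(config))
```

If the check reports an empty config, the subgraph handed to the compiler had no tensor inputs or outputs, which is worth comparing against what the model's forward method actually does with its input tensor.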

-Taylor

di0002ya commented 2 years ago

Hi @aws-taylor, thanks for your help! I reorganized my model wrapper yesterday:

import logging
import warnings
from ts.torch_handler.base_handler import BaseHandler
import os
import torch
import numpy as np
from collections import OrderedDict
import base64
import io
from PIL import Image
import math
import cv2
import json
from mmocr.apis import init_detector, model_inference
from mmocr.utils.model import revert_sync_batchnorm
from mmocr.datasets.pipelines.crop import crop_img
from mmcv import imfrombytes
import torch.nn as nn
from torchvision import transforms  # needed by WrapDet.forward (transforms.ToPILImage)

def iminvert(img):
    """Invert (negate) an image.

    Args:
        img (ndarray): Image to be inverted.

    Returns:
        ndarray: The inverted image.
    """
    return np.full_like(img, 255) - img

class WrapDet(nn.Module):
    def __init__(self):
        super(WrapDet, self).__init__()
        model_dir = 'mmocr'
        # det_ckpt = "checkpoints/textdet/dbnetpp/dbnetpp_r50dcnv2_fpnc_1200e_icdar2015-20220502-d7a76fff.pth"
        det_ckpt = "checkpoints/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597.pth"
        recog_ckpt = "checkpoints/textrec/abinet_academic-f718abf6.pth"
        det_config = "dbr18config.py"
        recog_config = "abinet_recog_config.py"

        det_ckpt = os.path.join(model_dir, det_ckpt)     
        det_config = os.path.join(model_dir, det_config) 

        recog_model = init_detector(recog_config, recog_ckpt, device="cpu")
        self.recog_model = revert_sync_batchnorm(recog_model)
        detect_model = init_detector(det_config, det_ckpt, device="cpu")
        self.detect_model = revert_sync_batchnorm(detect_model)

        self.batch_mode = False
        self.recog_batch_size = 1

    def forward(self, inputs):
        # Convert the input tensor to a PIL image, then invert it as a NumPy array.
        image = transforms.ToPILImage()(inputs)
        arr = iminvert(image)
        det_result = self.single_inference(self.detect_model, [arr], batch_mode=False, batch_size=0)

        # text recognition  
        bboxes_list = [res['boundary_result'] for res in det_result]
        print(f"total bboxes list:{len(bboxes_list)}")
        bboxes = bboxes_list[0]
        end2end_res = []
        img_e2e_res = {} 
        img_e2e_res['result'] = []
        box_imgs = []
        for bbox in bboxes:
            box_res = {}
            box_res['box'] = [round(x) for x in bbox[:-1]]
            box_res['box_score'] = float(bbox[-1])
            box = bbox[:8]
            if len(bbox) > 9:
                min_x = min(bbox[0:-1:2])
                min_y = min(bbox[1:-1:2])
                max_x = max(bbox[0:-1:2])
                max_y = max(bbox[1:-1:2])
                box = [
                    min_x, min_y, max_x, min_y, max_x, max_y, min_x, max_y
                ]
            box_img = crop_img(arr, box)
            if self.batch_mode:
                box_imgs.append(box_img)
            else:
                recog_result = model_inference(self.recog_model, box_img)
                text = recog_result['text']
                text_score = recog_result['score']
                if isinstance(text_score, list):
                    text_score = sum(text_score) / max(1, len(text))
                box_res['text'] = text
                box_res['text_score'] = text_score
            img_e2e_res['result'].append(box_res)

        if self.batch_mode:
            recog_results = self.single_inference(
                self.recog_model, box_imgs, batch_mode=self.batch_mode, batch_size=self.recog_batch_size)
            for i, recog_result in enumerate(recog_results):
                text = recog_result['text']
                text_score = recog_result['score']
                if isinstance(text_score, (list, tuple)):
                    text_score = sum(text_score) / max(1, len(text))
                img_e2e_res['result'][i]['text'] = text
                img_e2e_res['result'][i]['text_score'] = text_score
        end2end_res.append(img_e2e_res) 

        # return tuple(end2end_res[0]['result'])
        return (torch.FloatTensor([end2end_res[0]['result'][0]['box_score']]))

    def single_inference(self, model, arrays, batch_mode, batch_size):
        result = []
        if batch_mode:
            if batch_size == 0:
                result = model_inference(model, arrays, batch_mode=True)
            else:
                n = batch_size
                arr_chunks = [
                    arrays[i:i + n] for i in range(0, len(arrays), n)
                ]
                for chunk in arr_chunks:
                    result.extend(model_inference(model, chunk, batch_mode=True))
        else:
            for arr in arrays:
                result.append(model_inference(model, arr, batch_mode=False))
        return result

Before conversion, I tested det_model(inputs) without any error. Then I proceeded to model conversion: model_neuron = torch.neuron.trace(det_model, example_inputs=[inputs], strict=False)

The logs are in the attached file: inference_check.html.zip

di0002ya commented 2 years ago

After running model_neuron = torch.neuron.trace(det_model, example_inputs=[inputs], strict=False), the trace finishes without aborting the process, but it shows these warnings:

WARNING:Neuron:torch.neuron.trace was unable to compile > 50% of the operators in the compiled model!
WARNING:Neuron:Please review the torch.neuron.analyze_model output and if you believe you are seeing a failure
WARNING:Neuron:Lodge an issue on https://github.com/aws/aws-neuron-sdk/issues if you believe the model is not compiling as expected

However, when I tried to run inference on the Neuron model using the following commands:

model_neuron.eval()
model_neuron(inputs)

I got the following log:

2022-Aug-18 08:22:45.0374  5279:5279  ERROR   NRT:nrt_init                                Unable to read compatible Driver version.
2022-Aug-18 08:22:45.0374  5279:5279  ERROR   NRT:nrt_init                                Please check /dev/neuron# is accessible. If you're using containers, please ensure Neuron Devices are passed to the container by specifying `--device /dev/neuron#`.
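
The error message asks whether /dev/neuron# is accessible. A quick way to check that from Python, before calling the compiled model, could look like the sketch below. It assumes the device nodes follow the /dev/neuron<N> naming mentioned in the log; find_neuron_devices and describe_devices are hypothetical helper names, not part of the Neuron SDK.

```python
import glob
import os


def find_neuron_devices(dev_dir="/dev"):
    """List paths matching neuron<N> under dev_dir, sorted by name."""
    return sorted(glob.glob(os.path.join(dev_dir, "neuron[0-9]*")))


def describe_devices(paths):
    """Map each device path to whether this process can read and write it."""
    return {path: os.access(path, os.R_OK | os.W_OK) for path in paths}


devices = find_neuron_devices()
if not devices:
    print("No /dev/neuron* devices found on this host or in this container")
else:
    for path, accessible in describe_devices(devices).items():
        print(path, "accessible" if accessible else "not accessible")
```

An empty result inside a container would point toward the devices not having been passed through with `--device /dev/neuron#`, as the log suggests; an empty result on the host itself would point toward the instance type or the driver installation.
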

aws-taylor commented 2 years ago

Hello @di0002ya,

Are you able to share the associated model configuration and checkpoint files? We have a candidate fix, but without a reproduction it's difficult to test. Specifically these files:

  det_ckpt = "checkpoints/dbnet_r18_fpnc_sbn_1200e_icdar2015_20210329-ba3ab597.pth"
  recog_ckpt = "checkpoints/textrec/abinet_academic-f718abf6.pth"
  det_config = "dbr18config.py"
  recog_config = "abinet_recog_config.py"