Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0

NaN in output - running Tensorflow2 classification with softmax on DPUCZDX8G #799

Closed Orchidaceae closed 2 years ago

Orchidaceae commented 2 years ago

Summary

I have trained, quantized, and compiled a 3-layer MNIST classification model for DPUCZDX8G without errors. But when I run the model on the ZCU102 board, the classification output vector contains unexpected NaNs and zeros for most test images. Only a few test images produce ordinary float values, and even those give very poor accuracy.

Output examples:

...most of the output:

[[nan 0. nan 0. 0. nan 0. 0. 0. 0.]]

...and in some instances:

[[2.9359882e-30 3.3529205e-04 2.2591801e-06 9.9949145e-01 2.2591801e-06 1.5222234e-08 4.5376844e-05 9.3528639e-14 1.2334704e-04 8.5287086e-17]

I had no issues in the previous deployment steps, so what could cause this? Am I interpreting the output incorrectly?

MNIST model definition

inputs = tf.keras.layers.Input(shape=(784,))
dense = tf.keras.layers.Dense(128, activation='relu')(inputs)
net = tf.keras.layers.Dense(10)(dense)
prediction = tf.keras.layers.Activation('softmax')(net)
float_model = tf.keras.Model(inputs=inputs, outputs=prediction, name='mnist_softmax_model')
Model: "mnist_softmax_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 784)]             0         
_________________________________________________________________
quant_input_1 (VitisQuantize (None, 784)               4         
_________________________________________________________________
quant_dense (QuantizeWrapper (None, 128)               100487    
_________________________________________________________________
quant_dense_relu (QuantizeWr (None, 128)               4         
_________________________________________________________________
quant_dense_1 (QuantizeWrapp (None, 10)                1300      
_________________________________________________________________
activation (Activation)      (None, 10)                0         
=================================================================
Total params: 101,795
Trainable params: 101,770
Non-trainable params: 25
_________________________________________________________________

Evaluation of quantized model

2022-05-12 13:12:36.399445: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:185] None of the MLIR Optimization Passes are enabled (registered 2)
313/313 [==============================] - 1s 968us/step - loss: 0.2457 - sparse_top_k_categorical_accuracy: 0.9963
Evaluate on test data
79/79 [==============================] - 0s 2ms/step - loss: 0.2457 - sparse_top_k_categorical_accuracy: 0.9963
test loss, test acc: [0.24574635922908783, 0.9962999820709229]
Generate predictions for 3 samples
Prediction: 9   Truth:  9
Prediction: 2   Truth:  2
Prediction: 7   Truth:  7

Compiling model for DPUCZDX8G

$ ARCH=/opt/vitis_ai/compiler/arch/DPUCZDX8G/ZCU102/arch.json
$ MODEL=quantized_models/mnist_MLP_softmax_q.h5
$ NAME=mnist_MLP_softmax_c
$ vai_c_tensorflow2 -m $MODEL -a $ARCH -o compiled_models/ -n $NAME
[INFO] Namespace(batchsize=1, inputs_shape=['1,784'], layout='NHWC', model_files=['quantized_models/mnist_MLP_softmax_q.h5'], model_type='tensorflow2', named_inputs_shape=None, out_filename='/tmp/mnist_MLP_softmax_c_org.xmodel', proto=None)
in_shapes: [[1, 784]]
[INFO] tensorflow2 model: /workspace/quantized_models/mnist_MLP_softmax_q.h5
[INFO] keras version: 2.6.0
[INFO] Tensorflow Keras model type: functional                      
[INFO] parse raw model     :100%|██████████| 6/6 [00:00<00:00, 12222.35it/s]                                  
[INFO] infer shape (NHWC)  :100%|██████████| 12/12 [00:00<00:00, 13213.87it/s]                           
[INFO] perform level-0 opt :100%|██████████| 2/2 [00:00<00:00, 528.38it/s]                                      
[INFO] perform level-1 opt :100%|██████████| 2/2 [00:00<00:00, 1931.08it/s]                                        
[INFO] generate xmodel     :100%|██████████| 12/12 [00:00<00:00, 769.08it/s]                 
[INFO] dump xmodel ...
[INFO] dump xmodel: /tmp/mnist_MLP_softmax_c_org.xmodel
[UNILOG][INFO] Compile mode: dpu
[UNILOG][INFO] Debug mode: function
[UNILOG][INFO] Target architecture: DPUCZDX8G_ISA0_B4096_MAX_BG2
[UNILOG][INFO] Graph name: mnist_softmax_model, with op num: 16
[UNILOG][INFO] Begin to compile...
[UNILOG][INFO] Total device subgraph number 4, DPU subgraph number 1
[UNILOG][INFO] Compile done.
[UNILOG][INFO] The meta json is saved to "/workspace/compiled_models/meta.json"
[UNILOG][INFO] The compiled xmodel is saved to "/workspace/compiled_models//mnist_MLP_softmax_c.xmodel"
[UNILOG][INFO] The compiled xmodel's md5sum is 5706a3276db2dc636b0448d03f5c121a, and has been saved to "/workspace/compiled_models/md5sum.txt"
**************************************************
* VITIS_AI Compilation - Xilinx Inc.
*************************************************

mnist_MLP_softmax_c

Python VART code for deployment on ZCU102

import xir
import vart
from vitis_ai_library import GraphRunner
import numpy as np
import sys
import tensorflow as tf
import math

def trans_3D_to_2D(tensor):
    if type(tensor) is not np.ndarray:
        tensor = np.array(tensor)
    shape = tensor.shape
    if len(shape) != 3:
        print("Tensor needs to be 3D")
        return None
    # keep batch size (dim 0) dimension
    return tensor.reshape(shape[0], shape[1]*shape[2])

"""Import MNIST data"""
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# reshape data to 2D tensors (batch, pixel vector)
#x_train = trans_3D_to_2D(x_train)
x_test = trans_3D_to_2D(x_test)

"""Import model execution graph"""
model_name = str(sys.argv[1])#'models/MLP_2L.xmodel'
g = xir.Graph.deserialize(model_name)

"""Print graph Info"""
subgraphs = g.get_root_subgraph().toposort_child_subgraph()
for s in subgraphs: print(s.get_name())

runner = GraphRunner.create_graph_runner(g)
print("created runner")

"""Define I/O buffers"""
input_tensor_buffers = runner.get_inputs()
output_tensor_buffers = runner.get_outputs()

input_ndim = tuple(input_tensor_buffers[0].get_tensor().dims)
batch = input_ndim[0]
width = input_ndim[1]
# A 2D input (batch, features) has no third dimension.
height = input_ndim[2] if len(input_ndim) > 2 else "no dim"
fixpos = input_tensor_buffers[0].get_tensor().get_attr("fix_point")

print("INPUT INFO:\t", input_ndim, "\nBATCH\t", batch, "\nWIDTH\t", width, "\nHEIGHT\t", height, "\nFIXPOS\t", fixpos)

"""init input data """
input_data = np.asarray(input_tensor_buffers[0])

for imgnr in range(10):               #Test the first 10 MNIST images
    input_data[0] = x_test[imgnr]
    #show_img(input_data[0])

    """ run graph runner"""
    job_id = runner.execute_async(input_tensor_buffers, output_tensor_buffers)
    runner.wait(job_id)
    print("started graph runner")

    pre_output_size = int(output_tensor_buffers[0].get_tensor().get_element_num() / batch)
    raw_output = output_tensor_buffers[0]
    output_data = np.asarray(raw_output, dtype=np.float32, order="C")
    pred = np.argmax(output_data)

    # print output arrays without NaN and just a few with
    if not(np.isnan(output_data).any()) or (imgnr%201 == 0): 
        print("IMAGE NR ", imgnr)
        print("OUTPUT:\n", output_data, "\nRAW OUTPUT:\n", raw_output, "\nPRE OUTPUT SIZE:\n", pre_output_size, "\nPREDICTION:\t", pred, "\tTRUTH:\t", y_test[imgnr])

Results of running the model

The great majority of test images in the data set give an output containing only zeros and NaNs, as shown below for image nr 0.

root@xilinx-zcu102-2021_2:~/loo-vart-exploration# python mnist_runner.py models/mnist_MLP_softmax_c.xmodel 
subgraph_input_1
subgraph_quant_input_1_reshaped
subgraph_quant_dense(TransferMatMulToConv2d)
subgraph_activation
created runner
INPUT INFO:  (1, 784) 
BATCH    1 
WIDTH    784 
HEIGHT   no dim 
FIXPOS   -2
started graph runner
IMAGE NR  0
OUTPUT:
 [[nan  0. nan  0.  0. nan  0.  0.  0.  0.]] 
RAW OUTPUT:
 TensorBuffer{@0xaaaac273f930,tensor=xir::Tensor{name = activation, type = FLOAT32, shape = {1, 10}},location=HOST_VIRT,data=[(Virt=0xaaaac431eaa0, 40)]} 
PRE OUTPUT SIZE:
 10 
PREDICTION:  0  TRUTH:   7
started graph runner
started graph runner
started graph runner
started graph runner
started graph runner
started graph runner
started graph runner
started graph runner
started graph runner

Only a few test images generate an output without zeros and NaNs, for example image nr 34:

IMAGE NR  34
OUTPUT:
 [[3.6964852e-41 3.3333334e-01 3.3333334e-01 1.7476286e-22 1.0737801e-27
  3.3333334e-01 9.5417287e-21 2.6616346e-30 1.3799792e-08 4.3319380e-25]] 
RAW OUTPUT:
 TensorBuffer{@0xaaaafa467990,tensor=xir::Tensor{name = activation, type = FLOAT32, shape = {1, 10}},location=HOST_VIRT,data=[(Virt=0xaaaafc474630, 40)]} 
PRE OUTPUT SIZE:
 10 
PREDICTION:  1  TRUTH:   7

The accuracy computed from these outputs is very poor.
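One way to separate broken outputs from genuine misclassifications is to check whether each output is a valid probability vector at all. A softmax output should be finite and sum to 1, and `np.argmax` returns the index of the first NaN when one is present, which is why the NaN row for image nr 0 above gives a prediction of 0. A small sketch, where `is_valid_softmax` and its tolerance are hypothetical and not part of VART:

```python
import numpy as np

def is_valid_softmax(probs, tol=1e-3):
    """Return True if probs is finite and sums to 1 within tol (hypothetical helper)."""
    probs = np.asarray(probs, dtype=np.float32).ravel()
    return bool(np.isfinite(probs).all() and abs(float(probs.sum()) - 1.0) < tol)

# Output reported for image nr 0 in the log above.
nan_row = np.array([[np.nan, 0., np.nan, 0., 0., np.nan, 0., 0., 0., 0.]],
                   dtype=np.float32)

valid = is_valid_softmax(nan_row)
pred = int(np.argmax(nan_row))  # argmax lands on the first NaN, not a real class score
print("valid:", valid, "argmax:", pred)
```

Counting only the valid outputs when computing accuracy would show how many predictions are meaningful rather than artifacts of NaN handling.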

zz002 commented 2 years ago

Hi @Orchidaceae, the input and output of the model are in fixed-point format, so the fixed-point position needs to be taken into account when writing and reading data.
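A minimal sketch of what that scaling could look like, assuming the common convention that a float value maps to the fixed-point value `float * 2**fix_point` on input, and back via division on output. The helper names `to_fixed_point` and `from_fixed_point` are hypothetical, and whether raw 0-255 pixels or normalized pixels should be scaled depends on how the model was trained and quantized. The example uses the FIXPOS of -2 reported in the log above.

```python
import numpy as np

def to_fixed_point(x, fix_point):
    """Scale float data by 2**fix_point and saturate to int8 (assumed convention)."""
    scaled = np.round(np.asarray(x, dtype=np.float32) * (2.0 ** fix_point))
    return np.clip(scaled, -128, 127).astype(np.int8)

def from_fixed_point(q, fix_point):
    """Map int8 fixed-point data back to float by dividing by 2**fix_point."""
    return np.asarray(q, dtype=np.float32) / (2.0 ** fix_point)

# Three sample pixel values standing in for a flattened MNIST image.
pixels = np.array([0, 128, 255], dtype=np.uint8)
fixed = to_fixed_point(pixels, fix_point=-2)
print(fixed, fixed.dtype)
```

In the runner script this would correspond to something like `input_data[0] = to_fixed_point(x_test[imgnr], fixpos)`, provided the input buffer is int8. The output tensor in the log is already reported as FLOAT32, so `from_fixed_point` would only apply to outputs that are still in fixed-point form.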

qianglin-xlnx commented 2 years ago

Hi @Orchidaceae Is this issue solved?

qianglin-xlnx commented 2 years ago

Hi @Orchidaceae Since we haven't received your reply for a long time, we assume you have solved this issue and I'm going to close it. If you still have any questions, please feel free to reopen it. Thank you very much.