Xilinx / DPU-PYNQ

DPU on PYNQ

The kernel appears to have died when trying to retrain ResNet50 for MNIST classification for PYNQ-DPU compilation #75

Closed: Afef00 closed this issue 2 years ago

Afef00 commented 2 years ago

I have been trying to retrain ResNet50 for MNIST classification using the code below, following the provided example Build Machine Learning Models for DPU. However, I got the following message: "The kernel appears to have died. It will restart automatically."

import os
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)
import keras
from keras.layers import Dense, Conv2D, InputLayer, Flatten, MaxPool2D

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data() 
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)
x_train = np.repeat(x_train, 3, axis=-1)
x_test = np.repeat(x_test, 3, axis=-1)
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = tf.image.resize(x_train, [32,32]) 
x_test = tf.image.resize(x_test, [32,32])
y_train = tf.keras.utils.to_categorical(y_train , num_classes=10)
y_test = tf.keras.utils.to_categorical(y_test , num_classes=10)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

input = tf.keras.Input(shape=(32,32,3))
efnet = tf.keras.applications.ResNet50(weights='imagenet',
                                             include_top = False, 
                                             input_tensor = input)
gap = tf.keras.layers.GlobalMaxPooling2D()(efnet.output)

output = tf.keras.layers.Dense(10, activation='softmax', use_bias=True)(gap)
func_model = tf.keras.Model(efnet.input, output)
func_model.compile(optimizer='adam',
                   loss="sparse_categorical_crossentropy",
                   metrics=['accuracy'])
func_model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test),
               steps_per_epoch=1)

Any suggestions on how to solve this problem? Thanks in advance.

skalade commented 2 years ago

Hi there,

You should be using categorical_crossentropy instead of sparse_categorical_crossentropy, since your labels are one-hot encoded; as written, that should be throwing an error. The kernel is probably dying because you're running out of memory trying to process a massive batch: setting steps_per_epoch to 1 in your fit call makes the batch size equal to your entire training set. I'd change it to 60000//batch_size, where batch_size=32 or some other small value.
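For illustration, a minimal sketch of just those two changes, assuming the model and arrays from your snippet (batch_size = 32 is only an example value):

batch_size = 32  # example value; any small batch size works

func_model.compile(optimizer='adam',
                   loss='categorical_crossentropy',  # matches the one-hot labels
                   metrics=['accuracy'])

# 60000 samples // 32 per batch = 1875 steps, so each epoch covers the
# full training set in small batches instead of one 60000-sample batch
func_model.fit(x_train, y_train,
               epochs=5,
               validation_data=(x_test, y_test),
               steps_per_epoch=60000 // batch_size)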

Thanks, Shawn

Afef00 commented 2 years ago

Hello Shawn, thank you for the prompt reply. Actually, the problem persists even with a small subset of the dataset:

xtrain = x_train[0:5000]
ytrain = y_train[0:5000]
batch_size = 32
func_model.fit(xtrain, ytrain, batch_size=batch_size, epochs=5,
               steps_per_epoch=5000//batch_size, verbose=2)

As for steps_per_epoch, I used it because when fitting the model without it I got the following error: ValueError: When using data tensors as input to a model, you should specify the steps_per_epoch argument.

Thanks

skalade commented 2 years ago

So the kernel keeps dying? Is there any output in the terminal where you launched the Jupyter notebook? The code snippet you provided works for me on a fresh Docker image (vitis-ai-cpu:1.4.916) with the vitis-ai-tensorflow2 conda environment sourced; I just changed the loss function and the steps_per_epoch parameter as mentioned earlier. You also don't need to install or import keras, as it is built into TensorFlow 2 now.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# MNIST images are 28x28 grayscale; ResNet50 expects 3-channel inputs of
# at least 32x32, so add a channel axis, replicate it, and resize
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)
x_train = np.repeat(x_train, 3, axis=-1)
x_test = np.repeat(x_test, 3, axis=-1)
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255
x_train = tf.image.resize(x_train, [32, 32])
x_test = tf.image.resize(x_test, [32, 32])

# one-hot encode the labels to match categorical_crossentropy
y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

# ImageNet-pretrained ResNet50 backbone without its classifier head
input = tf.keras.Input(shape=(32, 32, 3))
efnet = tf.keras.applications.ResNet50(weights='imagenet',
                                       include_top=False,
                                       input_tensor=input)
gap = tf.keras.layers.GlobalMaxPooling2D()(efnet.output)

# new 10-class softmax head for MNIST
output = tf.keras.layers.Dense(10, activation='softmax', use_bias=True)(gap)
func_model = tf.keras.Model(efnet.input, output)

func_model.compile(optimizer='adam',
                   loss="categorical_crossentropy",
                   metrics=['accuracy'])

# 60000 // 32 steps per epoch -> batches of 32 over the full training set
func_model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test),
               steps_per_epoch=60000 // 32)
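As an aside, the ValueError about data tensors that forced steps_per_epoch in the first place usually comes from tf.image.resize returning tf.Tensor objects rather than NumPy arrays. A hedged sketch of an alternative, assuming TF 2.x eager execution (as in the vitis-ai-tensorflow2 environment): convert the resized data back to NumPy with .numpy(), after which fit() can infer batching on its own.

# .numpy() turns the EagerTensors from tf.image.resize back into plain
# NumPy arrays, so fit() no longer requires steps_per_epoch
x_train = tf.image.resize(x_train, [32, 32]).numpy()
x_test = tf.image.resize(x_test, [32, 32]).numpy()

func_model.fit(x_train, y_train, batch_size=32, epochs=5,
               validation_data=(x_test, y_test))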

If you're still having issues with training on the Docker image, I'd recommend going to the Vitis AI issue tracker.

Thanks, Shawn

Afef00 commented 2 years ago

Hello Shawn, thank you for your help, it works! Best regards, Afef00