fastmachinelearning / hls4ml

Machine learning on FPGAs using HLS
https://fastmachinelearning.org/hls4ml
Apache License 2.0

Number of filters limitation #572

Closed wilfredkisku closed 1 year ago

wilfredkisku commented 2 years ago

Is there a limit on the number of filters in a CNN? A layer with 32 filters becomes the bottleneck during synthesis: synthesis is unable to complete and gets stuck at the conv2d layer with 32 filters.

vloncar commented 2 years ago

Depends on your config. Assuming you use io_stream, the limit will be related to the strategy used, since that affects the algorithm used for the CNN kernel. If you use latency strategy (the default), then filt_height x filt_width x n_channels x n_filters < 4096. If you use io_parallel, well, you shouldn't be using it with large models.
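
As a quick sanity check against that limit, here is a sketch (the 3x3 kernel, 16 input channels, and 32 filters are assumed from the report above, not stated in it):

filt_height, filt_width, n_channels, n_filters = 3, 3, 16, 32
n_mult = filt_height * filt_width * n_channels * n_filters
# 3*3*16*32 = 4608 > 4096: a 32-filter layer on 16 channels exceeds
# the latency-strategy unroll limit, matching the observed bottleneck
print(n_mult, 'OK' if n_mult < 4096 else 'exceeds 4096')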

wilfredkisku commented 2 years ago

I have used io_stream, and the strategy used is Resource. Another issue I am facing: when I use the config file to generate the hls4ml model from the quantized model, the accuracy drops from ~75% to ~10%. I checked that the baseline quantized model predicts with an accuracy of around ~71%, but the hls4ml model drops the accuracy after synthesizing the model.

jmduarte commented 2 years ago

Hi @wilfredkisku, can you share your model? You may want to look into the tracing/profiling functionality.

You can make 1D plots of the expected output vs hls4ml output like so: https://github.com/hls4ml-finn-mlperftiny/CIFAR10/blob/main/hls4ml/convert.py#L222-L233

[profiling plot: profiling_18_q_dense]

This can help you pinpoint which layers are causing a mismatch. Then you can increase the precision of those layers. Usually it's required to either increase the precision of the outputs or the accumulators (or both).
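
A minimal tracing sketch along those lines, assuming the names qmodel, hls_config_q, hls_model_q, and a sample array X_test (placeholders, not from this thread):

import numpy as np
import hls4ml

# Enable per-layer tracing in the config before building the hls4ml model
for layer in hls_config_q['LayerName']:
    hls_config_q['LayerName'][layer]['Trace'] = True

# ... rebuild hls_model_q from this config, then:
hls_model_q.compile()

X_sample = np.ascontiguousarray(X_test[:100])
hls_pred, hls_trace = hls_model_q.trace(X_sample)
keras_trace = hls4ml.model.profiling.get_ymodel_keras(qmodel, X_sample)

# Largest absolute disagreement per layer shows where precision is lost
for name in hls_trace:
    if name in keras_trace:
        a = hls_trace[name].reshape(len(X_sample), -1)
        b = keras_trace[name].reshape(len(X_sample), -1)
        print(name, np.max(np.abs(a - b)))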

wilfredkisku commented 2 years ago

Thanks @jmduarte for the help. I am including the model below to give a better picture of what I am trying to synthesize with hls4ml. I have tested with an IFM bit precision of 4 and a weight precision of 12, and also 16 and 16, but the accuracy of the Keras model and the hls4ml model still differs a lot.

from qkeras import QActivation
from qkeras import QDense, QConv2DBatchnorm
from tensorflow.keras.layers import (Activation, Add, Dense, Flatten, Input,
                                     MaxPooling2D)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l1

IFM = 16  # activation (input feature map) bit width
WGT = 16  # weight bit width

def QResNet9(input_shape = (32,32,3), classes = 10):
    img_input = Input(shape=input_shape)
    x = QActivation('quantized_relu('+str(IFM)+',0)',name='relu_in')(img_input)

    x = QConv2DBatchnorm(16, kernel_size = (3,3), strides=(1,1),
                         kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
                         kernel_initializer='lecun_uniform',  
                         kernel_regularizer=l1(0.0001), name='conv1', use_bias = True)(x)
    x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv1')(x)

    x = QConv2DBatchnorm(16, kernel_size = (3,3), strides=(1,1),
                         kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
                         kernel_initializer='lecun_uniform',  
                         kernel_regularizer=l1(0.0001), name='conv2', use_bias = True)(x)
    x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv2')(x)
    x = MaxPooling2D(pool_size=(2, 2), name='pool1')(x)

    x_skip = x
    x_skip = QConv2DBatchnorm(16, kernel_size = (3,3), strides=(1,1),
                         kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
                         kernel_initializer='lecun_uniform',  
                         kernel_regularizer=l1(0.0001), name='conv3', use_bias = True)(x_skip)
    x_skip = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv3_skip')(x_skip)

    x = QConv2DBatchnorm(16, kernel_size = (3,3), strides=(1,1),
                         kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
                         kernel_initializer='lecun_uniform',  
                         kernel_regularizer=l1(0.0001), name='conv4', use_bias = True)(x)
    x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv4')(x)

    #x = QConv2DBatchnorm(16, kernel_size = (3,3), strides=(1,1),
    #                     kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
    #                     kernel_initializer='lecun_uniform',  
    #                     kernel_regularizer=l1(0.0001), name='conv4', use_bias = True)(x)
    #x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv4')(x)

    x = Add()([x, x_skip])

    x = QConv2DBatchnorm(24, kernel_size = (3,3), strides=(1,1),
                         kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
                         kernel_initializer='lecun_uniform',  
                         kernel_regularizer=l1(0.0001), name='conv5', use_bias = True)(x)
    x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv5')(x)
    x = MaxPooling2D(pool_size=(2, 2), name='pool2')(x)

    #x = QConv2DBatchnorm(32, kernel_size = (3,3), strides=(1,1),
    #                     kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
    #                     kernel_initializer='lecun_uniform',  
    #                     kernel_regularizer=l1(0.0001), name='conv6', use_bias = True)(x)
    #x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv6')(x)
    #x = MaxPooling2D(pool_size=(2, 2), name='pool3')(x)

    x_skip = x
    x_skip = QConv2DBatchnorm(24, kernel_size = (3,3), strides=(1,1),
                         kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
                         kernel_initializer='lecun_uniform',  
                         kernel_regularizer=l1(0.0001), name='conv6', use_bias = True)(x_skip)
    x_skip = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv6_skip')(x_skip)

    x = QConv2DBatchnorm(24, kernel_size = (3,3), strides=(1,1),
                         kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
                         kernel_initializer='lecun_uniform',  
                         kernel_regularizer=l1(0.0001), name='conv7', use_bias = True)(x)
    x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv7')(x)
    #x = MaxPooling2D(pool_size=(2, 2), name='pool4')(x)

    #x = QConv2DBatchnorm(32, kernel_size = (3,3), strides=(1,1),
    #                     kernel_quantizer="quantized_bits("+str(WGT)+",0,alpha=1)",
    #                     kernel_initializer='lecun_uniform',  
    #                     kernel_regularizer=l1(0.0001), name='conv8', use_bias = True)(x)
    #x = QActivation('quantized_relu('+str(IFM)+',0)', name='relu_conv8')(x)

    x = Add()([x, x_skip])
    x = MaxPooling2D()(x)
    #x = MaxPooling2D(pool_size=(2, 2), name='pool5')(x)

    x = Flatten()(x)
    x = Dense(classes, name='output_dense')(x)
    x_out = Activation('softmax',name='output_softmax')(x)

    qmodel = Model(inputs=[img_input], outputs=[x_out], name='qkeras')
    return qmodel
Accuracy Keras:  0.7033333333333334
Accuracy hls4ml: 0.11666666666666667
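
For reference, a sketch of how such an accuracy comparison can be computed (X_test/y_test are assumed names, and hls_model_q is built from the configuration shown below):

import numpy as np

y_keras = np.argmax(qmodel.predict(X_test), axis=1)
print('Accuracy Keras: ', np.mean(y_keras == y_test))

# hls4ml predict expects a contiguous float array
y_hls = np.argmax(hls_model_q.predict(np.ascontiguousarray(X_test)), axis=1)
print('Accuracy hls4ml:', np.mean(y_hls == y_test))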

I am including the configuration for hls4ml that I have used.

# Then the QKeras model
hls4ml.model.optimizer.OutputRoundingSaturationMode.layers = ['Activation']
hls4ml.model.optimizer.OutputRoundingSaturationMode.rounding_mode = 'AP_RND'
hls4ml.model.optimizer.OutputRoundingSaturationMode.saturation_mode = 'AP_SAT'

hls_config_q = hls4ml.utils.config_from_keras_model(qmodel, granularity='name')
hls_config_q['Model']['Strategy'] = 'Resource'
hls_config_q['Model']['ReuseFactor'] = 144
hls_config_q['Model']['Precision'] = 'ap_fixed<16,6>'

hls_config_q['LayerName']['conv1']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv1']['ReuseFactor'] = 108

hls_config_q['LayerName']['conv2']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv2']['ReuseFactor'] = 144

hls_config_q['LayerName']['conv3']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv3']['ReuseFactor'] = 144

hls_config_q['LayerName']['conv4']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv4']['ReuseFactor'] = 144

hls_config_q['LayerName']['conv5']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv5']['ReuseFactor'] = 144

hls_config_q['LayerName']['conv6']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv6']['ReuseFactor'] = 144

hls_config_q['LayerName']['conv7']['Strategy'] = 'Resource'
hls_config_q['LayerName']['conv7']['ReuseFactor'] = 144

hls_config_q['LayerName']['output_dense']['Strategy'] = 'Resource'
hls_config_q['LayerName']['output_dense']['ReuseFactor'] = 160

hls_config_q['LayerName']['output_softmax']['Strategy'] = 'Stable'
plotting.print_dict(hls_config_q)

cfg_q = hls4ml.converters.create_config(backend='Vivado')
cfg_q['IOType']     = 'io_stream' # Must set this if using CNNs!
cfg_q['HLSConfig']  = hls_config_q
cfg_q['KerasModel'] = qmodel
cfg_q['OutputDir']  = 'quantized_cnn_model_C/'
cfg_q['XilinxPart'] = 'xczu7ev-ffvc1156-2-e'
#cfg_q['XilinxPart'] = 'xcu250-figd2104-2L-e'

hls_model_q = hls4ml.converters.keras_to_hls(cfg_q)
hls_model_q.compile()

A few other details I want to include:

  1. I am using Ubuntu installed in a VM with ~45 GB of RAM allocated to it.
  2. If I use more than 16 filters in any layer, the synthesis build gets stuck during loop unrolling of that convolutional layer. Is there a way I can increase the number of filters and still synthesize without this issue?
  3. I am targeting a Xilinx UltraScale+ MPSoC ZCU104.

liuhao-97 commented 2 years ago

Hi, have you solved your problem? I also ran into the accuracy problem when I tested a ResNet.

liuhao-97 commented 2 years ago

Hi, have you solved your problem? Maybe you can compare the output of the "output_softmax" layer between the hls_model and the Keras model. This is my issue: https://github.com/fastmachinelearning/hls4ml/issues/590

wilfredkisku commented 2 years ago

@liuhao-97 thank you for the reply. No, I could not get it corrected; it still has the accuracy drop. Is there any other way to rectify the issue you pointed out?

liuhao-97 commented 2 years ago

Hi, have you tried a full-precision model (ap_fixed<32,16>)? I mean, don't quantize the model, and set the hls4ml config to ap_fixed<32,16>. Maybe you can then compare the output of the last softmax layer between the Keras model and the hls model.
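
A sketch of that check, reusing the names from the config posted above (the output directory name is a placeholder):

import numpy as np
import hls4ml

# Wide fixed-point everywhere, so only the conversion itself is tested
hls_config_fp = hls4ml.utils.config_from_keras_model(qmodel, granularity='name')
hls_config_fp['Model']['Precision'] = 'ap_fixed<32,16>'

cfg_fp = hls4ml.converters.create_config(backend='Vivado')
cfg_fp['IOType'] = 'io_stream'
cfg_fp['HLSConfig'] = hls_config_fp
cfg_fp['KerasModel'] = qmodel
cfg_fp['OutputDir'] = 'fp_cnn_model/'

hls_model_fp = hls4ml.converters.keras_to_hls(cfg_fp)
hls_model_fp.compile()

# Compare the final softmax outputs on a few samples
y_keras = qmodel.predict(X_test[:10])
y_hls = hls_model_fp.predict(np.ascontiguousarray(X_test[:10]))
print(np.abs(y_keras - y_hls).max())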

liuhao-97 commented 2 years ago

For me, I found there might be a problem with the softmax layer. I printed the output of the dense layer and it matches fine, but for the softmax layer the output is totally different. If you check this link https://github.com/hls4ml-finn-mlperftiny/CIFAR10/blob/main/hls4ml/convert.py you will find it removes the softmax layer, so I assume there might be a problem with the softmax layer.
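
One way to test that, sketched here: cut the Keras model just before the softmax and compare at the logits; since softmax is monotonic, the argmax accuracy is unchanged:

import numpy as np
from tensorflow.keras.models import Model

# Logits-only model ending at the dense layer, before 'output_softmax'
logits_model = Model(qmodel.input, qmodel.get_layer('output_dense').output)

y_logits = logits_model.predict(X_test[:10])
print(np.argmax(y_logits, axis=1))  # same predicted classes as with softmax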

wilfredkisku commented 2 years ago

@liuhao-97 I tried to profile the layers but came up with a `graph disconnected` error:

ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor(type_spec=TensorSpec(shape=(None, 32, 32, 1), dtype=tf.float32, name='input_1'), name='input_1', description="created by layer 'input_1'") at layer "prune_low_magnitude_conv1". The following previous layers were accessed without issue: []

wilfredkisku commented 2 years ago

Hi, have you tried a full-precision model (ap_fixed<32,16>)? I mean, don't quantize the model, and set the hls4ml config to ap_fixed<32,16>. Maybe you can then compare the output of the last softmax layer between the Keras model and the hls model.

The full-precision model works fine; for me the accuracy drops only for the quantized hls model.

liuhao-97 commented 2 years ago

Can you print your output? Does it consist of repeated identical numbers and zeros, like [0.25, 0.25, 0.25, 0, 0, 0]?

wilfredkisku commented 2 years ago

I am not able to print the output yet.

liuhao-97 commented 2 years ago

@liuhao-97 I tried to profile the layers but came up with a `graph disconnected` error:

ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor(type_spec=TensorSpec(shape=(None, 32, 32, 1), dtype=tf.float32, name='input_1'), name='input_1', description="created by layer 'input_1'") at layer "prune_low_magnitude_conv1". The following previous layers were accessed without issue: []

Did you prune the model? I think a quantized pruned model doesn't work well with hls4ml. Maybe you can try a quantized model without pruning it.
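
If the model was trained with the tensorflow_model_optimization pruning wrappers (which the `prune_low_magnitude_conv1` layer name suggests), the wrappers can also be stripped before conversion; a sketch, where pruned_qmodel is a placeholder for the trained model:

import tensorflow_model_optimization as tfmot

# Removes the prune_low_magnitude wrappers; the (sparse) weights are kept,
# and the layers become plain (Q)Keras layers again
stripped_model = tfmot.sparsity.keras.strip_pruning(pruned_qmodel)
stripped_model.summary()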

liuhao-97 commented 2 years ago

Besides, which hls4ml version are you using? hls4ml 0.6.0 or the newest branch?

wilfredkisku commented 2 years ago

I am using hls4ml 0.6.0. Has this issue been resolved in the new branch?

liuhao-97 commented 2 years ago

I am using hls4ml 0.6.0. Has this issue been resolved in the new branch?

Not sure. You can try the new branch. Besides, have you tried "io_type='io_parallel'"? Maybe it can solve the problem.

liuhao-97 commented 2 years ago

Maybe you can also check https://github.com/fastmachinelearning/hls4ml/pull/448.

wilfredkisku commented 2 years ago

@liuhao-97 I tried, but it still did not work for me. Did you find a workaround that keeps the accuracy from dropping?

liuhao-97 commented 2 years ago

@liuhao-97 I tried to profile the layers but came up with a `graph disconnected` error:

ValueError: Graph disconnected: cannot obtain value for tensor KerasTensor(type_spec=TensorSpec(shape=(None, 32, 32, 1), dtype=tf.float32, name='input_1'), name='input_1', description="created by layer 'input_1'") at layer "prune_low_magnitude_conv1". The following previous layers were accessed without issue: []

I think it is because you pruned the model. When you prune the model, somehow the original layer-by-layer connections go wrong, which can be seen in your error. Can you try again with a non-pruned model to see if there is still an accuracy loss?

wilfredkisku commented 2 years ago

@liuhao-97 yes, the error went away after I removed pruning. Thank you.

wilfredkisku commented 2 years ago

@jmduarte models that have Concatenate or Add layers show a considerable accuracy drop. This might be a bug.