Xilinx / Vitis-AI

Vitis AI is Xilinx’s development stack for AI inference on Xilinx hardware platforms, including both edge devices and Alveo cards.
https://www.xilinx.com/ai
Apache License 2.0

Dense layer only outputs correct values for the first element of each batch when run on FPGA #535

Open brunorochapaiva opened 3 years ago

brunorochapaiva commented 3 years ago

Hi,

I am trying to test the correctness of some small models when run on an FPGA by comparing their results against their original TensorFlow counterparts. Right now I am testing a simple model containing just a Dense layer, which takes a batch of vectors of shape (4,1,1,x) and outputs a batch of shape (4,1,1,y). The odd behaviour depends on the value of x: if it is less than 49, the output for the first batch element is repeated for all the other batch elements, despite them having different inputs. If x is 49 or more, the individual batch outputs are all different, but only the first output of the batch is correct.

For example, if I let x = 10 and y = 5, I get the output:

[[[[ 0.453125  0.53125  -0.0625    0.953125 -0.671875]]]
 [[[ 0.453125  0.53125  -0.0625    0.953125 -0.671875]]]
 [[[ 0.453125  0.53125  -0.0625    0.953125 -0.671875]]]
 [[[ 0.453125  0.53125  -0.0625    0.953125 -0.671875]]]]

while the expected output is:

[[[[ 0.4522787   0.52991605 -0.05217599  0.9632916  -0.65929407]]]
 [[[ 0.00188637  0.97306997 -0.2423391   0.8426523  -0.7733854 ]]]
 [[[ 0.5069117   0.65224564 -0.42477283  1.017879   -0.28900498]]]
 [[[-0.0895794   0.6173701   0.53645504 -0.15798911 -0.47371125]]]]

On the other hand, with x = 50 and y = 5, the output is:

[[[[ 0.390625 -0.984375  0.140625  0.171875  0.09375 ]]]
 [[[ 0.046875 -0.796875  0.609375  0.4375   -0.140625]]]
 [[[-0.390625 -1.015625  0.5      -0.140625  0.1875  ]]]
 [[[ 0.       -0.21875  -0.25     -0.125    -0.59375 ]]]]

and the expected output:

[[[[ 0.40454012 -0.99046504  0.15005898  0.19190526  0.10348301]]]
 [[[-0.35887206 -0.8224896   0.31811988  0.64625776  0.04913734]]]
 [[[ 0.8148414  -0.4608903   1.3942263   0.28130963 -0.03830071]]]
 [[[ 0.12443601 -1.1203431   0.9184674  -0.00880772 -0.6993512 ]]]]

While the first vector of the actual output is not exact, it is much closer to the expected output, with differences on the order of 1e-2, whereas the other vectors differ by around 1e-1 or more.
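The gap is easy to quantify: comparing the quoted x = 10 DPU output against the expected values element by element shows only batch element 0 falling within quantisation error (a small numpy check over the numbers above):

```python
import numpy as np

# DPU output (every row identical) and TF expected output for x = 10, y = 5
dpu = np.array([[0.453125, 0.53125, -0.0625, 0.953125, -0.671875]] * 4)
expected = np.array([
    [ 0.4522787,   0.52991605, -0.05217599,  0.9632916,  -0.65929407],
    [ 0.00188637,  0.97306997, -0.2423391,   0.8426523,  -0.7733854 ],
    [ 0.5069117,   0.65224564, -0.42477283,  1.017879,   -0.28900498],
    [-0.0895794,   0.6173701,   0.53645504, -0.15798911, -0.47371125],
])

# Largest absolute error per batch element
per_elem_err = np.abs(dpu - expected).max(axis=1)
print(per_elem_err)

# Only element 0 is within quantisation error; the rest are far off
assert per_elem_err[0] < 0.05 and (per_elem_err[1:] > 0.1).all()
```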

I have defined my model using the Keras Sequential API as follows, where x and y are the input and output sizes:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.models.Sequential([
  keras.Input((x,), name="input0", batch_size=4),
  layers.Dense(y, name="output")
])
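For reference, the computation this model should perform is an independent affine map per batch element; a minimal numpy sketch (random weights and the illustrative sizes x = 10, y = 5, not the actual trained model):

```python
import numpy as np

batch, x, y = 4, 10, 5              # sizes from the example above
rng = np.random.default_rng(0)

inputs = rng.standard_normal((batch, 1, 1, x))
W = rng.standard_normal((x, y))     # Dense kernel
b = rng.standard_normal(y)          # Dense bias

# Each batch element gets its own matrix-vector product
out = (inputs.reshape(batch, x) @ W + b).reshape(batch, 1, 1, y)

# Distinct inputs must give distinct outputs, unlike the DPU result above
assert out.shape == (4, 1, 1, 5)
assert not np.allclose(out[0], out[1])
```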

After I quantise this model, I compare the quantised Keras model's results with those of the original model, and they output similar values, so the issue seems to arise once the model is compiled for the DPU.

I suspect this might be an issue with how I am defining my model, since when I run the VART resnet50 example the results are correct for the images I give it.

I am running this on an AWS F1 instance targeting the DPUCADF8H, using the tensorflow2 flow inside the Vitis-AI CPU image. Any help understanding why this is happening would be much appreciated!

shua1zhang commented 2 years ago

Hi @brunorochapaiva ,

In your model you are trying to use different batch numbers correct? For DPU IP, I think it's just batch 1 based.

Hi @qianglin-xlnx , Could you help to check this question and confirm?

Thank you.

brunorochapaiva commented 2 years ago

> Hi @brunorochapaiva ,
>
> In your model you are trying to use different batch numbers correct? For DPU IP, I think it's just batch 1 based.

Interesting, but vai_c_tensorflow2 will only compile the model if I set the batch size to 4. If I don't specify the batch size as 4, it throws the following warning when compiling:

...
[UNILOG][WARNING] DPU prefers xir::Op{name = quant_output0(TransferMatMulToConv2d), type = conv2d-fix}'s input batch to be 4, but it's 1 now. So it will be assigned to CPU.
[UNILOG][WARNING] xir::Op{name = quant_output0(TransferMatMulToConv2d), type = conv2d-fix} has been assigned to CPU.
[UNILOG][INFO] Total device subgraph number 2, DPU subgraph number 0
...

One of the DPUCADF8H examples does mention that the batch size should be 4: https://github.com/Xilinx/Vitis-AI/tree/master/examples/DPUCADF8H/tf_resnet#compile-the-model. As far as I could tell, when I followed that example with a batch size of 4, everything worked correctly.

paolodalberto commented 2 years ago

Sorry for the late reply; I only received the notification very recently. Version 2.0 should be able to handle any batch size.

Please note that if the batch size is 1 you will get 1/4 of the performance, which I do not think you want. Batch size 4 is a must if you want to achieve peak performance; otherwise, the compiler will now let you choose, and the runtime should be able to handle it as well.

paolodalberto commented 2 years ago

Reading the first message, @brunorochapaiva, the problem is not the batch but, I think, the reshape of the first layer. This is a HW optimization where the HW tries to reshape the input for better performance. Thank you for the test case. I will check what the compiler does in such a case, and I will ask my colleagues for a little help figuring out what the runtime does.

If you are testing a single dense (FC) layer, the compiler should not do anything: the FC is translated into a 1x1 convolution, where the reshape is not applicable. This seems to be a runtime/HW issue, and I think it has been resolved recently.
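The FC-to-convolution rewrite mentioned here (TransferMatMulToConv2d in the compiler log above) can be illustrated in numpy; a sketch of the equivalence, not the compiler's actual implementation:

```python
import numpy as np

# A Dense layer over (N, 1, 1, Cin) equals a 1x1 convolution: the same
# Cin -> Cout affine map is applied at every (here, the single) spatial position.
N, Cin, Cout = 4, 10, 5
rng = np.random.default_rng(1)
x = rng.standard_normal((N, 1, 1, Cin))
W = rng.standard_normal((Cin, Cout))
b = rng.standard_normal(Cout)

dense = (x.reshape(N, Cin) @ W + b).reshape(N, 1, 1, Cout)  # FC view
conv1x1 = np.einsum('nhwc,co->nhwo', x, W) + b              # 1x1-conv view

assert np.allclose(dense, conv1x1)
```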

Would you mind sharing the tensorflow2 model?

cl600class commented 2 years ago

Hi @paolodalberto , I have run into the same problem, but my model is a CNN with several residual blocks. The input/output sizes are (N,3600,1,1) and (N,1200,1,5). The inference results with N=1 on ZCU102, U250 (Azure), and AWS are correct; however, with N=4 on both Azure and AWS, only the first element of each batch is correct.

Here is the case of 8 input samples (the singleton axis 2 is squeezed; the data shown is the slice (8, :2, :)).

N=1 (expected)

[[[ 2.375  0.5   -0.75  -0.5   -1.125]
  [ 3.375 -0.875 -0.75  -1.625  0.125]]

 [[ 4.375  0.25  -1.25  -0.25  -0.75 ]
  [ 3.5    1.75  -1.375  1.375 -2.   ]]

 [[ 3.875  0.5    0.125  0.125 -0.625]
  [ 4.375  1.25  -1.625  0.75  -1.625]]

 [[ 2.375 -1.5   -0.625  0.375 -0.75 ]
  [ 2.375 -2.5    1.    -0.875  0.875]]

 [[ 3.375  1.375 -1.     0.125 -1.625]
  [ 4.125  2.625 -1.5    1.625 -3.125]]

 [[ 4.375 -1.5   -0.5    0.75   0.25 ]
  [ 4.625 -1.25   0.     0.25   0.75 ]]

 [[ 1.875 -1.    -1.25  -0.5   -0.25 ]
  [ 2.25  -1.875 -0.625 -2.25   1.5  ]]

 [[ 3.625 -1.75  -0.625 -1.5    0.625]
  [ 5.25  -2.875  0.    -3.125  3.   ]]]

N=4 (only outputs 0 & 4 match)

[[[ 2.375  0.5   -0.75  -0.5   -1.125]
  [ 3.375 -0.875 -0.75  -1.625  0.125]]

 [[ 6.125 -2.25  -0.625 -1.25   0.625]
  [ 7.75  -3.5    0.    -2.25   1.   ]]

 [[ 6.125 -2.25  -0.625 -1.25   0.625]
  [ 7.75  -3.5    0.    -2.25   1.   ]]

 [[ 6.125 -2.25  -0.625 -1.25   0.625]
  [ 7.75  -3.5    0.    -2.25   1.   ]]

 [[ 3.375  1.375 -1.     0.125 -1.625]
  [ 4.125  2.625 -1.5    1.625 -3.125]]

 [[ 6.125 -2.25  -0.625 -1.25   0.625]
  [ 7.75  -3.5    0.    -2.25   1.   ]]

 [[ 6.125 -2.25  -0.625 -1.25   0.625]
  [ 7.75  -3.5    0.    -2.25   1.   ]]

 [[ 6.125 -2.25  -0.625 -1.25   0.625]
  [ 7.75  -3.5    0.    -2.25   1.   ]]]
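A quick way to see which batch elements survive is to compare the batched run against per-sample (N=1) references; a sketch with synthetic data (the helper name is made up for illustration):

```python
import numpy as np

def matched_batch_elements(reference, batched, atol=1e-3):
    """Indices whose batched output matches the N=1 reference within tolerance."""
    return [i for i in range(len(reference))
            if np.allclose(reference[i], batched[i], atol=atol)]

# Synthetic reproduction of the failure mode reported above: in an 8-sample
# run, only offsets 0 and 4 (the first element of each batch of 4) survive.
rng = np.random.default_rng(2)
ref = rng.standard_normal((8, 2, 5))
bad = ref.copy()
for i in (1, 2, 3, 5, 6, 7):
    bad[i] += 1.0  # corrupt everything except batch offsets 0 and 4

print(matched_batch_elements(ref, bad))  # [0, 4]
```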

The compiled model with N=4 for U250 is shown below (attached image: dna2d_u250_1).

I'm using FPGAs on cloud platforms that currently do not provide version 2.0 support. Is there any solution at this point?

paolodalberto commented 2 years ago

I think one of my colleagues is working on this. Let me check whether the fix has been pushed.

paolodalberto commented 2 years ago

I am asking if this has been resolved.

williamliao28 commented 2 years ago

I have a similar issue when testing my own implementation of ResNet18 using TensorFlow 2 with the xilinx/vitis-ai-cpu:1.4.1.978 docker image on an AWS F1 instance (DPUCADF8H).

paolodalberto commented 2 years ago

Still asking ... I want to see this resolved :)

cl600class commented 2 years ago

Sorry for the late reply. I've posted the issue here and discussed it with @gguasti. We found that the input channel size should be 3. Unfortunately, the problem still exists after compiling with Vitis AI 2.5 but deploying with a possibly deprecated(?) .xclbin on Azure. I'm not sure how to generate the new .xclbin file properly.

gguasti commented 2 years ago


Hello, please follow PG367: https://docs.xilinx.com/r/en-US/pg367-dpucahx8h/Development-Flow. The dpu.xclbin generation is described here: https://docs.xilinx.com/r/en-US/pg367-dpucahx8h/Generating-the-Bitstream. Kind regards, Giovanni


paolodalberto commented 2 years ago

With channel 3, we should have had this nailed from the start :) Please follow up with @gguasti.

For channel 1, I am still bugging my colleague. I want to have this fixed and to provide a tested solution. As for the delivery of that solution: a 2.5.1, I do not know; hopefully not a 3.0.