NVIDIA / cuda-samples

Samples for CUDA developers that demonstrate features in the CUDA Toolkit

Bug in tensor core programming #251

Open blueWatermelonFri opened 9 months ago

blueWatermelonFri commented 9 months ago

I encountered a strange bug while programming tensor cores with the WMMA API on an A800. I tried to print the size of an element in the fragment. Normally sizeof(fp16) is 2, and the following code also prints 2 for each element.

    wmma::load_matrix_sync(a_frag, a , lda);
    wmma::load_matrix_sync(b_frag, b , ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    printf("%d\n", sizeof(a_frag.x[0]));
    printf("%d\n", sizeof(a_frag.x[1]));

However, when I changed the code to the following form, the print statement outputs 2 and 0. Even if I swap the order of a_frag.x[0] and a_frag.x[1], the output is still 2 and 0. Does anyone know why?

    wmma::load_matrix_sync(a_frag, a , lda);
    wmma::load_matrix_sync(b_frag, b , ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    printf("%d %d\n", sizeof(a_frag.x[0]), sizeof(a_frag.x[1]));

My code runs in the following environment; the OS is Ubuntu 20.04.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A800 80GB PCIe          Off | 00000000:03:00.0 Off |                    0 |
| N/A   43C    P0              68W / 300W |   2768MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

My compile command is:

    nvcc tensor_core.cu -std=c++11 -lcublas -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -res-usage -lcudart  -lineinfo -Xcompiler -fopenmp

yuantailing commented 5 months ago

Hello @blueWatermelonFri,

The format specifier %d expects an argument of type int, not size_t. The correct version of the code is

    printf("%d %d\n", (int)sizeof(a_frag.x[0]), (int)sizeof(a_frag.x[1]));

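I guess the reason for the output of 2 and 0 is that size_t takes 8 bytes, so each sizeof result is passed as 0x0000000000000002. Each %d reads only 4 bytes, so the low half, 0x00000002, takes the place of the first %d, and the high half, 0x00000000, takes the place of the second %d. The second sizeof value is never printed, which is why swapping the two arguments does not change the output.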
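
To make that guess concrete, here is a minimal host-side C++ sketch of the byte splitting it describes. The helper name split_as_two_ints is hypothetical, and the sketch only models the suspected behavior (an 8-byte value read back as two 4-byte integers on a little-endian machine); it does not reproduce how the device-side printf actually stores its arguments.

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Hypothetical helper: reinterpret the 8 bytes of a 64-bit value as two
// consecutive 32-bit integers, in memory order. This mimics what happens
// when a reader expecting two ints consumes one 8-byte size_t.
std::pair<std::uint32_t, std::uint32_t> split_as_two_ints(std::uint64_t value) {
    std::uint32_t halves[2];
    static_assert(sizeof(halves) == sizeof(value), "sizes must match");
    // memcpy avoids the aliasing problems of a pointer cast.
    std::memcpy(halves, &value, sizeof(value));
    return {halves[0], halves[1]};
}
```

On a little-endian machine, split_as_two_ints(2) returns the pair (2, 0): the low half comes first in memory and the high half is zero, which matches the "2 0" seen in the issue.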

Additionally, you may have received a warning like:

warning #181-D: argument is incompatible with corresponding format string conversion (expected type "int" but argument has type "unsigned long long")
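
As a rough host-side illustration of how to keep the format string and the argument types in agreement, the sketch below formats two sizes in two ways: with %d after an explicit cast (the fix suggested above) and with %zu, the standard C and C++ conversion for size_t. The function names are hypothetical. Whether device-side printf accepts %zu is not confirmed anywhere in this thread, so the cast is the conservative choice inside a kernel.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: format two sizes with %d after an explicit cast to int,
// so each argument is exactly the type the conversion expects.
std::string format_sizes_cast(std::size_t a, std::size_t b) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%d %d",
                  static_cast<int>(a), static_cast<int>(b));
    return std::string(buf);
}

// Hypothetical helper: format two sizes with %zu, which matches size_t
// directly in host code and needs no cast.
std::string format_sizes_zu(std::size_t a, std::size_t b) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%zu %zu", a, b);
    return std::string(buf);
}
```

Both helpers produce "2 2" for the arguments (2, 2), because in each case the conversion and the argument type agree.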