Bug in tensor core programming

I ran into a strange bug while programming tensor cores with the WMMA API on an A800. I tried to print the size of the elements in a fragment. Normally sizeof(fp16) is 2, and the following code also prints 2 for each element.

    wmma::load_matrix_sync(a_frag, a, lda);
    wmma::load_matrix_sync(b_frag, b, ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    printf("%d\n", sizeof(a_frag.x[0]));
    printf("%d\n", sizeof(a_frag.x[1]));

However, when I changed the code to the following form, the print statement output 2 and 0. Even when I swapped the order of a_frag.x[0] and a_frag.x[1], the output was still 2 and 0. Does anyone know why?

    wmma::load_matrix_sync(a_frag, a, lda);
    wmma::load_matrix_sync(b_frag, b, ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    printf("%d %d\n", sizeof(a_frag.x[0]), sizeof(a_frag.x[1]));
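
One possible explanation, offered as an assumption rather than a confirmed diagnosis: sizeof yields a size_t, which is 8 bytes on 64-bit Linux, while %d expects a 4-byte int. If device-side printf reads its packed arguments according to the format string, the first %d would consume the low half of the first size_t (2) and the second %d its high half (0), regardless of argument order. The sketch below casts each size to int so the arguments match the format. The kernel name print_fragment_sizes, the 16x16x16 half/float fragment shapes, the layouts, and the launch configuration are all hypothetical, since the original declarations are not shown.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// Hypothetical kernel: one warp loads a 16x16x16 tile, runs one MMA,
// and prints the size of two fragment elements.
__global__ void print_fragment_sizes(const half *a, const half *b,
                                     int lda, int ldb) {
    // Assumed fragment shapes and layouts (not taken from the original code).
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, lda);
    wmma::load_matrix_sync(b_frag, b, ldb);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    if (threadIdx.x == 0) {
        // sizeof returns size_t; cast so each argument is an int, matching %d.
        printf("%d %d\n",
               static_cast<int>(sizeof(a_frag.x[0])),
               static_cast<int>(sizeof(a_frag.x[1])));
    }
}

int main() {
    half *a = nullptr, *b = nullptr;
    cudaMalloc(&a, 16 * 16 * sizeof(half));
    cudaMalloc(&b, 16 * 16 * sizeof(half));
    cudaMemset(a, 0, 16 * 16 * sizeof(half));  // all-zero bits are 0.0 in fp16
    cudaMemset(b, 0, 16 * 16 * sizeof(half));

    print_fragment_sizes<<<1, 32>>>(a, b, 16, 16);  // exactly one warp
    cudaDeviceSynchronize();  // also flushes the device printf buffer

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Built with nvcc for sm_70 or newer, this should print 2 2. Casting to unsigned long long and printing with %llu would be an equivalent way to keep the format and the argument widths in agreement.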

My code runs in the following environment; the OS is Ubuntu 20.04.

    +---------------------------------------------------------------------------------------+
    | NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
    |-----------------------------------------+----------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
    |                                         |                      |               MIG M. |
    |=========================================+======================+======================|
    |   0  NVIDIA A800 80GB PCIe          Off | 00000000:03:00.0 Off |                    0 |
    | N/A   43C    P0              68W / 300W |   2768MiB / 81920MiB |      0%      Default |
    |                                         |                      |             Disabled |
    +-----------------------------------------+----------------------+----------------------+

    +---------------------------------------------------------------------------------------+
    | Processes:                                                                            |
    |  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
    |        ID   ID                                                             Usage      |
    |=======================================================================================|
    |  No running processes found                                                           |
    +---------------------------------------------------------------------------------------+

My compile command is:

    nvcc tensor_core.cu -std=c++11 -lcublas -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80 -res-usage -lcudart -lineinfo -Xcompiler -fopenmp
