Open zhdllwyc opened 6 days ago
Thank you for the issue. We're looking into it and will get back to you as soon as we have something to share.
That still looks suspiciously like DGE. What are the results if you add --disable-internal-io-dge here:
nki_matmul_basic_jit = nki_jit(nki_matmul_tiled_, additional_compile_opt="--disable-internal-io-dge")
nki_matmul_basic_jit(lhs_big.T, rhs_big, output_big)
I compiled the kernel and inspected the IR output, and I do not see any additional tensors that would cause this behavior. However, when I look at the NEFF, I see a number of GPSIMD instructions when DGE is on and none when it is off. As far as I can tell, that is the most likely culprit.
We took a closer look at the issue, and we suspect there is a bug with the profiler or compiler. We are investigating this and will let you know once we have a conclusion.
That still looks suspiciously like DGE. What are the results if you add --disable-internal-io-dge here:
nki_matmul_basic_jit = nki_jit(nki_matmul_tiled_, additional_compile_opt="--disable-internal-io-dge")
nki_matmul_basic_jit(lhs_big.T, rhs_big, output_big)
Hi, thank you for looking into this issue.
When I do this, I receive the following error message: __init__() got an unexpected keyword argument 'additional_compile_opt'
I believe I do not have access to the compiler that recognizes this argument.
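A quick way to confirm this before constructing the kernel is to check whether the installed nki_jit accepts the keyword at all. Below is a minimal sketch using only the standard library. OldJit and NewJit are hypothetical stand-ins for two versions of the decorator class (they are not the real nki_jit), and accepts_keyword is a helper name introduced here for illustration.

```python
import inspect


def accepts_keyword(callable_obj, name):
    """Return True if callable_obj can be called with the keyword `name`."""
    try:
        params = inspect.signature(callable_obj).parameters
    except (TypeError, ValueError):
        # Signature cannot be introspected; treat the keyword as unsupported.
        return False
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for an older and a newer decorator class.
class OldJit:
    def __init__(self, func):
        self.func = func


class NewJit:
    def __init__(self, func, additional_compile_opt=""):
        self.func = func
        self.additional_compile_opt = additional_compile_opt


print(accepts_keyword(OldJit, "additional_compile_opt"))
print(accepts_keyword(NewJit, "additional_compile_opt"))
```

Passing the real nki_jit in place of the stand-ins would show whether the installed package exposes the option, assuming its constructor signature can be introspected.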
We observe non-uniform SBUF utilization in neuron-profile on both trn1.32xlarge and trn1.2xlarge. I followed the NKI tutorial and tried to run a matrix multiplication by breaking the matrices into tiles. In the source code, we load the left-hand-side and right-hand-side matrices into SBUF tile by tile, as lhsT_tile and rhs_tile. In the NKI architecture guide, Figure 28 shows that a tensor should span tensor.shape[0] SBUF partitions. Because we explicitly specify the K dimension (K=128) as the partition dimension (see source code below), we expect lhsT_tile and rhs_tile to occupy the SBUF partitions uniformly. However, that is not what I observed after profiling. Here is a snapshot of the SBUF partition utilization I observed:
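To make the expected layout concrete, the tiling scheme described above can be sketched in plain NumPy. This is only an illustration of the indexing, not the NKI kernel from this issue: TILE_M and TILE_N are assumed values, matmul_tiled is a hypothetical name, and NumPy says nothing about how SBUF is actually allocated. The point is that every lhsT_tile and rhs_tile has shape[0] equal to 128, which is why uniform use of 128 partitions is expected.

```python
import numpy as np

TILE_K = 128  # contraction (partition) dimension, K=128 as in the issue
TILE_M = 128  # assumed tile size along M, for illustration only
TILE_N = 512  # assumed tile size along N, for illustration only


def matmul_tiled(lhsT, rhs):
    """Compute lhsT.T @ rhs tile by tile, with K as the leading tile axis."""
    K, M = lhsT.shape
    K_rhs, N = rhs.shape
    assert K == K_rhs, "contraction dimensions must match"
    assert K % TILE_K == 0 and M % TILE_M == 0 and N % TILE_N == 0

    out = np.zeros((M, N), dtype=np.float32)
    for m in range(0, M, TILE_M):
        for n in range(0, N, TILE_N):
            acc = np.zeros((TILE_M, TILE_N), dtype=np.float32)
            for k in range(0, K, TILE_K):
                # Both tiles have shape[0] == TILE_K == 128.
                lhsT_tile = lhsT[k:k + TILE_K, m:m + TILE_M]
                rhs_tile = rhs[k:k + TILE_K, n:n + TILE_N]
                acc += lhsT_tile.T @ rhs_tile
            out[m:m + TILE_M, n:n + TILE_N] = acc
    return out


rng = np.random.default_rng(0)
lhsT = rng.standard_normal((256, 128)).astype(np.float32)
rhs = rng.standard_normal((256, 512)).astype(np.float32)
result = matmul_tiled(lhsT, rhs)
print(result.shape)
```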
Below is my source code:
My pip freeze is:
My neuron-profile version is:
When profiling, I output the profile result of the second iteration: