Open nandeeka opened 1 month ago
Thanks @nandeeka for filing the issue. We will take a look.
Hi Nandeeka, I’m trying to reproduce this. If you have it, could you also provide the contents of the compiler log?
Hi Jonathan, where do I find the compiler log? Following the instructions here, I tried printing it to the console with:
export NEURON_RT_LOG_LOCATION=console
export NEURON_RT_LOG_LEVEL=INFO
But this does not seem to have done anything. Thanks!
Hi @nandeeka, could you try adding the argument additional_compile_opt="--verbose debug" to the baremetal decorator?
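As a rough sketch of the suggestion above, the flag can be passed as a keyword argument when applying the decorator. This assumes the decorator is reachable as `nki.baremetal` and accepts keyword arguments in the form `baremetal(**kwargs)(kernel)`; the helper name `with_verbose_compile` and the guarded import are illustrative only and not part of the thread.

```python
# Assumption: neuronxcc may not be installed where this is read,
# so the import is guarded and the helper falls back to a no-op.
try:
    import neuronxcc.nki as nki
except ImportError:
    nki = None

# The only option taken from the thread: ask the compiler for verbose debug output.
DEBUG_COMPILE_KWARGS = {"additional_compile_opt": "--verbose debug"}


def with_verbose_compile(kernel):
    """Hypothetical helper: wrap `kernel` with the baremetal decorator plus the verbose flag."""
    if nki is None:
        # Without the Neuron SDK there is nothing to wrap; return the kernel unchanged.
        return kernel
    # Assumed call form: baremetal(**kwargs) returns a decorator.
    return nki.baremetal(**DEBUG_COMPILE_KWARGS)(kernel)
```

With the flag set, the compiler log should appear in the console output of the run, which is where the error quoted in the next reply came from.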
This worked. It looks like the error was:
2024-10-05T22:49:01Z ERROR 3808 [job.WalrusDriver.0]: Backend exited with code -6 and stderr: No existing axis k2 found in instruction I-33's parent list
walrus_driver: /local/p4clients/pkgbuild-const/workspace/src/KaenaCompiler/neuronxcc/walrus/ir/lib/IR/BasicBlockHolder.cpp:150: bir::LoopAxis* bir::BasicBlockHolder::findAxis(const string&, bir::Instruction*): Assertion `false && "No existing axis found"' failed.
After inspecting all instructions involving k2, I figured out which one was causing the problem and fixed it. My remaining question: is there any way for me to figure out which instruction was instruction I-33? As kernels get bigger, manually inspecting all relevant instructions becomes more and more challenging.
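Until better tooling exists, one small aid is to pull the axis name and instruction ID out of the log automatically. The sketch below is a minimal example, assuming the error text keeps the exact wording shown above; the function name `parse_axis_error` is hypothetical.

```python
import re

# Matches the backend message quoted above, e.g.
# "No existing axis k2 found in instruction I-33's parent list".
AXIS_ERROR = re.compile(
    r"No existing axis (?P<axis>\w+) found in instruction (?P<instr>I-(?P<index>\d+))'s parent list"
)


def parse_axis_error(log_text):
    """Return the axis name, instruction ID and numeric index, or None if absent."""
    match = AXIS_ERROR.search(log_text)
    if match is None:
        return None
    return {
        "axis": match.group("axis"),
        "instruction": match.group("instr"),
        "index": int(match.group("index")),
    }


log_line = (
    "2024-10-05T22:49:01Z ERROR 3808 [job.WalrusDriver.0]: Backend exited with code -6 "
    "and stderr: No existing axis k2 found in instruction I-33's parent list"
)
print(parse_axis_error(log_line))
```

This only narrows the search to the named axis and instruction number; it does not map the instruction back to a source line.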
This actually works with the simulator as is, so we will need to look further into why it is correct at the NKI insertion point but incorrect in the backend:
updated code:
def test_lora(self):
    K, M, N, R = (4096, 4096, 2048, 8)
    K0 = 128
    M0 = 128
    N0 = 512
    M1 = 4
    N1 = 4
    K1 = 8
    K2 = K // (K1 * K0)
    M2 = M // (M1 * M0)
    N2 = N // (N1 * N0)
    assert K2 * K1 * K0 == K
    assert M2 * M1 * M0 == M
    assert N2 * N1 * N0 == N
    PW = np.random.random_sample([M2, K2, M1, K0, K1, M0]).astype(np.float16)
    I = np.random.random_sample([K, N]).astype(np.float16)
    A = np.random.random_sample([K, R]).astype(np.float16)
    SB = np.random.random_sample([R, K]).astype(np.float16)
    O = np.ndarray(shape=[M, N], dtype=np.float16)
    nki.simulate_kernel(lora, I, PW, A, SB, O, K2, K1, K0, M2, M1, M0, N2, N1, N0, R)
    print(O[0,0])
    return I, PW, A, SB, O
output:
4890.0
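For reference, the three-level tiling arithmetic in the test above can be checked on its own. This is a minimal sketch using the same sizes; the helper name `outer_trip_count` is hypothetical and simply wraps the integer division and the divisibility assertion from the test.

```python
def outer_trip_count(total, middle, inner):
    """Number of outermost tiles when `total` is split as outer * middle * inner."""
    tile = middle * inner
    if total % tile != 0:
        raise ValueError(f"{total} is not divisible by {middle} * {inner} = {tile}")
    return total // tile


# Same dimensions and tile factors as the test above.
K2 = outer_trip_count(4096, 8, 128)   # K // (K1 * K0)
M2 = outer_trip_count(4096, 4, 128)   # M // (M1 * M0)
N2 = outer_trip_count(2048, 4, 512)   # N // (N1 * N0)
print(K2, M2, N2)  # 4 8 1
```

Note that N2 is 1 for these sizes, so the outermost N loop runs a single iteration.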
I-33 would be the 33rd instruction emitted by the kernel.
As for a better way to see which instruction maps to which line of code, we should be able to correlate it back to the debug info for the kernel. I am adding this to our backlog so that the error makes clearer what went wrong.
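To illustrate the idea of correlating an instruction number with a source line, here is a purely hypothetical, plain-Python sketch. It is not the NKI or compiler API: `InstructionLog`, `emit` and `locate` are made-up names, and the 1-based numbering is an assumption chosen to match "I-33 is the 33rd instruction".

```python
import inspect


class InstructionLog:
    """Hypothetical recorder: remembers the source line that emitted each instruction."""

    def __init__(self):
        self.records = []

    def emit(self, opcode):
        # Capture the caller's file and line number at emission time.
        caller = inspect.currentframe().f_back
        self.records.append((opcode, caller.f_code.co_filename, caller.f_lineno))
        return len(self.records)  # assumed 1-based instruction number

    def locate(self, index):
        # Translate an instruction number back to where it was emitted.
        if not 1 <= index <= len(self.records):
            raise IndexError(f"no instruction I-{index}")
        opcode, filename, lineno = self.records[index - 1]
        return f"I-{index}: {opcode} at {filename}:{lineno}"


log = InstructionLog()
log.emit("load")
log.emit("matmul")
print(log.locate(2))
```

A real implementation would rely on the debug info the compiler already tracks rather than on call-site inspection.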
I am trying to run what I think should be a kernel. However, I am getting the opaque error message [F134] neuronx-cc terminated abnormally. What is the error and/or how do I go about debugging an error message like this?
The full kernel is:
The full error message is:
My pip freeze is: