Hello, can you share the shape of x?
The shape of `x` is batch × 10 × 3000 × 64.
The issue occurs when executing `loss.backward()`. The criterion is defined as `nn.BCELoss().to(device)`.
Can you share the hardware you use? The code has only been tested on A100/A800.
NVIDIA GeForce RTX 3090
I think this may be related to the size of the shared memory. As a temporary workaround, you can reduce BLOCK at lines 410 and 454, BLOCK_MODEL at line 413, and CBLOCK at line 457 in lightning_attn/ops/triton/lightning_attn2.py. Please note that Triton only supports block sizes that are multiples of 16, so you can set them to 16 or 32.
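For reference, a rough sketch of what those constants could look like after the change (hypothetical; the real file defines them inside the Triton kernels at the lines noted above, and the surrounding code is omitted here):

```python
# lightning_attn/ops/triton/lightning_attn2.py -- hypothetical sketch of the
# changed constants only. Smaller tiles need less shared memory per kernel
# launch, at some cost in throughput. All values must be multiples of 16.
BLOCK = 32        # lines 410 and 454
BLOCK_MODEL = 16  # line 413
CBLOCK = 16       # line 457
```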
Thank you. After making the adjustments, I am able to run two or more instances of `lightning_attn_func` successfully.
I'm glad this was helpful. Could you share the BLOCK size you used on the 3090? This information might be beneficial for future users.
I have set BLOCK=32, CBLOCK=16, and BLOCK_MODEL=16. Under this configuration, I can run at least 6 instances of `lightning_attn_func` without encountering any issues. I haven't attempted more yet.
Thank you for sharing this!
"In training, when I run lightning_attn_func two or more times, I encounter an exception with the content “triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 114688, Hardware limit: 101376. Reducing block sizes or
num_stages
may help.” The partial code snippet of my execution is as followsx = self.norm(self.attention(x,x,x,_build_slope_tensor(self.num_heads).to(x.device).to(torch.float32))) x = self.norm(self.ff(x))+x x = lightning_attn_func(x,x,x,_build_slope_tensor(self.num_heads).to(x.device).to(torch.float32))
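For completeness, here is a minimal sketch that mirrors the failing pattern above. The import paths and the (batch, heads, seq_len, head_dim) layout of `x` are assumptions inferred from this thread, not confirmed APIs:

```python
# Hypothetical reproduction sketch; import paths and dimension order are
# assumptions based on the discussion above, not a confirmed API.
import torch
from lightning_attn.ops import lightning_attn_func      # assumed import path
from lightning_attn.utils import _build_slope_tensor    # assumed import path

num_heads = 10
# Shape reported above: batch x 10 x 3000 x 64, assumed to be (b, h, n, d).
x = torch.randn(2, num_heads, 3000, 64, device="cuda", requires_grad=True)
s = _build_slope_tensor(num_heads).to(x.device).to(torch.float32)

# Calling lightning_attn_func two or more times is what triggers
# OutOfResources on a 3090 with the default block sizes; in the thread
# above, the error surfaces during loss.backward().
out = lightning_attn_func(x, x, x, s)
out = lightning_attn_func(out, out, out, s)
loss = out.mean()
loss.backward()
```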