OpenNLPLab / lightning-attention

Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models
MIT License

When running lightning_attn_func two or more times, an error occurred. #4

Closed wsleepybear closed 9 months ago

wsleepybear commented 9 months ago

"In training, when I run lightning_attn_func two or more times, I encounter an exception with the content “triton.runtime.autotuner.OutOfResources: out of resource: shared memory, Required: 114688, Hardware limit: 101376. Reducing block sizes or num_stages may help.” The partial code snippet of my execution is as follows x = self.norm(self.attention(x,x,x,_build_slope_tensor(self.num_heads).to(x.device).to(torch.float32))) x = self.norm(self.ff(x))+x x = lightning_attn_func(x,x,x,_build_slope_tensor(self.num_heads).to(x.device).to(torch.float32))

Doraemonzzz commented 9 months ago

Hello, can you share the shape of x?

wsleepybear commented 9 months ago

> Hello, can you share the shape of x?

The shape of x is batch × 10 × 3000 × 64.
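For reference, a minimal standalone sketch of the failing pattern with the reported shape (the import paths and the batch size are assumptions, not from the original report):

```python
import torch

# Import paths assumed from the repository layout
from lightning_attn.ops import lightning_attn_func
from lightning_attn.utils import _build_slope_tensor

device = torch.device("cuda")
b, h, n, d = 2, 10, 3000, 64  # batch (placeholder) x heads x seq_len x head_dim

x = torch.randn((b, h, n, d), dtype=torch.bfloat16, device=device, requires_grad=True)
s = _build_slope_tensor(h).to(device).to(torch.float32)

# A single call works; calling the kernel a second time triggered the
# "out of resource: shared memory" error on the reporter's GPU.
o = lightning_attn_func(x, x, x, s)
o = lightning_attn_func(o, o, o, s)
```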

wsleepybear commented 9 months ago

> Hello, can you share the shape of x?

The issue occurs when executing `loss.backward()`. The criterion is defined as `nn.BCELoss().to(device)`.
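To make the failure mode concrete, here is a hedged sketch continuing the snippet above (the sigmoid head and dummy targets are hypothetical stand-ins, since the full model is not shown):

```python
import torch
import torch.nn as nn

criterion = nn.BCELoss().to(device)  # as defined in the report

o = lightning_attn_func(x, x, x, s)               # forward pass succeeds
prob = torch.sigmoid(o.float().mean(dim=(1, 3)))  # hypothetical head: (b, n) probabilities
target = torch.rand_like(prob)                    # dummy targets for illustration
loss = criterion(prob, target)

# The shared-memory error surfaced here: backward launches its own Triton
# kernel with its own tile sizes (cf. lines 454/457 mentioned below).
loss.backward()
```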

Doraemonzzz commented 9 months ago

Can you share the hardware you use? The code has only been tested on A100/A800.

wsleepybear commented 9 months ago

> Can you share the hardware you use? The code has only been tested on A100/A800.

NVIDIA GeForce RTX 3090

Doraemonzzz commented 9 months ago

I think this may be related to the size of the shared memory. As a temporary workaround, you can reduce `BLOCK` at lines 410 and 454, `BLOCK_MODEL` at line 413, and `CBLOCK` at line 457 in `lightning_attn/ops/triton/lightning_attn2.py`. Please note that Triton only supports block sizes that are multiples of 16, so you can adjust them to 16 or 32.
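Roughly, the suggested edit amounts to shrinking these compile-time tile sizes (an illustrative sketch, not the verbatim file contents; the surrounding kernel-launch code is omitted):

```python
# lightning_attn/ops/triton/lightning_attn2.py (illustrative sketch)
#
# Why this helps: shared memory per threadblock scales with the tile sizes.
# The error reported 114688 bytes requested against a 101376-byte limit;
# 101376 bytes (99 KB) is the per-block cap on compute capability 8.6 GPUs
# such as the RTX 3090, whereas the A100 (8.0) allows up to 163 KB.

# Forward launch (around lines 410/413):
BLOCK = 32        # sequence tile; Triton block sizes must be multiples of 16
BLOCK_MODEL = 16  # tile along the feature dimension

# Backward launch (around lines 454/457):
BLOCK = 32
CBLOCK = 16       # inner chunk tile
```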

wsleepybear commented 9 months ago

Thank you. After making the adjustments, I can now run `lightning_attn_func` two or more times successfully.

Doraemonzzz commented 9 months ago

I'm glad this was helpful. Could you share the `BLOCK` size you used on the 3090? This information might be beneficial for future users.

wsleepybear commented 9 months ago

> I'm glad this was helpful. Could you share the `BLOCK` size you used on the 3090? This information might be beneficial for future users.

I have set `BLOCK=32`, `CBLOCK=16`, and `BLOCK_MODEL=16`. With this configuration, I can call `lightning_attn_func` at least six times without encountering any issues; I haven't attempted more.
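A quick sanity check for such a configuration, continuing the earlier sketch with the patched tile sizes in place:

```python
# With BLOCK=32, CBLOCK=16, BLOCK_MODEL=16 patched into lightning_attn2.py,
# repeated calls should stay within the 3090's shared-memory budget.
o = x
for _ in range(6):
    o = lightning_attn_func(o, o, o, s)
o.float().mean().backward()  # backward completes as well
```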

Doraemonzzz commented 9 months ago

Thank you for your generous sharing!