FrancescoSaverioZuppichini opened this issue 1 year ago
Your batch size is very small (1), and you might be CPU-bound. You should try increasing it:

```python
B = 64  # or lower if you get OOMs
x = torch.randn((B, 3, 224, 224), device="cuda").half()
```
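To check whether you are actually CPU-bound, one option is a short `torch.profiler` trace; a minimal sketch, not from this thread, where `model` and `x` are placeholders for the CLIP model and input batch:

```python
import torch
from torch.profiler import ProfilerActivity, profile

# `model` and `x` are placeholders for the CLIP model and input batch above.
# If total CUDA kernel time is far below CPU time in the table, the run is
# CPU-bound and a larger batch size should help.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        model(x)
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```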
This should give you some speedup. You can also improve the speed further by avoiding the transpose calls, with something like this:
```python
def forward(self, x, *args, **kwargs):
    # Input is sequence-first (N, B, C); switch to batch-first (B, N, C).
    x = x.permute(1, 0, 2)
    B, N, C = x.shape
    # (B, N, 3, num_heads, head_dim) is already the layout the
    # memory-efficient attention expects, so no transpose is needed.
    qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
    q, k, v = qkv.unbind(2)
    x = memory_efficient_attention_forward(q, k, v, op=None)
    # x = x.reshape(B, self.num_heads, N, C // self.num_heads).transpose(1, 2)
    x = x.reshape(B, N, C)
    x = self.proj(x)
    x = self.proj_drop(x)
    # Back to sequence-first (N, B, C) for the rest of the model.
    return x.permute(1, 0, 2)
```
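For context: `memory_efficient_attention_forward` takes inputs of shape `(batch, seq_len, num_heads, head_dim)`, which is exactly what the `reshape` above produces, so the extra `transpose` in the commented-out line is unnecessary. A minimal standalone sanity check of that layout; the shapes here are illustrative (a CLIP ViT-B block: 12 heads of dim 64, 257 tokens), not from the thread:

```python
import torch
from xformers.ops import memory_efficient_attention_forward

# Illustrative shapes: batch 2, 257 tokens, 12 heads of dim 64 (embed dim 768).
B, N, H, K = 2, 257, 12, 64
q = torch.randn(B, N, H, K, device="cuda", dtype=torch.half)
k, v = torch.randn_like(q), torch.randn_like(q)

out = memory_efficient_attention_forward(q, k, v, op=None)  # op=None: auto-dispatch
assert out.shape == (B, N, H, K)  # same layout back; reshape to (B, N, H * K) after
```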
@danthe3rd thanks a lot for the reply. I've updated the code, but I cannot see any speedup with `batch_size=64`:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f99d81e9460>
profile
  Median: 248.12 ms
  IQR:    0.32 ms (248.02 to 248.35)
  9 measurements, 1 runs per measurement, 1 thread
Memory used: 1432.09716796875 MB
CLIP xformers
<torch.utils.benchmark.utils.common.Measurement object at 0x7f99d8026640>
profile
  Median: 231.34 ms
  IQR:    1.04 ms (231.02 to 232.06)
  9 measurements, 1 runs per measurement, 1 thread
Memory used: 1432.09716796875 MB
```
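For reference, output in this format comes from `torch.utils.benchmark.Timer`; a minimal sketch of how such a measurement is produced, with `model` and `x` as placeholders for the CLIP model and input batch:

```python
import torch
import torch.utils.benchmark as benchmark

# `model` and `x` are placeholders for the CLIP model and input batch.
timer = benchmark.Timer(
    stmt="model(x)",
    globals={"model": model, "x": x},
    label="profile",
)
with torch.no_grad():
    print(timer.blocked_autorange(min_run_time=2))  # Median / IQR / #measurements
# max_memory_allocated() returns bytes; convert to MB as reported above.
print(f"Memory used: {torch.cuda.max_memory_allocated() / 2**20} MB")
```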
Can you report the output of this command:

```
python -m xformers.info
```
@danthe3rd

```
(dl) ➜ ~ python -m xformers.info
xFormers 0.0.16
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.flshattF: available
memory_efficient_attention.flshattB: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: available
memory_efficient_attention.tritonflashattB: available
swiglu.fused.p.cpp: available
is_triton_available: True
is_functorch_available: False
pytorch.version: 1.13.1+cu117
pytorch.cuda: available
gpu.compute_capability: 8.6
gpu.name: NVIDIA GeForce RTX 3090
build.info: available
build.cuda_version: 1107
build.python_version: 3.9.16
build.torch_version: 1.13.1+cu117
build.env.TORCH_CUDA_ARCH_LIST: 5.0+PTX 6.0 6.1 7.0 7.5 8.0 8.6
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: wheel-v0.0.16
source.privacy: open source
```
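Since both the cutlass and flash-attention kernels report as available, one way to see which backend actually runs on the 3090 is to force each op explicitly instead of auto-dispatching; a sketch assuming the op names exported by this xformers version (e.g. `MemoryEfficientAttentionFlashAttentionOp`):

```python
import torch
import xformers.ops as xops

# Illustrative shapes matching the benchmark: batch 64, 257 tokens,
# 12 heads of dim 64, fp16.
q = torch.randn(64, 257, 12, 64, device="cuda", dtype=torch.half)
k, v = torch.randn_like(q), torch.randn_like(q)

# Force each backend in turn; a NotImplementedError would mean that op
# does not support these inputs on this GPU.
for op in (xops.MemoryEfficientAttentionFlashAttentionOp,
           xops.MemoryEfficientAttentionCutlassOp):
    out = xops.memory_efficient_attention(q, k, v, op=op)
    print(op, out.shape)
```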
I ran your script on my machine (with an A100 GPU) and got a nice speedup:

```
torch.Size([64, 768])
<torch.utils.benchmark.utils.common.Measurement object at 0x7f2a13d94460>
profile
  Median: 160.57 ms
  IQR:    0.02 ms (160.55 to 160.58)
  13 measurements, 1 runs per measurement, 1 thread
Memory used: 1432.09716796875 MB
<torch.utils.benchmark.utils.common.Measurement object at 0x7f2a13c558e0>
profile
  Median: 101.79 ms
  IQR:    0.06 ms (101.77 to 101.83)
  20 measurements, 1 runs per measurement, 1 thread
Memory used: 1432.09716796875 MB
torch.Size([64, 768])
```
Maybe you can try with the latest xformers development version (`pip install --pre -U xformers`)?

NOTE: I modified the script slightly:
I've updated; I still don't see any real difference:
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f364f553bb0>
profile
  Median: 247.37 ms
  IQR:    0.72 ms (246.91 to 247.63)
  9 measurements, 1 runs per measurement, 1 thread
Memory used: 1432.09716796875 MB
CLIP xformers
<torch.utils.benchmark.utils.common.Measurement object at 0x7f364f5c1ee0>
profile
  Median: 229.60 ms
  IQR:    1.31 ms (229.01 to 230.33)
  9 measurements, 1 runs per measurement, 1 thread
Memory used: 1432.09716796875 MB
```
Memory is the same. Which `torch` version are you using?

This is so interesting ahahha, maybe `xformers` doesn't play nice with my 3090? Which driver are you running?
You still have some 7% speedup.

Memory might not differ much because you run in `no_grad` mode, and your sequences are not that long (257).
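To see the effect of `no_grad` on peak memory concretely, one can compare the two modes; a minimal sketch, with `model` and `x` again as placeholders:

```python
import torch

def peak_mem_mb(fn):
    """Reset the CUDA peak-memory counter, run fn, and return the peak in MB."""
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 2**20

# `model` and `x` are placeholders for the CLIP model and input batch.
with torch.no_grad():
    inference_peak = peak_mem_mb(lambda: model(x))  # no activations kept
train_peak = peak_mem_mb(lambda: model(x).sum().backward())  # activations stored for backward

print(f"no_grad: {inference_peak:.0f} MB, with grad: {train_peak:.0f} MB")
```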
> maybe xformers doesn't play nice with my 3090

Yes, it's possible. Different GPUs have different characteristics, and our kernels have been mostly optimized for V100/A100, as that's what we use internally for research.
> You still have some 7% speedup.

Still a win 🥳
> Memory might not differ much because you run in no_grad mode, and your sequences are not that long (257).

I tried using `memory_efficient_attention`; there is indeed some memory saving:

```
profile
  Median: 248.75 ms
  IQR:    0.69 ms (248.62 to 249.31)
  9 measurements, 1 runs per measurement, 1 thread
Memory used: 21231.14501953125 MB
CLIP xformers
<torch.utils.benchmark.utils.common.Measurement object at 0x7fde79d89d60>
profile
  Median: 232.57 ms
  IQR:    0.98 ms (232.18 to 233.17)
  9 measurements, 1 runs per measurement, 1 thread
Memory used: 18147.27001953125 MB
```
> Yes, it's possible. Different GPUs have different characteristics, and our kernels have been mostly optimized for V100/A100, as that's what we use internally for research.

Any resources on that? Moreover, is there an optimization guide, i.e., where it is best to use some attention ops compared to others? I am happy to learn more and to contribute with articles and blog posts.
❓ Questions and Help

Hi guys,

Thanks a lot for the amazing work. I am trying to use `xformers` on CLIP; following the `timm` tutorial, I've put together the following code. It outputs

So basically, no change. Am I doing something wrong?

Thanks a lot,
Fra