Fan-Yixuan opened this issue 11 months ago
Hi,
Which version of xformers are you using, and which GPU? Can you report the output of python -m xformers.info?
We had a bug in earlier versions of xFormers where enabling dropout could cause bad numerics (so I would try with dropout disabled just in case).
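For a quick check: dropout in memory_efficient_attention is controlled by the p argument, so the test is a one-argument change. A minimal sketch, where the shapes and dropout rate are placeholders rather than anything from this issue:

```python
import torch
import xformers.ops as xops

# Placeholder shapes: (batch, seq_len, num_heads, head_dim), fp16 on GPU
q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

out_drop = xops.memory_efficient_attention(q, k, v, p=0.1)  # dropout enabled (the suspect path)
out_ref = xops.memory_efficient_attention(q, k, v, p=0.0)   # dropout disabled, for comparison
```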
Also worth noting, you can replace the following:
x = x.unflatten(-1, (3, self.embed_dim)).unsqueeze(0).transpose(0, -2).squeeze(-2)
q, k, v = x[0], x[1], x[2]
q = q.reshape(1, length, self.num_heads, self.head_dim)
k = k.reshape(1, length, self.num_heads, self.head_dim)
v = v.reshape(1, length, self.num_heads, self.head_dim)
With something like this, which is going to be a bit more efficient for the backward pass:
x = x.reshape(1, length, 3, self.num_heads, self.head_dim)
q, k, v = xops.unbind(x, 2)
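As a standalone illustration (the shapes below are made up, not from this thread): the reshape-plus-unbind form keeps q, k and v as views into the fused QKV output instead of materializing separate copies, which is presumably where the backward-pass saving comes from:

```python
import torch
import xformers.ops as xops

length, num_heads, head_dim = 128, 8, 64

# Output of a fused QKV projection, shape (1, length, 3 * embed_dim)
x = torch.randn(1, length, 3 * num_heads * head_dim)

x = x.reshape(1, length, 3, num_heads, head_dim)
q, k, v = xops.unbind(x, 2)  # views into the same buffer, no copies

print(q.shape)  # torch.Size([1, 128, 8, 64]), ready for memory_efficient_attention
print(q.untyped_storage().data_ptr() == v.untyped_storage().data_ptr())  # True: shared storage
```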
Hi @danthe3rd, thanks for the comment. My env:
xFormers 0.0.23.dev703
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.decoderF: available
memory_efficient_attention.flshattF@v2.3.5-1-gce3e728: available
memory_efficient_attention.flshattB@v2.3.5-1-gce3e728: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: unavailable
memory_efficient_attention.tritonflashattB: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.1.1
pytorch.cuda: available
gpu.compute_capability: 7.5
gpu.name: NVIDIA GeForce RTX 2080 Ti
build.info: available
build.cuda_version: 1108
build.python_version: 3.9.18
build.torch_version: 2.1.1
build.env.TORCH_CUDA_ARCH_LIST: 5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: conda-main
build.nvcc_version: 11.8.89
source.privacy: open source
And disabling dropout really works! Thanks a lot. But how can I fix this so that I can use dropout again?
Hey! I am seeing a very similar problem where the loss starts going up with xformers whenever dropout > 0. Everything is good when dropout == 0.0. Additionally, things are also good when forcing the MemoryEfficientAttentionFlashAttentionOp dispatch, even when dropout > 0.
So I guess this is a bug in the CUTLASS kernel?
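If it does come down to the CUTLASS path, one possible workaround to sketch (assuming fp16/bf16 inputs and a GPU the flash kernels support) is to pin the op explicitly instead of letting the dispatcher choose:

```python
import torch
import xformers.ops as xops

q = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)

# Route both forward and backward through the FlashAttention kernels rather
# than CUTLASS; dropout is then applied inside the flash implementation.
out = xops.memory_efficient_attention(
    q, k, v,
    p=0.1,
    op=xops.MemoryEfficientAttentionFlashAttentionOp,
)
```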
❓ Questions and Help
I'm new to xformers. I need to use Transformer Encoders to train on a dataset with a very large variation in sample lengths. My original code was:
where mapper records which sample each token comes from, and layer is a PyTorch transformer encoder layer. I changed it into the xformers version:
Here we use this for the attention layer:
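As a rough illustration only (this is not the issue's actual code), a padding-free attention layer along these lines might look like the sketch below, assuming xformers' BlockDiagonalMask is built from the per-sample token counts that mapper encodes and that inputs are fp16/bf16 on GPU; the class and argument names are placeholders:

```python
import torch
import xformers.ops as xops
from xformers.ops import fmha


class VarLenSelfAttention(torch.nn.Module):
    # Illustrative only: attention over a "packed" batch of concatenated samples,
    # restricted to each sample by a block-diagonal attention bias (no padding).
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = torch.nn.Linear(embed_dim, 3 * embed_dim)
        self.proj = torch.nn.Linear(embed_dim, embed_dim)
        self.dropout = dropout

    def forward(self, x: torch.Tensor, seqlens: list) -> torch.Tensor:
        # x: (total_tokens, embed_dim), all samples concatenated along dim 0
        # seqlens: number of tokens in each sample, summing to total_tokens
        length, embed_dim = x.shape
        qkv = self.qkv(x).reshape(1, length, 3, self.num_heads, self.head_dim)
        q, k, v = xops.unbind(qkv, 2)
        attn_bias = fmha.BlockDiagonalMask.from_seqlens(seqlens)
        out = xops.memory_efficient_attention(
            q, k, v,
            attn_bias=attn_bias,
            p=self.dropout if self.training else 0.0,
        )
        return self.proj(out.reshape(length, embed_dim))
```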
Once training starts, the original training loss curve and the xformers-version training loss curve are shown below (the x axis of the figure is training steps; yellow for the original, green for xformers). Nothing else is changed.
My env: Python 3.9, PyTorch 2.1.1, CUDA 11.8, latest xformers. Similar phenomena are observed in other environments and on different GPUs, with both single-GPU training and DDP.
Please help me with this! Thanks a lot in advance! Padding is really wasting my training time!