Open · zen-d opened this issue 1 year ago
Hi, thanks for opening this issue. That's something we can work on (see https://github.com/facebookresearch/xformers/issues/683). What type of bias do you need? Is it a learnable bias?
@danthe3rd Thanks a lot for your prompt reply! #683 is highly related; in that thread I noticed you may be working on it (https://github.com/facebookresearch/xformers/issues/683#issuecomment-1458153308).
First, may I know when support for an attn_bias of type torch.Tensor with attn_bias.shape[-1] % 8 != 0 is scheduled? Is it planned for the near future?
Second, if you could also support a learnable attn_bias, it would become even more attractive.
The bias is currently learnable :) We just need to add this padding support. Hopefully we can get that out next week
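For reference, a minimal sketch of what "learnable" means here, assuming a CUDA device and the [batch, seq, heads, head_dim] input layout; the concrete shapes are illustrative, not from this thread:

import torch
import xformers.ops as xops

# Sketch: the bias is an ordinary tensor with requires_grad=True, and
# memory_efficient_attention backpropagates into it like any other input.
B, H, M, N, K = 2, 4, 8, 8, 64          # lengths kept as multiples of 8 here
q = torch.randn(B, M, H, K, device="cuda")
k = torch.randn(B, N, H, K, device="cuda")
v = torch.randn(B, N, H, K, device="cuda")
attn_bias = torch.zeros(B, H, M, N, device="cuda", requires_grad=True)

out = xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)
out.sum().backward()
print(attn_bias.grad.shape)             # gradient w.r.t. the bias, so it can be learned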
Wow, fantastic! Looking forward to seeing the padding support soon to relax the shape constraint.
@danthe3rd Thanks! Looks good, but I don't have free GPUs at the moment. I will try out the new feature ASAP.
@danthe3rd By following these hints to do padding and slicing, I'm able to run the model now (a sketch of what I did follows the quoted hint below). The memory burden is significantly alleviated. Thanks for your awesome work! I will continue to monitor the training process and the final accuracy.
HINT: To use an attn_bias with a sequence length that is not a multiple of 8, you need to ensure memory is aligned by slicing a bigger tensor. Example: use attn_bias = torch.zeros([1, 1, 5, 8])[:, :, :, :5] instead of torch.zeros([1, 1, 5, 5])
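For concreteness, a minimal sketch of this padding-and-slicing workaround; the head count, head dimension, CUDA device, and the xops alias are illustrative assumptions, not from the thread:

import torch
import xformers.ops as xops

# Allocate the bias over a key length rounded up to a multiple of 8, then slice
# back to the true length so the underlying storage stays 8-aligned (per the HINT above).
B, H, M, N, K = 1, 8, 5, 5, 64          # batch, heads, query len, key len, head dim
N_pad = (N + 7) // 8 * 8                # 5 -> 8

bias_storage = torch.zeros(B, H, M, N_pad, device="cuda")
attn_bias = bias_storage[:, :, :, :N]   # logical shape (B, H, M, N), aligned memory

q = torch.randn(B, M, H, K, device="cuda")
k = torch.randn(B, N, H, K, device="cuda")
v = torch.randn(B, N, H, K, device="cuda")
out = xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)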
Unfortunately, the training diverges midway (the loss becomes NaN), which did not happen in the original attention-based model. Could you share some insights about that? Thanks.
I don't have a specific idea for this, but you can detect more precisely where the NaN is coming from with anomaly detection:
torch.autograd.set_detect_anomaly(mode=True, check_nan=True)
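A minimal, self-contained example of what this reports; the sqrt call is just a stand-in for whatever op produces the NaN in the real model:

import torch

torch.autograd.set_detect_anomaly(mode=True, check_nan=True)

x = torch.tensor([-1.0], requires_grad=True)
y = torch.sqrt(x)   # forward already yields NaN; its gradient is NaN too
y.backward()        # raises a RuntimeError identifying the offending backward function
                    # and, thanks to anomaly mode, the traceback of the forward call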
Thanks for the suggestion. The attention implementation is the only difference in this controlled experiment, but I am not sure of the specific reason yet. I will dig deeper into the issue. :)
Also, it looks like this is running in f32? If not, you might want to try training with f32 to see whether it's related to numerical precision.
Yes, for safety, I am training with FP32 numerical precision now. (In my experience, AMP training seems more prone to NaN for Transformer-based models.)
I met the same issue, and I found that using fp16 can solve this problem.
Hi, regarding the padding-and-slicing hint above: I found that using this method may lower the inference speed (#853). Do you have any good way to avoid that?
❓ Questions and Help
Hi, I pass the attn_bias to xformers.ops.memory_efficient_attention, but I get the following error. In my case, attn_bias is indispensable and it is hard to always satisfy attn_bias.shape[-1] % 8 == 0, so how could I benefit from this repo? Thanks.
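For reference, the kind of call described in this report looks roughly like the sketch below (the shapes, the xops alias, and the CUDA device are my own illustration); the padding-and-slicing hint quoted earlier in the thread is the workaround.

import torch
import xformers.ops as xops

B, H, M, N, K = 1, 1, 5, 5, 64
q = torch.randn(B, M, H, K, device="cuda")
k = torch.randn(B, N, H, K, device="cuda")
v = torch.randn(B, N, H, K, device="cuda")
attn_bias = torch.zeros(B, H, M, N, device="cuda")   # attn_bias.shape[-1] == 5, not a multiple of 8

# Before the padding support landed, this call was rejected with the alignment error.
out = xops.memory_efficient_attention(q, k, v, attn_bias=attn_bias)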