Open shaharbar1 opened 9 months ago
Consider wrapping the call to self.attention in InterpretableMultiHeadAttention with

with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=True):

in order to improve speed and memory efficiency.
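For reference, a minimal sketch of how the suggested context manager is used. It only changes which fused backend PyTorch selects when the wrapped code actually dispatches to torch.nn.functional.scaled_dot_product_attention; the shapes, dtype, and device below are illustrative assumptions, not pytorch-forecasting's actual attention call:

```python
import torch
import torch.nn.functional as F

# Illustrative inputs only: (batch, heads, seq_len, head_dim) in fp16 on CUDA,
# since the flash / memory-efficient kernels require half-precision CUDA tensors.
q = torch.randn(8, 4, 16, 64, device="cuda", dtype=torch.float16)
k = torch.randn(8, 4, 16, 64, device="cuda", dtype=torch.float16)
v = torch.randn(8, 4, 16, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the flash and memory-efficient backends and disable the
# plain math fallback, as suggested above.
with torch.backends.cuda.sdp_kernel(
    enable_flash=True, enable_math=False, enable_mem_efficient=True
):
    out = F.scaled_dot_product_attention(q, k, v)
```

Note that on recent PyTorch releases this context manager is deprecated in favor of torch.nn.attention.sdpa_kernel.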
No, see https://github.com/pytorch/pytorch/issues/125674