TP linear with async_chunk=4 produces a result that does not match async_chunk=1 when the sequence length is longer than 16K; the results match when the sequence length is <= 8K.
Environment Information
- GCC version: 7.5.0
- Torch version: 1.13.1
- Linux system version: Ubuntu 18.04.6 LTS
- CUDA version: 11.6
- Torch's CUDA version (as per `torch.version.cuda`): 11.6
To Reproduce
Setting the `CUDA_LAUNCH_BLOCKING` environment variable makes the mismatch go away.
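The report does not include a reproduction script, so the following is only a minimal sketch of how the same command could be run with and without synchronous kernel launches. The helper name `run_with_launch_blocking` and the placeholder child command are assumptions for illustration; the child would need to be replaced with the actual script that compares async_chunk=4 against async_chunk=1.

```python
import os
import subprocess
import sys


def run_with_launch_blocking(cmd, blocking):
    """Run cmd in a child process with CUDA_LAUNCH_BLOCKING set to "1" or "0".

    Hypothetical helper, not part of any library. The variable is set in the
    child's environment so it is in place before CUDA is initialized there.
    """
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1" if blocking else "0"
    return subprocess.run(cmd, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Placeholder child command: it only echoes the variable it received.
    # Swap in the real reproduction script here.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])",
    ]
    for blocking in (False, True):
        result = run_with_launch_blocking(child, blocking)
        print(f"blocking={blocking}: child saw {result.stdout.strip()}")
```

If the mismatch appears only in the non-blocking run, that would be consistent with a missing synchronization between the asynchronous chunks, although it does not prove it.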
Expected Behavior
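The async_chunk=4 result should match the async_chunk=1 result at every sequence length, including those longer than 16K.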
Screenshots
No response
Additional Information
No response
Confirmation
[ ] I have reviewed and verified all the information provided in this report.