OpenBMB / BMTrain

Efficient Training (including pre-training and fine-tuning) for Big Models
Apache License 2.0

[BUG] Tensor Parallel: async_chunk=4 result does not match async_chunk=1 when the sequence length is longer than 16K #174

Open Achazwl opened 10 months ago

Achazwl commented 10 months ago

Description of the Bug

The tensor-parallel (TP) linear layer produces a different result with async_chunk=4 than with async_chunk=1 when the sequence length is longer than 16K. The two results match when the sequence length is <= 8K.

Environment Information

- GCC version: 7.5.0
- Torch version: 1.13.1
- Linux system version: Ubuntu 18.04.6 LTS
- CUDA version: 11.6
- Torch's CUDA version (as per `torch.version.cuda`): 11.6

To Reproduce

Setting the `CUDA_LAUNCH_BLOCKING=1` environment variable makes the mismatch disappear.
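For anyone trying to confirm the workaround, here is a minimal sketch of launching a script with `CUDA_LAUNCH_BLOCKING=1` set in the child process environment. The helper name `run_with_blocking_launches` is hypothetical and is not part of BMTrain. The variable has to be in place before the process initializes CUDA, which is why it is set on a fresh child process here rather than changed midway through a run.

```python
import os
import subprocess
import sys


def run_with_blocking_launches(command):
    """Run `command` (a list of args) with CUDA_LAUNCH_BLOCKING=1 in its environment.

    Hypothetical helper for illustration; not a BMTrain API.
    """
    env = dict(os.environ)              # copy, so the parent environment is untouched
    env["CUDA_LAUNCH_BLOCKING"] = "1"   # makes CUDA kernel launches synchronous
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for the real training/repro script: just echo the variable back.
    result = run_with_blocking_launches(
        [sys.executable, "-c", "import os; print(os.environ['CUDA_LAUNCH_BLOCKING'])"]
    )
    print(result.stdout.strip())
```

In practice the command list would be whatever launcher and script reproduce the mismatch. Those are not given in this report, so they are left out here.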

Expected Behavior

The async_chunk=4 result should match the async_chunk=1 result at every sequence length.
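The expected invariant can be illustrated with a CPU-only sketch in plain Python. This is not BMTrain's implementation and uses no BMTrain API: it assumes that chunking splits the input along the sequence dimension, and the names `linear`, `chunked_linear`, and `max_abs_diff` are hypothetical. It only shows the property the report expects to hold, namely that a linear layer applied chunk by chunk gives the same output as one applied to the whole input.

```python
def linear(x, w):
    """Compute x @ w^T for x of shape [seq, in_dim] and w of shape [out_dim, in_dim]."""
    return [[sum(a * b for a, b in zip(row, w_row)) for w_row in w] for row in x]


def chunked_linear(x, w, chunks):
    """Split x into at most `chunks` pieces along the sequence axis, apply
    `linear` to each piece, and concatenate the outputs in order."""
    size = max(1, (len(x) + chunks - 1) // chunks)  # rows per chunk, rounded up
    out = []
    for start in range(0, len(x), size):
        out.extend(linear(x[start:start + size], w))
    return out


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped 2-D lists."""
    return max(abs(p - q) for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b))


if __name__ == "__main__":
    x = [[float(i + j) for j in range(3)] for i in range(10)]  # [seq=10, in_dim=3]
    w = [[1.0, 0.5, -1.0], [2.0, 0.0, 1.0]]                    # [out_dim=2, in_dim=3]
    print(max_abs_diff(linear(x, w), chunked_linear(x, w, chunks=4)))
```

The same kind of comparison, run on the actual TP linear outputs for the two async_chunk settings, is the check that fails in this report for long sequences.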

Screenshots

No response

Additional Information

No response
