NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/index.html
Apache License 2.0

The precision is not aligned by the index #909

Open Amanda-Barbara opened 3 weeks ago

Amanda-Barbara commented 3 weeks ago

Hello, when I run the run_fused_attn_with_cp.py script with qkv_format=thd and context parallelism (CP) on 2x A800 GPUs with a 128*1024 sequence length, and compare against the result of the plain flash-attn2 algorithm, the gathered CP output does not match the flash-attn2 output at the same index; instead it matches at shifted indices, like this:

```
(Pdb) p torch.allclose(gathered_thd_out_with_cp[27950], out[27950], **tols)
True
(Pdb) p torch.allclose(gathered_thd_out_with_cp[27951], out[27951], **tols)
False
(Pdb) p torch.allclose(gathered_thd_out_with_cp[27951], out[83853], **tols)
True
(Pdb) p torch.allclose(gathered_thd_out_with_cp[55902], out[27951], **tols)
True
(Pdb) p torch.allclose(gathered_thd_out_with_cp[83853], out[55902], **tols)
True
(Pdb) p max(abs(gathered_thd_out_with_cp[27951] - out[83853]))
tensor(1.5259e-05, device='cuda:0', dtype=torch.float16)
(Pdb) p max(abs(gathered_thd_out_with_cp[27951] - out[83853]))
tensor(0.0001, device='cuda:0', dtype=torch.bfloat16, grad_fn=<...>)
```

I printed the parameters as follows:

```
current dtype:bf16
[INFO] world_size:2, rank:0 qkv_format:thd
seqlens_q:tensor([111804], dtype=torch.int32)
seqlens_q after second op:tensor([111804], dtype=torch.int32)
cu_seqlens_q:tensor([0, 111804]), cu_seqlens_q.device:cpu
q_input_shape:(tensor(111804), 40, 128)
kv_input_shape:(tensor(111804), 40, 128)
attn_output_shape:(tensor(111804), 5120)
cu_seqlens_q.device:cpu
current dtype:bf16
[INFO] world_size:2, rank:1 qkv_format:thd
seqlens_q:tensor([111804], dtype=torch.int32)
seqlens_q after second op:tensor([111804], dtype=torch.int32)
cu_seqlens_q:tensor([0, 111804]), cu_seqlens_q.device:cpu
q_input_shape:(tensor(111804), 40, 128)
kv_input_shape:(tensor(111804), 40, 128)
attn_output_shape:(tensor(111804), 5120)
cu_seqlens_q.device:cpu
seq_idx_q:tensor([0, 1, 2, ..., 111801, 111802, 111803], device='cuda:0', dtype=torch.int32), seq_idx_q.device:cuda:0
seq_idx_kv:tensor([0, 1, 2, ..., 111801, 111802, 111803], device='cuda:0', dtype=torch.int32), seq_idx_kv.device:cuda:0
seq_idx_q:tensor([27951, 27952, 27953, ..., 83850, 83851, 83852], device='cuda:1', dtype=torch.int32), seq_idx_q.device:cuda:1
seq_idx_kv:tensor([27951, 27952, 27953, ..., 83850, 83851, 83852], device='cuda:1', dtype=torch.int32), seq_idx_kv.device:cuda:1
q.shape:torch.Size([55902, 40, 128]), q.device:cuda:0
dout.shape:torch.Size([55902, 5120]), dout.device:cuda:0
k.shape:torch.Size([55902, 40, 128]), k.device:cuda:0
v.shape:torch.Size([55902, 40, 128]), v.device:cuda:0
q.shape:torch.Size([55902, 40, 128]), q.device:cuda:1
dout.shape:torch.Size([55902, 5120]), dout.device:cuda:1
k.shape:torch.Size([55902, 40, 128]), k.device:cuda:1
v.shape:torch.Size([55902, 40, 128]), v.device:cuda:1
cu_seqlens_q:tensor([0, 55902], device='cuda:0', dtype=torch.int32), cu_seqlens_q.device:cuda:0
cu_seqlens_q:tensor([0, 55902], device='cuda:1', dtype=torch.int32), cu_seqlens_q.device:cuda:1
cu_seqlens_kv:tensor([0, 55902], device='cuda:0', dtype=torch.int32), cu_seqlens_kv.device:cuda:0
cu_seqlens_kv:tensor([0, 55902], device='cuda:1', dtype=torch.int32), cu_seqlens_kv.device:cuda:1
```

I'm not sure what the problem is. How can I solve it? Thanks.
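For what it's worth, the index offsets in the mismatches above (27951 = 111804 / 4, 55902, 83853) are consistent with a causal load-balanced context-parallel layout in which the sequence is split into 2*CP equal chunks and rank r holds chunks r and 2*CP-1-r. Concatenating the per-rank outputs back-to-back then yields chunk order [0, 3, 1, 2] for CP=2 instead of [0, 1, 2, 3], which would explain exactly the observed shifts. A minimal sketch of undoing that permutation before comparing, assuming this chunking scheme (`reorder_cp_gather` is a hypothetical helper for illustration, not a TransformerEngine API):

```python
import torch

def reorder_cp_gather(gathered: torch.Tensor, cp_size: int) -> torch.Tensor:
    """Permute a rank-concatenated CP output back to original token order.

    Assumes the sequence was split into 2*cp_size equal chunks and that
    rank r contributed chunks r and (2*cp_size - 1 - r), in that order.
    """
    total = gathered.shape[0]
    chunk = total // (2 * cp_size)
    chunks = gathered.split(chunk, dim=0)
    # Chunk order as it appears in the gathered tensor, e.g. [0, 3, 1, 2] for CP=2.
    order = []
    for r in range(cp_size):
        order += [r, 2 * cp_size - 1 - r]
    # Invert the permutation: find where each original chunk i landed.
    inv = [0] * (2 * cp_size)
    for pos, i in enumerate(order):
        inv[i] = pos
    return torch.cat([chunks[p] for p in inv], dim=0)
```

Under this assumption, `torch.allclose(reorder_cp_gather(gathered_thd_out_with_cp, 2), out, **tols)` would be the like-for-like comparison against the flash-attn2 output.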