NVIDIA / TransformerEngine

A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Apache License 2.0
1.61k stars, 256 forks

The precision is not aligned by the index #909

Open Amanda-Barbara opened 3 weeks ago

Amanda-Barbara commented 3 weeks ago

Hello,

I ran the run_fused_attn_with_cp.py script with the thd qkv_format and CP on 2x A800 GPUs at a sequence length of 128*1024, and compared the output against a run that uses only the flash-attn2 algorithm. The gathered CP output does not match the flash-attn2 output at the same index, but it does match at a different index, like this:

(Pdb) p torch.allclose(gathered_thd_out_with_cp[27950], out[27950], **tols)
True
(Pdb) p torch.allclose(gathered_thd_out_with_cp[27951], out[27951], **tols)
False
(Pdb) p torch.allclose(gathered_thd_out_with_cp[27951], out[83853], **tols)
True
(Pdb) p torch.allclose(gathered_thd_out_with_cp[55902], out[27951], **tols)
True
(Pdb) p torch.allclose(gathered_thd_out_with_cp[83853], out[55902], **tols)
True
(Pdb) p max(abs(gathered_thd_out_with_cp[27951] - out[83853]))
tensor(1.5259e-05, device='cuda:0', dtype=torch.float16,
(Pdb) p max(abs(gathered_thd_out_with_cp[27951] - out[83853]))
tensor(0.0001, device='cuda:0', dtype=torch.bfloat16, grad_fn=)

I printed the parameters like this:

current dtype:bf16
[INFO] world_size:2, rank:0
qkv_format:thd
seqlens_q:tensor([111804], dtype=torch.int32)
seqlens_q after second op:tensor([111804], dtype=torch.int32)
cu_seqlens_q:tensor([ 0, 111804]), cu_seqlens_q.device:cpu
q_input_shape:(tensor(111804), 40, 128)
kv_input_shape:(tensor(111804), 40, 128)
attn_output_shape:(tensor(111804), 5120)
cu_seqlens_q.device:cpu
current dtype:bf16
[INFO] world_size:2, rank:1
qkv_format:thd
seqlens_q:tensor([111804], dtype=torch.int32)
seqlens_q after second op:tensor([111804], dtype=torch.int32)
cu_seqlens_q:tensor([ 0, 111804]), cu_seqlens_q.device:cpu
q_input_shape:(tensor(111804), 40, 128)
kv_input_shape:(tensor(111804), 40, 128)
attn_output_shape:(tensor(111804), 5120)
cu_seqlens_q.device:cpu
seq_idx_q:tensor([ 0, 1, 2, ..., 111801, 111802, 111803], device='cuda:0', dtype=torch.int32), seq_idx_q.device:cuda:0
seq_idx_kv:tensor([ 0, 1, 2, ..., 111801, 111802, 111803], device='cuda:0', dtype=torch.int32), seq_idx_kv.device:cuda:0
seq_idx_q:tensor([27951, 27952, 27953, ..., 83850, 83851, 83852], device='cuda:1', dtype=torch.int32), seq_idx_q.device:cuda:1
seq_idx_kv:tensor([27951, 27952, 27953, ..., 83850, 83851, 83852], device='cuda:1', dtype=torch.int32), seq_idx_kv.device:cuda:1
q.shape:torch.Size([55902, 40, 128]), q.device:cuda:0
dout.shape:torch.Size([55902, 5120]), dout.device:cuda:0
k.shape:torch.Size([55902, 40, 128]), k.device:cuda:0
v.shape:torch.Size([55902, 40, 128]), v.device:cuda:0
q.shape:torch.Size([55902, 40, 128]), q.device:cuda:1
dout.shape:torch.Size([55902, 5120]), dout.device:cuda:1
k.shape:torch.Size([55902, 40, 128]), k.device:cuda:1
v.shape:torch.Size([55902, 40, 128]), v.device:cuda:1
cu_seqlens_q:tensor([ 0, 55902], device='cuda:0', dtype=torch.int32), cu_seqlens_q.device:cuda:0
cu_seqlens_q:tensor([ 0, 55902], device='cuda:1', dtype=torch.int32), cu_seqlens_q.device:cuda:1
cu_seqlens_kv:tensor([ 0, 55902], device='cuda:0', dtype=torch.int32), cu_seqlens_kv.device:cuda:0
cu_seqlens_kv:tensor([ 0, 55902], device='cuda:1', dtype=torch.int32), cu_seqlens_kv.device:cuda:1

I'm not sure what the problem is. How can I solve it? Thanks.
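
For reference, the index pairs above follow a regular pattern. The sketch below is only one reading of those numbers and is not confirmed anywhere in this thread. It assumes the 111804 tokens are cut into 2 * world_size = 4 equal chunks of 27951 tokens, that rank r holds chunks r and (2 * world_size - 1 - r), and that the gathered output simply concatenates the per-rank outputs in rank order. The function name gathered_to_original_index is hypothetical and is not part of Transformer Engine.

```python
def gathered_to_original_index(total_len, cp_size):
    """Return perm such that gathered[i] corresponds to original token perm[i].

    Assumption (not confirmed by the source): the sequence is split into
    2 * cp_size equal chunks, rank r holds chunks r and (2 * cp_size - 1 - r),
    and the gathered tensor is the per-rank outputs concatenated in rank order.
    """
    num_chunks = 2 * cp_size
    if total_len % num_chunks != 0:
        raise ValueError("total_len must be divisible by 2 * cp_size")
    chunk = total_len // num_chunks

    perm = []
    for rank in range(cp_size):
        # Each rank contributes its "front" chunk, then its mirrored "back" chunk.
        for c in (rank, num_chunks - 1 - rank):
            perm.extend(range(c * chunk, (c + 1) * chunk))
    return perm


# Values taken from the logs above: 111804 tokens, world_size 2.
perm = gathered_to_original_index(111804, 2)
for i in (27950, 27951, 55902, 83853):
    print(f"gathered[{i}] <-> out[{perm[i]}]")
```

Under these assumptions the mapping reproduces every pair from the pdb session: 27950 maps to 27950, 27951 to 83853, 55902 to 27951, and 83853 to 55902. If that layout really is the cause, comparing gathered_thd_out_with_cp against out[torch.tensor(perm)] rather than against out directly would be one way to check whether the values agree everywhere.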