Closed bvandermoon closed 1 month ago
Notes
Update `attn_mask_type` so that it is not ignored. Related to issue 878. Note that this change currently only impacts workloads using `cudnn_flash_te` attention.

Testing
Trained 10 steps on GPUs with `cudnn_flash_te` enabled. Saw the same high-level output (step time, loss, etc.) before and after this change.
Note: The loss was suspicious since it dropped from 10.307 to 0 after the third step. But this happened both before and after this change, so it appears to be caused by something else.
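For context, the bug pattern being fixed here (a config value that is accepted but never forwarded to the backend) can be sketched as follows. All function names below are illustrative stand-ins, not the actual MaxText or TransformerEngine API:

```python
# Minimal sketch of the bug pattern: a wrapper accepts attn_mask_type
# but never forwards it, so the backend silently falls back to its
# default. Names are hypothetical, not the real cudnn_flash_te API.

def backend_attention(q, k, v, attn_mask_type="padding"):
    # Stand-in for the backend attention call; returns which mask
    # type it actually used so the difference is observable.
    return {"mask_type_used": attn_mask_type}

def attention_before_fix(q, k, v, attn_mask_type="causal"):
    # Bug: attn_mask_type is accepted here but silently ignored.
    return backend_attention(q, k, v)

def attention_after_fix(q, k, v, attn_mask_type="causal"):
    # Fix: forward the configured mask type to the backend.
    return backend_attention(q, k, v, attn_mask_type=attn_mask_type)

print(attention_before_fix(None, None, None, attn_mask_type="causal"))
# → {'mask_type_used': 'padding'}  (backend default, config ignored)
print(attention_after_fix(None, None, None, attn_mask_type="causal"))
# → {'mask_type_used': 'causal'}   (config honored)
```

Because both versions run without error and produce similar-looking training curves, a correctness check on the outputs (rather than step time or loss alone) is what actually distinguishes them.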
I assume the loss is dropping so quickly because you are using synthetic data (`dataset_type=synthetic`); you can use real data to measure the loss. Ideally we would test correctness with something like our golden logits test.
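A golden-logits check of the kind mentioned above could look roughly like this. This is a simplified sketch using numpy; the actual test in the repo, including its tolerances and reference values, may differ:

```python
import numpy as np

def check_golden_logits(logits, golden, atol=1e-2, rtol=1e-2):
    """Compare model logits against stored "golden" reference values.

    A hypothetical stand-in for a golden-logits correctness test:
    it passes only when the model's outputs match the saved
    reference within the given tolerances.
    """
    logits = np.asarray(logits, dtype=np.float64)
    golden = np.asarray(golden, dtype=np.float64)
    if logits.shape != golden.shape:
        return False
    return bool(np.allclose(logits, golden, atol=atol, rtol=rtol))

# Matches the reference within tolerance.
print(check_golden_logits([0.101, -1.204], [0.10, -1.20]))  # True
# A behavioral change (e.g. a mask type being ignored) shifts the logits.
print(check_golden_logits([0.90, -1.20], [0.10, -1.20]))    # False
```

Unlike comparing step time or loss curves, comparing logits element-wise against a trusted reference catches silent numerical changes such as an attention mask being applied differently.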
Thanks Matt. That was the issue with the loss. I also ran the golden logits test and it passed. Updated the description with the test.