NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

Non-deterministic codegen with some of the Python tests #3283

Open naoyam opened 14 hours ago

naoyam commented 14 hours ago

The codegen diff test has started failing consistently on some of the Python tests. For example:

https://dl.gitlab-master-pages.nvidia.com/pytorch/fuser-gh-mirror//nvfuser_github_ci/codegen_diff_p19742638_j118568737_1729887841468788908_codediff_896a28ad_b9203e1c_custom_command_20241025_131551.html

I ran NVFUSER_DUMP=cuda_to_file pytest -v -k 'test_prim_layer_norm_fwd' 10 times. The number of kernels generated in each run:

6
5
6
4
6
5
5
6
5
4
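The repeated-run experiment above could be scripted roughly as follows. This is only a sketch, not part of the nvFuser tooling: the helper names are hypothetical, and it assumes that NVFUSER_DUMP=cuda_to_file writes one .cu file per generated kernel into the current working directory, which should be verified against your build before relying on the counts.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def count_kernels_per_run(cmd, runs, pattern="*.cu"):
    """Run `cmd` `runs` times, each in a fresh directory, and count dumped files.

    Assumption: each generated kernel is dumped as one file matching `pattern`
    in the working directory of the process.
    """
    env = dict(os.environ, NVFUSER_DUMP="cuda_to_file")
    counts = []
    for _ in range(runs):
        with tempfile.TemporaryDirectory() as workdir:
            # check=False: a failing test run should still have its kernels counted.
            subprocess.run(cmd, cwd=workdir, env=env, check=False)
            counts.append(len(list(Path(workdir).glob(pattern))))
    return counts


def is_nondeterministic(counts):
    """True if the runs did not all produce the same number of kernels."""
    return len(set(counts)) > 1
```

For the case in this issue, cmd would be something like [sys.executable, "-m", "pytest", "-v", "-k", "test_prim_layer_norm_fwd", <absolute path to the tests>], with runs=10. Because each run uses a fresh directory, files left over from an earlier run cannot inflate a later count.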

CC: @jacobhinkle, @xwang233
Related: #3260, #3280, #3256

jacobhinkle commented 11 hours ago

Following @rdspring1's suggestion, I tried inserting return test_fn at this line: https://github.com/NVIDIA/Fuser/blob/a18dbd292251bf04ef45b05fa39a945841ef9cd3/tests/python/utils.py#L346. This seems to avoid the issue, indicating that this is a serde issue or maybe an issue just with this decorator. I did not need to use DEBUG_SERDE=true or delete /tmp/nvfuser_kernel_db for this.
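To illustrate why inserting return test_fn isolates the problem, here is a minimal sketch of a decorator with such an early return. It is not the code in tests/python/utils.py; with_extra_check, BYPASS, and calls are hypothetical names, and the extra work is just a placeholder for whatever the real decorator does.

```python
import functools

BYPASS = True  # hypothetical switch; equivalent to inserting `return test_fn`

calls = []  # records each time the wrapper's extra work runs


def with_extra_check(test_fn):
    """Hypothetical test decorator that adds extra work around a test."""
    if BYPASS:
        # Early return: the test is handed back untouched, so none of the
        # wrapper logic below is ever attached to it.
        return test_fn

    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        result = test_fn(*args, **kwargs)
        # Placeholder for the extra step (for example, a serialize and
        # deserialize round trip) that the bypass is meant to rule out.
        calls.append(test_fn.__name__)
        return result

    return wrapper


def sample_test():
    return "ran"


decorated = with_extra_check(sample_test)
```

With the bypass in place, decorated is the very same function object as sample_test. If the symptom disappears in that configuration, the wrapper's extra work is the likely source.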

naoyam commented 11 hours ago

> Following @rdspring1's suggestion, I tried inserting return test_fn at this line: https://github.com/NVIDIA/Fuser/blob/a18dbd292251bf04ef45b05fa39a945841ef9cd3/tests/python/utils.py#L346. This seems to avoid the issue, indicating that this is a serde issue or maybe an issue just with this decorator. I did not need to use DEBUG_SERDE=true or delete /tmp/nvfuser_kernel_db for this.

I tried the same test 100 times with that change. No difference detected!