Open walidgeuttala opened 6 months ago
The non-determinism seems to come from GINConv, which calls a cuSPARSE function underneath. To confirm that, you could replace line 96 and line 102 with
GINConv(mlp, "max", learn_eps=False)
Current guess is that the non-determinism comes from cuSPARSE's "CUSPARSE_SPMM_CSR_ALG2". We might need to switch to ALG3 to get deterministic results. Will update.
@jermainewang @frozenbugs FYI
Confirmed that ALG3 gives deterministic results. @jermainewang We need to expose this in DGL so that users can choose it when deterministic results are required. Here are the comments from the cusparseSpMM documentation:
CUSPARSE_SPMM_CSR_ALG2 Algorithm 2 for CSR/CSC sparse matrix format
CUSPARSE_SPMM_CSR_ALG3 Algorithm 3 for CSR/CSC sparse matrix format
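For readers wondering how two runs of the same kernel on the same inputs can disagree at all: a common cause, assumed here and not confirmed anywhere in this thread for ALG2 specifically, is that floating-point addition is not associative, so a parallel reduction whose accumulation order varies between runs can round differently. A minimal pure-Python sketch of that effect (the helper `sum_in_order` is hypothetical and has nothing to do with the cuSPARSE API):

```python
def sum_in_order(values, order):
    """Accumulate values[i] in the given index order, as a serial float sum."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


# Same four numbers, two accumulation orders.
values = [1e16, 1.0, -1e16, 1.0]

# 1e16 + 1.0 rounds back to 1e16 (doubles near 1e16 are spaced 2 apart),
# so the first 1.0 is lost before the large terms cancel.
forward = sum_in_order(values, [0, 1, 2, 3])

# Cancelling the large terms first keeps both 1.0 contributions.
reordered = sum_in_order(values, [0, 2, 1, 3])

print(forward, reordered)  # 1.0 2.0
```

If something like this is what happens inside the SpMM kernel, small differences in the aggregated features would be enough to make the loss differ between otherwise identical trials.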
🐛 Bug
CUDA gives non-deterministic results, while the CPU gives deterministic ones. I have fixed the environment and followed the PyTorch documentation for all steps, and I have made sure to load the same weights for the model. However, the loss values still differ between otherwise identical trials.
torch.use_deterministic_algorithms(True)
CUBLAS_WORKSPACE_CONFIG=:4096:8 python test_gpu.py
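For anyone reproducing this, the two settings above can also be applied from inside the script. The following is only a sketch: the helper names `set_cublas_workspace` and `enable_determinism` and the default seed are assumptions for illustration, not code from `test_gpu.py`. The environment variable should be set before the first CUDA operation. As discussed in this thread, these PyTorch-level switches do not change which cuSPARSE SpMM algorithm DGL uses, so GINConv with sum aggregation can still differ between runs.

```python
import os
import random


def set_cublas_workspace(config=":4096:8"):
    """Set CUBLAS_WORKSPACE_CONFIG in-process instead of on the command line."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = config
    return os.environ["CUBLAS_WORKSPACE_CONFIG"]


def enable_determinism(seed=0):
    """Hypothetical helper: seed the RNGs and ask PyTorch for deterministic ops."""
    set_cublas_workspace()
    random.seed(seed)
    import torch  # imported lazily so the env var is set before torch touches CUDA

    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    torch.use_deterministic_algorithms(True)
```

Calling `enable_determinism()` at the top of the script, before building the model or moving anything to the GPU, is roughly equivalent to the command line above plus the `use_deterministic_algorithms(True)` call.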