dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

GPU reproducibility issue #7241

Open walidgeuttala opened 6 months ago

walidgeuttala commented 6 months ago

🐛 Bug

CUDA gives non-deterministic results, while the CPU gives deterministic ones. I have fixed the environment and followed the PyTorch reproducibility documentation for all steps, and I have made sure to load the same model weights. However, the loss differs between identical trials.

  1. I load the same model weights.
  2. I don't use dropout.
  3. I tried avoiding `.to(device)` by saving the data in GPU format.
  4. I used `torch.use_deterministic_algorithms(True)`.
  5. I ran with `CUBLAS_WORKSPACE_CONFIG=:4096:8 python test_gpu.py`.
  6. I fixed the seed using this function:
    
```python
import random

import numpy as np
import torch
import dgl

def set_random_seed(seed=0):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    dgl.random.seed(seed)
    torch.use_deterministic_algorithms(True)
```
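The check this seeding function is meant to enable is: two runs with identical seeds should produce bit-identical outputs. A minimal stdlib-only analogue of that check (no PyTorch/DGL involved; `run_trial` is a hypothetical stand-in for one training run) looks like this:

```python
import random

def run_trial(seed=0):
    """Hypothetical stand-in for one training run: seeded pseudo-random 'losses'."""
    random.seed(seed)
    return [round(random.uniform(10.0, 500.0), 4) for _ in range(5)]

# Two trials with the same seed must match bit-for-bit. On CPU this holds;
# on GPU, non-deterministic kernels can break it even when all seeds match.
trial1 = run_trial(seed=0)
trial2 = run_trial(seed=0)
print(trial1 == trial2)  # True
```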

## To Reproduce

Steps to reproduce the behavior:

1. Run the command: `CUBLAS_WORKSPACE_CONFIG=:4096:8 python test_gpu.py`
2. The data and code are in this GitHub repository: [Github](https://github.com/walidgeuttala/test_gnn)


## Expected behavior
CUDA (two runs with the same weights and seed, yet different losses):

Weights loaded successfully.
Epoch 10: loss=450.9244
Epoch 20: loss=61.4404
Epoch 30: loss=28.8193
Epoch 40: loss=39.5702
Epoch 50: loss=21.3023
Epoch 60: loss=16.8473
Epoch 70: loss=13.7806
Epoch 80: loss=12.8241
Epoch 90: loss=12.3923
Epoch 100: loss=12.1758
test1 loss : 5.315503692626953

Weights loaded successfully.
Epoch 10: loss=558.8957
Epoch 20: loss=52.2925
Epoch 30: loss=27.9202
Epoch 40: loss=35.4178
Epoch 50: loss=20.8413
Epoch 60: loss=16.7942
Epoch 70: loss=13.7631
Epoch 80: loss=12.8431
Epoch 90: loss=12.4301
Epoch 100: loss=12.2197
test1 loss : 5.571553497314453

CPU (two runs, bitwise-identical results):

Weights loaded successfully.
Epoch 10: loss=335.7557
Epoch 20: loss=58.5180
Epoch 30: loss=30.8322
Epoch 40: loss=46.3416
Epoch 50: loss=25.4986
Epoch 60: loss=19.0068
Epoch 70: loss=16.7942
Epoch 80: loss=15.8844
Epoch 90: loss=15.4508
Epoch 100: loss=15.2272
test1 loss : 6.81760009765625

Weights loaded successfully.
Epoch 10: loss=335.7557
Epoch 20: loss=58.5180
Epoch 30: loss=30.8322
Epoch 40: loss=46.3416
Epoch 50: loss=25.4986
Epoch 60: loss=19.0068
Epoch 70: loss=16.7942
Epoch 80: loss=15.8844
Epoch 90: loss=15.4508
Epoch 100: loss=15.2272
test1 loss : 6.81760009765625

I expect identical, deterministic results across trials that use the same model weights in a fixed environment.
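Concretely, determinism means the per-epoch losses of the two runs match exactly. A small sketch using the values logged above shows the CUDA runs diverge from the first reported epoch, while the same comparison on the CPU logs would find no mismatch:

```python
# Per-epoch losses copied from the two CUDA runs logged above (epochs 10-30).
cuda_run1 = [450.9244, 61.4404, 28.8193]
cuda_run2 = [558.8957, 52.2925, 27.9202]

# Collect the epochs whose losses differ between the two runs.
mismatched_epochs = [
    10 * (i + 1)
    for i, (a, b) in enumerate(zip(cuda_run1, cuda_run2))
    if a != b
]
print(mismatched_epochs)  # [10, 20, 30]: every compared epoch differs
```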
## Environment

 - DGL Version:  2.1.0+cu118
 - Backend Library & Version: PyTorch 2.2.1+cu118
 - OS : Linux
 - How you installed DGL (`conda`, `pip`, source): conda
 - Build command you used (if compiling from source):
 - Python version:  3.8.19
 - CUDA/cuDNN version (if applicable): cuda_11.8
 - GPU models and configuration (e.g. V100): Quadro RTX 6000
 - Any other relevant information: I use HPC

## Additional context

TristonC commented 5 months ago

The non-determinism seems to come from `GINConv`, which calls a cuSPARSE function underneath. To confirm that, you could replace line 96 and line 102 with `GINConv(mlp, "max", learn_eps=False)`.

The current guess is that the non-determinism comes from cuSPARSE's `CUSPARSE_SPMM_CSR_ALG2`. We might need to switch to ALG3 to get deterministic results. Will update.
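For context on why an SpMM algorithm can be non-deterministic at all: parallel reductions may accumulate partial sums in a different order on each run, and floating-point addition is not associative. A tiny illustration with plain Python floats (not cuSPARSE itself):

```python
# Floating-point addition is order-sensitive: summing the same values
# in two different orders gives two different results.
values = [1e16, 1.0, -1e16, 1.0]

left_to_right = 0.0
for v in values:
    left_to_right += v      # the first 1.0 is absorbed into 1e16 and lost

reordered = 0.0
for v in [1e16, -1e16, 1.0, 1.0]:
    reordered += v          # cancellation happens first, so both 1.0s survive

print(left_to_right, reordered)  # 1.0 2.0 -- same inputs, different sums
```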

@jermainewang @frozenbugs FYI

TristonC commented 5 months ago

Confirmed that ALG3 gives deterministic results. @jermainewang We need to expose this in DGL so users can opt in when deterministic results are required. Here are the relevant entries from the cusparseSpMM documentation:

- `CUSPARSE_SPMM_CSR_ALG2`: Algorithm 2 for CSR/CSC sparse matrix format
- `CUSPARSE_SPMM_CSR_ALG3`: Algorithm 3 for CSR/CSC sparse matrix format