Error running on GPU other than 0th index

GuyAglionby commented 7 months ago

There's an error if you set the --gpus flag to use only one GPU, and that GPU is not index 0.

Replication steps:

Setup

conda create -n ultra-vanilla python=3.9
conda activate ultra-vanilla
pip install torch==2.1.0
pip install torch-scatter==2.1.2 torch-sparse==0.6.18 torch-geometric==2.4.0 -f https://data.pyg.org/whl/torch-2.1.0+cu121.html
pip install ninja easydict pyyaml

Command (here I'm only running evaluation, but the same error also occurs if epochs >= 1)

CUDA_LAUNCH_BLOCKING=1 python script/run.py -c config/transductive/inference.yaml --dataset CoDExSmall --epochs 0 --bpe null --gpus [1] --ckpt /my/path/to/ckpts/ultra_4g.pth

Logs:

13:14:31   Random seed: 1024
13:14:31   Config file: config/transductive/inference.yaml
13:14:31   {'checkpoint': '/home/ubuntu/ultra_vanilla/ckpts/ultra_4g.pth',
 'dataset': {'class': 'CoDExSmall', 'root': '~/git/ULTRA/kg-datasets/'},
 'model': {'class': 'Ultra',
           'entity_model': {'aggregate_func': 'sum',
                            'class': 'IndNBFNet',
                            'hidden_dims': [64, 64, 64, 64, 64, 64],
                            'input_dim': 64,
                            'layer_norm': True,
                            'message_func': 'distmult',
                            'short_cut': True},
           'relation_model': {'aggregate_func': 'sum',
                              'class': 'NBFNet',
                              'hidden_dims': [64, 64, 64, 64, 64, 64],
                              'input_dim': 64,
                              'layer_norm': True,
                              'message_func': 'distmult',
                              'short_cut': True}},
 'optimizer': {'class': 'AdamW', 'lr': 0.0005},
 'output_dir': '~/git/ULTRA/output',
 'task': {'adversarial_temperature': 1,
          'metric': ['mr', 'mrr', 'hits@1', 'hits@3', 'hits@10'],
          'name': 'TransductiveInference',
          'num_negative': 256,
          'strict_negative': True},
 'train': {'batch_per_epoch': None,
           'batch_size': 8,
           'gpus': [1],
           'log_interval': 100,
           'num_epoch': 0}}
13:14:31   CoDExSmall dataset
13:14:31   #train: 32888, #valid: 1827, #test: 1828
13:14:32   >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
13:14:32   Evaluate on valid
Load rspmm extension. This may take a while...
Traceback (most recent call last):
  File "/home/ubuntu/ultra_vanilla/script/run.py", line 297, in <module>
    test(cfg, model, valid_data, filtered_data=val_filtered_data, device=device, logger=logger)
  File "/opt/conda/envs/ultra-vanilla/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/ultra_vanilla/script/run.py", line 136, in test
    t_pred = model(test_data, t_batch)
  File "/opt/conda/envs/ultra-vanilla/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ultra-vanilla/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/ultra_vanilla/ultra/models.py", line 22, in forward
    relation_representations = self.relation_model(data.relation_graph, query=query_rels)
  File "/opt/conda/envs/ultra-vanilla/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ultra-vanilla/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/ultra_vanilla/ultra/models.py", line 99, in forward
    output = self.bellmanford(rel_graph, h_index=query)["node_feature"]  # (batch_size, num_nodes, hidden_dim）
  File "/home/ubuntu/ultra_vanilla/ultra/models.py", line 75, in bellmanford
    hidden = layer(layer_input, query, boundary, data.edge_index, data.edge_type, size, edge_weight)
  File "/opt/conda/envs/ultra-vanilla/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/opt/conda/envs/ultra-vanilla/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/ultra_vanilla/ultra/layers.py", line 85, in forward
    output = self.propagate(input=input, relation=relation, boundary=boundary, edge_index=edge_index,
  File "/home/ubuntu/ultra_vanilla/ultra/layers.py", line 112, in propagate
    out = self.message_and_aggregate(edge_index, **msg_aggr_kwargs)
  File "/home/ubuntu/ultra_vanilla/ultra/layers.py", line 194, in message_and_aggregate
    update = update + boundary
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

migalkin commented 7 months ago

Usually, you can select which physical GPU to use with the CUDA_VISIBLE_DEVICES environment variable.

Could you please try CUDA_VISIBLE_DEVICES=1 and gpus=[0]? For example,

CUDA_VISIBLE_DEVICES=1 python script/run.py -c config/transductive/inference.yaml --dataset CoDExSmall --epochs 0 --bpe null --gpus [0] --ckpt /my/path/to/ckpts/ultra_4g.pth

This works for me: on a machine with 2 GPUs, the second one is used.
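
To illustrate the remapping: with CUDA_VISIBLE_DEVICES=1, the process sees a single device and addresses it as index 0, which is why --gpus [0] ends up on the second physical GPU. Below is a minimal sketch of that mapping. The helper name `physical_gpu_index` is hypothetical (it is not part of ULTRA or PyTorch), and it assumes the variable holds comma-separated integer indices; CUDA also accepts GPU UUIDs there, which this sketch does not handle.

```python
import os
from typing import Optional


def physical_gpu_index(logical_index: int, visible_devices: Optional[str] = None) -> int:
    """Map the index a process uses (e.g. in --gpus) to a physical GPU index.

    Hypothetical helper for illustration only. Assumes CUDA_VISIBLE_DEVICES
    contains comma-separated integers, not GPU UUIDs.
    """
    if visible_devices is None:
        # Fall back to the current environment when no value is passed in.
        visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible_devices is None:
        # Variable unset: no masking, logical and physical indices coincide.
        return logical_index

    # The order of the list defines the logical numbering inside the process.
    visible = [int(tok) for tok in visible_devices.split(",") if tok.strip()]
    if not 0 <= logical_index < len(visible):
        raise IndexError(
            f"logical index {logical_index} is out of range for "
            f"CUDA_VISIBLE_DEVICES={visible_devices!r}"
        )
    return visible[logical_index]


if __name__ == "__main__":
    # CUDA_VISIBLE_DEVICES=1 with --gpus [0] targets physical GPU 1.
    print(physical_gpu_index(0, "1"))  # prints 1
```

With CUDA_VISIBLE_DEVICES=1, asking for logical index 1 raises an IndexError in this sketch, reflecting that only one device is visible to the process.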

GuyAglionby commented 7 months ago

Thanks for the quick reply! That works as expected, and is indeed how I'd normally do it -- I was trying to fix a bug on a fork and came across this. (The bug is my doing, not ULTRA's.)

Happy to close this and the PR if it's out of scope

migalkin commented 7 months ago

Yes, let's close them. These aren't really issues with the repo but rather standard CUDA setup parameters.

DeepGraphLearning / ULTRA

Error running on GPU other than 0th index #20