dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

random_walk() producing large integers, possibly memory addresses, on SLURM-based HPC environment #6946

Open rahit opened 6 months ago

rahit commented 6 months ago

Environment:

Description: I am encountering an issue with DGL's random_walk() function when running a script on the Graham cluster using SLURM (srun/sbatch). The function is supposed to return node IDs as part of its tensor output; however, when executed on Graham's compute nodes through SLURM, it returns very large integers that look like memory addresses. This behavior is not observed when the script is run locally on my machine or on Google Colab.

Reproduction Steps:

  1. The toy example below is taken from DGL's official documentation (https://docs.dgl.ai/en/1.1.x/generated/dgl.sampling.random_walk.html). Create a heterograph in DGL with the following code:

    from dgl import heterograph
    from dgl.sampling import random_walk
    
    g2 = heterograph({
        ('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
        ('user', 'view', 'item'): ([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 2, 1]),
        ('item', 'viewed-by', 'user'): ([0, 1, 1, 2, 2, 1], [0, 0, 1, 2, 3, 3])})
    
    print(random_walk(g2, [0, 1, 2, 0], metapath=['follow', 'view', 'viewed-by'] * 2))
  2. Execute the script using srun/sbatch on the Graham cluster (a sketch of an equivalent sbatch script follows this list).
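For reference, an equivalent sbatch submission script could look like the minimal sketch below. The environment activation path is a placeholder (any venv or module setup that provides dgl and torch will do), and the resource limits mirror the srun command shown under Actual Behavior.

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --time=3:00
    #SBATCH --mem=500M

    # Placeholder: activate whatever environment provides dgl and torch.
    source ~/test_py39/bin/activate

    python ./src/modspy_data/test_dgl.py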

Expected Behavior: The random_walk() function should return tensors of node IDs, as it does when run on a local machine or on the login node:

    (test_py39) [rahit@gra-login1 modspy-data]$ python src/modspy_data/test_dgl.py 
    (tensor([[0, 1, 1, 0, 1, 1, 3],
            [1, 3, 2, 2, 0, 0, 0],
            [2, 0, 1, 1, 2, 2, 3],
            [0, 1, 1, 3, 0, 1, 1]]), tensor([1, 1, 0, 1, 1, 0, 1]))

Actual Behavior: The function returns tensors containing very large integers, as shown below:

    (test_py39) [rahit@gra-login1 modspy-data]$ srun --ntasks=1 --cpus-per-task=1 --time=3:00 --mem=500 python ./src/modspy_data/test_dgl.py
    srun: job 14487954 queued and waiting for resources
    srun: job 14487954 has been allocated resources
    (tensor([[                  0,                   1,                   1,
                               0,                   1,                   1,
                               3],
            [7802034886504505161, 8028865303377573743,     563406901963619,
             2987123997513744384,                 225,           114293136,
                  47056890387456],
            [6866107348136439416, 8386095522570323780, 5795977025519175781,
             7022329414053225321,     110416352208244,                 161,
                       114293136],
            [     47056890387456, 5782977472600960876, 7802034886504505161,
             7237089388030031727, 7453010364987428197, 6485183463639119872,
                              97]]), tensor([1, 1, 0, 1, 1, 0, 1]))
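A quick way to probe whether these are genuine node IDs or stray memory contents (as suspected above) is to dump the raw bytes of a few of the suspicious int64 values. A minimal sketch, with the constants copied from the output above and using only the Python standard library:

    import struct

    def dump_int64(value: int) -> str:
        # Pack the value as a little-endian signed 64-bit integer and show
        # its raw bytes, to see whether the pattern looks like uninitialized
        # memory (e.g. ASCII text or pointer-like values) rather than a
        # small node ID.
        return struct.pack("<q", value).hex(" ")

    for v in (7802034886504505161, 47056890387456, 114293136):
        print(v, "->", dump_int64(v))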

Troubleshooting Done:

Questions/Support Needed:

Additional Information:

    (test_py39) [rahit@gra-login1 modspy-data]$ pip list
    Package            Version
    ------------------ --------------------
    certifi            2023.11.17
    charset-normalizer 3.3.2
    dgl                1.1.1+computecanada
    filelock           3.13.1+computecanada
    idna               3.6
    Jinja2             3.1.2+computecanada
    MarkupSafe         2.1.3+computecanada
    mpmath             1.3.0+computecanada
    networkx           3.2.1+computecanada
    numpy              1.25.2+computecanada
    pip                23.0+computecanada
    psutil             5.9.5+computecanada
    requests           2.31.0+computecanada
    scipy              1.11.2+computecanada
    setuptools         46.1.3
    sympy              1.12+computecanada
    torch              2.0.1+computecanada
    tqdm               4.66.1+computecanada
    typing_extensions  4.8.0+computecanada
    urllib3            2.1.0+computecanada
    wheel              0.34.2
github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you