dmlc / dgl

Python package built to ease deep learning on graphs, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

random_walk() producing large integers, possibly memory addresses, on SLURM-based HPC environment #6946

Open rahit opened 6 months ago

rahit commented 6 months ago

Environment:

Description: I am encountering an issue with DGL's random_walk() function when running a script on the Graham cluster using SLURM (srun/sbatch). The function is supposed to return node IDs as part of its tensor output; however, when executed on Graham's compute nodes through SLURM, it returns very large integers that look like memory addresses. This behavior is not observed when the script is run locally on my machine or on Google Colab.

Reproduction Steps:

  1. The toy example below is taken from DGL's official documentation (https://docs.dgl.ai/en/1.1.x/generated/dgl.sampling.random_walk.html). Create a heterograph in DGL with the following code:

    from dgl import heterograph
    from dgl.sampling import random_walk
    
    g2 = heterograph({
        ('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
        ('user', 'view', 'item'): ([0, 0, 1, 2, 3, 3], [0, 1, 1, 2, 2, 1]),
        ('item', 'viewed-by', 'user'): ([0, 1, 1, 2, 2, 1], [0, 0, 1, 2, 3, 3])})
    
    print(random_walk(g2, [0, 1, 2, 0], metapath=['follow', 'view', 'viewed-by'] * 2))
  2. Execute the script using srun/sbatch on the Graham cluster (a sketch of an equivalent sbatch script follows this list).
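For reference, an equivalent sbatch submission script could look like the minimal sketch below. The environment activation path is a placeholder (any venv or module setup that provides dgl and torch will do), and the resource limits mirror the srun command shown under Actual Behavior.

    #!/bin/bash
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --time=3:00
    #SBATCH --mem=500M

    # Placeholder: activate whatever environment provides dgl and torch.
    source ~/test_py39/bin/activate

    python ./src/modspy_data/test_dgl.py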

Expected Behavior: The random_walk() function should return tensors of node IDs, as it does when run on a local machine or on the login node:

    (test_py39) [rahit@gra-login1 modspy-data]$ python src/modspy_data/test_dgl.py 
    (tensor([[0, 1, 1, 0, 1, 1, 3],
            [1, 3, 2, 2, 0, 0, 0],
            [2, 0, 1, 1, 2, 2, 3],
            [0, 1, 1, 3, 0, 1, 1]]), tensor([1, 1, 0, 1, 1, 0, 1]))

Actual Behavior: The function returns tensors containing very large integers, as shown below:

    (test_py39) [rahit@gra-login1 modspy-data]$ srun --ntasks=1 --cpus-per-task=1 --time=3:00 --mem=500 python ./src/modspy_data/test_dgl.py
    srun: job 14487954 queued and waiting for resources
    srun: job 14487954 has been allocated resources
    (tensor([[                  0,                   1,                   1,
                               0,                   1,                   1,
                               3],
            [7802034886504505161, 8028865303377573743,     563406901963619,
             2987123997513744384,                 225,           114293136,
                  47056890387456],
            [6866107348136439416, 8386095522570323780, 5795977025519175781,
             7022329414053225321,     110416352208244,                 161,
                       114293136],
            [     47056890387456, 5782977472600960876, 7802034886504505161,
             7237089388030031727, 7453010364987428197, 6485183463639119872,
                              97]]), tensor([1, 1, 0, 1, 1, 0, 1]))
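A quick way to probe whether these are genuine node IDs or stray memory contents (as suspected above) is to dump the raw bytes of a few of the suspicious int64 values. A minimal sketch, with the constants copied from the output above and using only the Python standard library:

    import struct

    def dump_int64(value: int) -> str:
        # Pack the value as a little-endian signed 64-bit integer and show
        # its raw bytes, to see whether the pattern looks like uninitialized
        # memory (e.g. ASCII text or pointer-like values) rather than a
        # small node ID.
        return struct.pack("<q", value).hex(" ")

    for v in (7802034886504505161, 47056890387456, 114293136):
        print(v, "->", dump_int64(v))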

Troubleshooting Done:

Questions/Support Needed:

Additional Information:

    (test_py39) [rahit@gra-login1 modspy-data]$ pip list
    Package            Version
    ------------------ --------------------
    certifi            2023.11.17
    charset-normalizer 3.3.2
    dgl                1.1.1+computecanada
    filelock           3.13.1+computecanada
    idna               3.6
    Jinja2             3.1.2+computecanada
    MarkupSafe         2.1.3+computecanada
    mpmath             1.3.0+computecanada
    networkx           3.2.1+computecanada
    numpy              1.25.2+computecanada
    pip                23.0+computecanada
    psutil             5.9.5+computecanada
    requests           2.31.0+computecanada
    scipy              1.11.2+computecanada
    setuptools         46.1.3
    sympy              1.12+computecanada
    torch              2.0.1+computecanada
    tqdm               4.66.1+computecanada
    typing_extensions  4.8.0+computecanada
    urllib3            2.1.0+computecanada
    wheel              0.34.2
github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you