rahit opened this issue 6 months ago
This issue has been automatically marked as stale due to lack of activity. It will be closed if no further activity occurs. Thank you
Environment:
Description: I am encountering an issue with DGL's `random_walk()` function when running a script on the Graham cluster using SLURM (`srun`/`sbatch`). The function is supposed to return node IDs as part of its tensor output; however, when executed on Graham's compute nodes through SLURM, it returns very large integers that look like memory addresses. This behavior is not observed when the script is run locally on my machine or on Google Colab.

Reproduction Steps:
The code with the toy example is available in DGL's official documentation: https://docs.dgl.ai/en/1.1.x/generated/dgl.sampling.random_walk.html. Create a heterograph in DGL with the following code:
Execute the script using `srun`/`sbatch` on the Graham cluster.

Expected Behavior: The `random_walk()` function should return a tensor of node IDs, as it does when run on a local machine or on the login node.

Actual Behavior: The function returns tensors containing very large integers, as shown below:
Troubleshooting Done:
Questions/Support Needed:
Is there any known issue with DGL's `random_walk()` or other functions when used in a distributed, SLURM-based HPC environment? Could this be related to how memory is managed or accessed differently on the compute nodes via SLURM?
Are there any additional configurations or environment settings I should consider for running DGL on a distributed system like Graham?
Additional Information:
Modules Loaded:
PyPI packages installed in the virtual environment: