dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[Dist] test_mp_dataloader::test_dist_dataloader seg fault occasionally #7463

Closed Rhett-Ying closed 2 weeks ago

Rhett-Ying commented 2 weeks ago

🔨Work Item


Project tracker: https://github.com/orgs/dmlc/projects/2

Description

tests/distributed/test_mp_dataloader.py::test_dist_dataloader[False-False-True-0-1] Fatal Python error: Segmentation fault

Thread 0x00007f7526880700 (most recent call first):

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait

File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007f755357e800 (most recent call first):

Garbage-collecting

File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/networkx/classes/multigraph.py", line 422 in new_edge_key

File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/networkx/classes/multidigraph.py", line 497 in add_edge

File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/convert.py", line 1667 in _to_networkx_homogeneous

File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/convert.py", line 1864 in to_networkx

File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/citation_graph.py", line 249 in load

File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/dgl_dataset.py", line 190 in _load

File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/dgl_dataset.py", line 112 in __init__

File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/dgl_dataset.py", line 333 in __init__

File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/citation_graph.py", line 97 in __init__

File "/home/ubuntu/jenkins/workspace/dgl_master/tests/distributed/test_mp_dataloader.py", line 383 in test_dist_dataloader


Rhett-Ying commented 2 weeks ago

It often crashes in get_peak_mem().

tests/distributed/test_mp_dataloader.py::test_dataloader_heterograph[True-True-node-0-1] Fatal Python error: Segmentation fault

Thread 0x00007fb089237700 (most recent call first):

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait

File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner

File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap

Current thread 0x00007fb0b5f3d800 (most recent call first):

File "/opt/conda/envs/pytorch-ci/lib/python3.10/codecs.py", line 322 in decode

File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/python/dgl/partition.py", line 271 in get_peak_mem

File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/python/dgl/distributed/partition.py", line 921 in partition_graph

File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/tests/distributed/test_mp_dataloader.py", line 770 in check_dataloader

File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/tests/distributed/test_mp_dataloader.py", line 965 in test_dataloader_heterograph
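For readers unfamiliar with this kind of helper: the trace above ends in get_peak_mem while text read from a file is being decoded. On Linux, one common way to obtain a process's peak memory is to read the VmPeak field from /proc/self/status. The sketch below illustrates that approach only. The function name read_peak_mem_gb and its status_path parameter are hypothetical, and it is an assumption, not something confirmed in this thread, that get_peak_mem works the same way.

```python
import os


def read_peak_mem_gb(status_path="/proc/self/status"):
    """Return the peak virtual memory (VmPeak) of this process in GiB.

    Hypothetical helper for illustration; returns 0.0 when the status
    file is missing (e.g. on non-Linux systems) or has no VmPeak line.
    """
    if not os.path.exists(status_path):
        return 0.0
    # The context manager closes the file handle even if parsing fails.
    with open(status_path, "r") as f:
        for line in f:
            if line.startswith("VmPeak:"):
                # The line looks like "VmPeak:   123456 kB".
                kb = int(line.split()[1])
                return kb / (1024 ** 2)
    return 0.0


if __name__ == "__main__":
    print(f"Peak memory: {read_peak_mem_gb():.3f} GB")
```

Nothing in such a read should crash on its own, which is consistent with the crash site being incidental rather than the root cause.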

Rhett-Ying commented 2 weeks ago

This issue is probably caused by insufficient shared memory. I increased it from 4GB to 8GB and it works well. Let's see if this eliminates the glitch.
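To check whether a machine has enough shared memory before running the distributed tests, a minimal sketch along these lines may help. It assumes shared memory is mounted at /dev/shm, as is typical on Linux. The helper names shm_usage_gb and has_enough_shm are hypothetical and not part of DGL, and the 8 GB threshold is simply the value mentioned above.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_usage_gb(path="/dev/shm"):
    """Return (total, used, free) in GiB for the mount at `path`.

    Hypothetical helper; returns None if the directory does not exist.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return (usage.total / GIB, usage.used / GIB, usage.free / GIB)


def has_enough_shm(required_gb, path="/dev/shm"):
    """Return True if at least `required_gb` GiB are free at `path`."""
    usage = shm_usage_gb(path)
    if usage is None:
        return False
    _total, _used, free = usage
    return free >= required_gb


if __name__ == "__main__":
    usage = shm_usage_gb()
    if usage is None:
        print("/dev/shm not found; cannot check shared memory")
    else:
        total, used, free = usage
        print(f"/dev/shm: total={total:.2f} GB used={used:.2f} GB free={free:.2f} GB")
        print("enough for tests:", has_enough_shm(8))
```

Running a check like this at the start of a CI job would surface a shared-memory shortfall as a clear message rather than an intermittent segmentation fault.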