Closed Rhett-Ying closed 2 weeks ago
It often crashes in get_peak_mem():
tests/distributed/test_mp_dataloader.py::test_dataloader_heterograph[True-True-node-0-1] Fatal Python error: Segmentation fault
Thread 0x00007fb089237700 (most recent call first):
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait
File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007fb0b5f3d800 (most recent call first):
File "/opt/conda/envs/pytorch-ci/lib/python3.10/codecs.py", line 322 in decode
File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/python/dgl/partition.py", line 271 in get_peak_mem
File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/python/dgl/distributed/partition.py", line 921 in partition_graph
File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/tests/distributed/test_mp_dataloader.py", line 770 in check_dataloader
File "/home/ubuntu/jenkins/workspace/dgl_PR-7464/tests/distributed/test_mp_dataloader.py", line 965 in test_dataloader_heterograph
This issue is probably caused by insufficient shared memory. I increased it from 4GB to 8GB and it works well. Let's see if this eliminates the glitch.
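To confirm how much shared memory a CI worker actually has before running the distributed tests, a quick check along these lines may help. This is only a sketch: it assumes a Linux host where shared memory is mounted at /dev/shm, the helper names `shm_usage` and `has_enough_shm` are made up for illustration, and the 8 GiB threshold simply mirrors the value mentioned above. How the limit was raised on the CI machine is not stated in this thread.

```python
import os
import shutil

GIB = 1024 ** 3


def shm_usage(path="/dev/shm"):
    """Return (total, free) in bytes for the mount at `path`, or None if it is missing."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.total, usage.free


def has_enough_shm(required_bytes, path="/dev/shm"):
    """Return True if the mount at `path` has at least `required_bytes` of total capacity."""
    usage = shm_usage(path)
    return usage is not None and usage[0] >= required_bytes


if __name__ == "__main__":
    usage = shm_usage()
    if usage is None:
        print("/dev/shm not found (assumed Linux-only location)")
    else:
        total, free = usage
        print(f"/dev/shm total={total / GIB:.2f} GiB free={free / GIB:.2f} GiB")
        # 8 GiB is the size reported to work in this thread; adjust as needed.
        print("meets 8 GiB threshold:", has_enough_shm(8 * GIB))
```

Running this at the start of a CI job would at least show whether a segfault correlates with a small or nearly full /dev/shm, though it does not prove shared memory is the root cause.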
🔨Work Item
IMPORTANT:
Project tracker: https://github.com/orgs/dmlc/projects/2
Description
tests/distributed/test_mp_dataloader.py::test_dist_dataloader[False-False-True-0-1] Fatal Python error: Segmentation fault
Thread 0x00007f7526880700 (most recent call first):
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 324 in wait
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 607 in wait
File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
File "/opt/conda/envs/pytorch-ci/lib/python3.10/threading.py", line 973 in _bootstrap
Current thread 0x00007f755357e800 (most recent call first):
Garbage-collecting
File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/networkx/classes/multigraph.py", line 422 in new_edge_key
File "/opt/conda/envs/pytorch-ci/lib/python3.10/site-packages/networkx/classes/multidigraph.py", line 497 in add_edge
File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/convert.py", line 1667 in _to_networkx_homogeneous
File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/convert.py", line 1864 in to_networkx
File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/citation_graph.py", line 249 in load
File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/dgl_dataset.py", line 190 in _load
File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/dgl_dataset.py", line 112 in __init__
File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/dgl_dataset.py", line 333 in __init__
File "/home/ubuntu/jenkins/workspace/dgl_master/python/dgl/data/citation_graph.py", line 97 in __init__
File "/home/ubuntu/jenkins/workspace/dgl_master/tests/distributed/test_mp_dataloader.py", line 383 in test_dist_dataloader
Depending work items or issues