dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0

[GraphBolt][Bug] SEGV when preprocessing `OnDiskDataset` #7364

Open easypickings opened 2 months ago

easypickings commented 2 months ago

🐛 Bug

To Reproduce

When trying to construct an `OnDiskDataset` from the UK-Union graph, I get a segmentation fault during preprocessing. The error message is either `munmap_chunk(): invalid pointer` or `double free or corruption (out)`. I traced the error to the following line:

https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L97

Steps to reproduce the behavior:

Execute the following code:

import dgl.graphbolt as gb
dataset = gb.OnDiskDataset("path/to/dataset")

Expected behavior

Environment

Additional context

Rhett-Ying commented 2 months ago

Could you make sure the num_nodes specified exactly matches the node IDs read from the edge file? https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L92C21-L97

easypickings commented 2 months ago

Yes, the node IDs in the edge file are consecutive from 0 to num_nodes - 1. Also, I can construct the COO and CSC matrices using scipy.sparse.
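For readers who want to run the same sanity check, here is a minimal sketch of a range test on the edge file. It assumes the edges are stored as a `(2, num_edges)` NumPy array; `check_node_ids` is a hypothetical helper, not part of DGL. It only verifies that every ID lies in `[0, num_nodes)`, not that every ID is actually used.

```python
import numpy as np

def check_node_ids(edges: np.ndarray, num_nodes: int, chunk: int = 1 << 24) -> bool:
    """Return True if every node ID in a (2, E) edge array lies in [0, num_nodes).

    Hypothetical helper for illustration. The array is scanned in column
    chunks so that a memory-mapped file does not have to be loaded at once.
    """
    lo, hi = None, None
    for start in range(0, edges.shape[1], chunk):
        block = edges[:, start:start + chunk]
        bmin, bmax = int(block.min()), int(block.max())
        lo = bmin if lo is None else min(lo, bmin)
        hi = bmax if hi is None else max(hi, bmax)
    # An empty edge array yields False because there is nothing to validate.
    return lo is not None and lo >= 0 and hi < num_nodes
```

With a large file, the array can be opened lazily, for example `edges = np.load("path/to/edges.npy", mmap_mode="r")`, and then passed to `check_node_ids(edges, 131814559)`.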

Rhett-Ying commented 2 months ago

How large is your dataset? What are num_nodes and num_edges?

And could you try commenting out the line below? https://github.com/dmlc/dgl/blob/1547bd931d17cd1da144a6d38bb687c0f2c3b364/python/dgl/graphbolt/impl/ondisk_dataset.py#L96C13-L96C23

easypickings commented 2 months ago

num_nodes = 131814559 and num_edges = 5507679822. Commenting out that line does not help.

Rhett-Ying commented 2 months ago

Oh, it's a large graph with more than 5B edges. What instance are you running this on? How much RAM does it have?

easypickings commented 2 months ago

I'm running on an Aliyun server with over 700GB of RAM.

Rhett-Ying commented 2 months ago

@yxy235 could you try to reproduce this error on r6i.metal with a random graph?

yxy235 commented 2 months ago

OK

yxy235 commented 2 months ago

I tried to reproduce this, but I didn't get any errors with a random graph of the same size.

easypickings commented 2 months ago

@yxy235 Could you try using this data? https://mega.nz/folder/OWBwEQQL#nfkbhC35N4aLavIpCS2Cig (the sha256 is of the decompressed edges.npy, which is about 42GB)

yxy235 commented 2 months ago

OK. I have reproduced the error and am debugging it now.

yxy235 commented 2 months ago

@easypickings Could you try changing the dtype of your edges.npy to int64? I think that will resolve the problem. It is caused by the number of edges exceeding the int32 range, which triggers an error when constructing the SparseMatrix during the COO-to-CSC conversion. The dtype change is only a temporary workaround. FYI, it may double memory consumption.
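One way to apply this workaround without holding both copies in RAM is sketched below. It assumes the edge file is a standard `.npy` array whose last axis indexes edges; `convert_edges_to_int64` and both file paths are hypothetical names chosen for illustration, while `np.load(..., mmap_mode="r")` and `np.lib.format.open_memmap` are existing NumPy calls.

```python
import numpy as np

def convert_edges_to_int64(src_path, dst_path, chunk=1 << 24):
    """Copy an .npy edge array to a new .npy file with dtype int64.

    Hypothetical helper: both files are memory-mapped and the data is
    copied in chunks along the last axis, so peak RAM stays small.
    """
    src = np.load(src_path, mmap_mode="r")
    dst = np.lib.format.open_memmap(
        dst_path, mode="w+", dtype=np.int64, shape=src.shape
    )
    for start in range(0, src.shape[-1], chunk):
        # Assignment widens each int32 value to int64.
        dst[..., start:start + chunk] = src[..., start:start + chunk]
    dst.flush()
    return dst.shape, dst.dtype
```

For scale: the int32 maximum is 2147483647, while num_edges here is 5507679822. If the file holds a `(2, num_edges)` int32 array, the data takes 2 × 5507679822 × 4 = 44061438576 bytes, and the int64 copy takes twice that on disk and in memory.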

yxy235 commented 1 month ago

TBD: The functions used in https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L749 should be checked, especially https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L538. We should determine the dtype of the CSR from coo.row->shape[0] rather than from coo.row->dtype. If the shape is larger than MAX_INT32, we should use int64 regardless of whether coo.row->dtype is int32 or int64.
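The proposed rule can be illustrated with a small NumPy sketch. This is not the DGL C++ code; `choose_index_dtype` and `build_indptr` are hypothetical helpers that only show the idea of picking the index dtype from the number of entries, since the last CSR row-pointer value equals the number of non-zeros and must fit in the chosen type.

```python
import numpy as np

INT32_MAX = np.iinfo(np.int32).max  # 2147483647

def choose_index_dtype(num_entries: int, input_dtype) -> np.dtype:
    """Pick the index dtype from the entry count, not only the input dtype.

    Hypothetical helper: if the number of COO entries exceeds the int32
    range, use int64 regardless of the dtype of the input arrays.
    """
    if num_entries > INT32_MAX:
        return np.dtype(np.int64)
    return np.dtype(input_dtype)

def build_indptr(rows: np.ndarray, num_rows: int) -> np.ndarray:
    """Build a CSR row-pointer array from COO row indices (illustrative)."""
    dtype = choose_index_dtype(len(rows), rows.dtype)
    counts = np.bincount(rows, minlength=num_rows)
    indptr = np.zeros(num_rows + 1, dtype=dtype)
    # Accumulate in the chosen dtype so the final entry (the number of
    # non-zeros) is representable when the count exceeds the int32 range.
    indptr[1:] = np.cumsum(counts, dtype=dtype)
    return indptr
```

For the graph in this issue, `choose_index_dtype(5507679822, np.int32)` selects int64, whereas a small graph keeps the dtype of its input arrays.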

Rhett-Ying commented 3 weeks ago

@Skeleton003 please help work on this.