mfbalin opened 7 months ago
@Rhett-Ying However, all_nodes_set can be an integer, so it looks like there is currently no way to specify its dtype. Casting to int32 is crucial for performance, though, so I am hoping we can find an easy solution.
dtype conversion could be covered by https://github.com/dmlc/dgl/pull/7127 when instantiating dataset.
What about itemsets made of an int: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/itemset.py#L117
Because sample_neighbors does type checking for the nodes and indices tensors, if the things coming from the itemset don't match the graph indices dtype, it gives an error due to these lines: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/impl/fused_csc_sampling_graph.py#L644-L647
As for item sets that are generated at runtime, such as all_nodes_set, we need to figure out a way to set their dtype according to graph.indices.
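A minimal numpy sketch of that idea (names are stand-ins, not GraphBolt API): derive the dtype of a runtime-generated node set from the graph's indices tensor instead of hard-coding it.

```python
import numpy as np

# Stand-in for graph.indices; its dtype is the single source of truth.
graph_indices = np.array([0, 2, 1, 4], dtype=np.int32)

# all_nodes_set is conceptually just range(total_num_nodes); materialize it
# with the graph's indices dtype so the sampler's type check passes.
total_num_nodes = 5
all_nodes = np.arange(total_num_nodes, dtype=graph_indices.dtype)

assert all_nodes.dtype == graph_indices.dtype
```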
I have changed the indices dtype to int32.
Now we are getting an error because train_set etc. is not in int32. We will need #7127 to be merged to resolve the error.
OK, we can wait for that PR to see if there are further questions.
For papers100M, it says that the data is already preprocessed. That is why it does not perform the required type casts, causing an error.
The dataset is already preprocessed.
For products dataset, however, it performs the preprocessing step when we first download it.
Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Why is there such a discrepancy between these two datasets? I am guessing that we want to take the burden of preprocessing off the user by providing preprocessed versions of the larger datasets.
Also, after downloading papers, we get the following graph returned:
FusedCSCSamplingGraph(csc_indptr=tensor([ 0, 1, 9, ..., 3228124709, 3228124710,
3228124712]),
indices=tensor([102309412, 5808518, 6609397, ..., 92367769, 59629722,
95195371]),
total_num_nodes=111059956, num_edges=3228124712,
node_attributes={},
edge_attributes={},)
We get an error because the indices array is not using int32 but train_set is. So we need to modify the papers100M dataset and cast the indices array of the graph into int32.
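A back-of-the-envelope check using the numbers from the graph printed above (a sketch, not GraphBolt code): 111,059,956 node ids fit comfortably in int32, and halving the 3,228,124,712-entry indices array from int64 to int32 saves about 12 GiB. Note that csc_indptr has to stay int64, since its values run up to num_edges, which exceeds the int32 range.

```python
import numpy as np

num_nodes = 111_059_956    # papers100M total_num_nodes
num_edges = 3_228_124_712  # papers100M num_edges

# Node ids fit in int32 only if the largest id is <= 2**31 - 1.
assert num_nodes - 1 <= np.iinfo(np.int32).max
# csc_indptr values go up to num_edges, which does NOT fit in int32.
assert num_edges > np.iinfo(np.int32).max

# Casting indices from int64 to int32 saves 4 bytes per edge.
saved_gib = num_edges * 4 / 2**30
print(f"~{saved_gib:.1f} GiB saved")  # about 12.0 GiB
```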
@caojy1998
Also, for the mag240M dataset, if we also provide the preprocessed version, can we perform the following type casts?
- indices, node_type_offset, train_set, validation_set, test_set into int32.
- type_per_edge into uint8.

We don't need to do anything for all_nodes_set because it automatically gets its dtype from graph.indices.
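A hedged sketch of those casts on toy numpy arrays (the names mirror the issue, but the values are made up; the real tensors live in the FusedCSCSamplingGraph and its item sets):

```python
import numpy as np

# Toy stand-ins for the mag240M tensors named above.
indices = np.array([3, 0, 7, 2], dtype=np.int64)
type_per_edge = np.array([0, 1, 2, 1], dtype=np.int64)

# uint8 is enough as long as there are at most 256 edge types.
assert type_per_edge.max() <= np.iinfo(np.uint8).max
indices = indices.astype(np.int32)
type_per_edge = type_per_edge.astype(np.uint8)

assert indices.dtype == np.int32 and type_per_edge.dtype == np.uint8
```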
@frozenbugs the mag240M dataset has yet to be converted to the right dtypes. Who do you think can tackle this task?
@Rhett-Ying When is the work for this issue planned? I see that you measure runtime on mag240M in some of the examples. Fixing this issue would lead to memory and runtime savings anywhere mag240M is used.
i'll do it this week
Looks like ogbn-arxiv also has the wrong dtypes, updated the issue description.
@Liu-rj FYI
@pyynb when the mag240M dataset is downloaded, it extracts into the datasets/opt/nvme/..../ogb-lsc-mag240m-seeds directory instead of datasets/ogb-lsc-mag240m-seeds. Could you check?
I'm on it
@pyynb Is there an update on the mag240M dataset path issue? Should I test it again?
🔨 Work Item
IMPORTANT: Project tracker: https://github.com/orgs/dmlc/projects/2
Description
Basically, any tensor that has indices corresponding to node ids should be stored with the int32 dtype.
- indices, node_type_offset, train_set, validation_set, test_set cast into int32.
- type_per_edge into uint8 or int8.

We don't need to do anything for all_nodes_set because it automatically gets its dtype from graph.indices.

Saving features
When saving any feature tensors, we should make sure to use gb.numpy_save_aligned instead of numpy.save.
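For context, a plain-numpy illustration of saving a feature tensor and reading it back memory-mapped; in the actual fix the np.save call would be swapped for gb.numpy_save_aligned, which (as the name suggests) aligns the data section on disk so feature fetching via mmap or direct I/O is efficient:

```python
import numpy as np
import os, tempfile

path = os.path.join(tempfile.mkdtemp(), "feat.npy")
features = np.arange(12, dtype=np.float32).reshape(3, 4)

# The fix: call gb.numpy_save_aligned(path, features) here instead.
np.save(path, features)

# Feature fetching can then memory-map the file rather than load it eagerly.
mm = np.load(path, mmap_mode="r")
assert mm.shape == (3, 4) and mm.dtype == np.float32
```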