dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.44k stars 3.01k forks source link

[GraphBolt] Dataset dtypes #7130

Open mfbalin opened 7 months ago

mfbalin commented 7 months ago

🔨Work Item

IMPORTANT:

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

Basically, any tensor that has indices corresponding to node ids should be stored with the int32 dtype.

We don't need to do anything for all_nodes_set because it automatically gets its dtype from graph.indices.

Saving features

When saving any feature tensors, we should make sure to use gb.numpy_save_aligned instead of numpy.save.

mfbalin commented 7 months ago

@Rhett-Ying However, all_nodes_set can be an integer so it looks like currently, there is no way to specify its dtype. Casting to int is pretty crucial for performance though, I am hoping that we can find an easy solution.

mfbalin commented 7 months ago

7131 is related to this issue.

Rhett-Ying commented 7 months ago

dtype conversion could be covered by https://github.com/dmlc/dgl/pull/7127 when instantiating dataset.

mfbalin commented 7 months ago

What about itemsets made of an int: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/itemset.py#L117

mfbalin commented 7 months ago

Because sample_neighbors does type checking for the nodes and indices tensors, if the things coming from the itemset don't match the graph indices dtype, it gives an error due to these lines: https://github.com/dmlc/dgl/blob/3ced3411e55bca803ed5ec5e1de6f62e1f21478f/python/dgl/graphbolt/impl/fused_csc_sampling_graph.py#L644-L647

Rhett-Ying commented 7 months ago

As for item set that's generated in runtime such as all_nodes_set, we need to figure out a way to format dtype according to graph.indices

caojy1998 commented 7 months ago

I have changed the indice dtype to int32.

mfbalin commented 7 months ago

I have changed the indice dtype to int32.

Now, we are getting an error because train_set etc. is not in int32. Will need #7127 to be merged to resolve the error.

caojy1998 commented 7 months ago

I have changed the indice dtype to int32.

Now, we are getting an error because train_set etc. is not in int32. Will need #7127 to be merged to resolve the error.

OK, we can wait for that PR to see if there are further questions.

mfbalin commented 7 months ago

For papers100M, it says that the data is already preprocessed. That is why it does not perform the required type casts, causing an error.

The dataset is already preprocessed.

For products dataset, however, it performs the preprocessing step when we first download it.

Extracting file to datasets
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.

Why is there such a discrepancy between these two datasets? I am guessing that we want to take the burden of preprocessing from the user by providing the preprocessed versions of these larger datasets to the users.

Also, after downloading papers, we get the following graph returned:

FusedCSCSamplingGraph(csc_indptr=tensor([         0,          1,          9,  ..., 3228124709, 3228124710,
                              3228124712]),
                      indices=tensor([102309412,   5808518,   6609397,  ...,  92367769,  59629722,
                               95195371]),
                      total_num_nodes=111059956, num_edges=3228124712,
                      node_attributes={},
                      edge_attributes={},)

We get an error because the indices array is not using int32 but train_set is is using int32.

So we need to modify papers100M dataset and cast its indices array of the graph into int32.

@caojy1998

mfbalin commented 7 months ago

Also, for the mag240M dataset, if we also provide the preprocessed version, can we perform the following type casts?

  1. indices, node_type_offset, train_set, validation_set, test_set into int32.
  2. type_per_edge into uint8.

We don't need to do anything for all_nodes_set because it automatically gets its dtype from graph.indices.

mfbalin commented 5 months ago

@frozenbugs the mag240M dataset has yet to be converted to the right dtypes. Who do you think can tackle this task?

mfbalin commented 3 months ago

@Rhett-Ying When is the work for this issue planned? I see that you measure runtime on mag240M in some of the examples. Fixing this issue would lead to memory and runtime savings anywhere mag240M is used.

pyynb commented 3 months ago

@Rhett-Ying When is the work for this issue planned? I see that you measure runtime on mag240M in some of the examples. Fixing this issue would lead to memory and runtime savings anywhere mag240M is used.

i'll do it this week

mfbalin commented 2 months ago

Looks like ogbn-arxiv also has the wrong dtypes, updated the issue description.

Rhett-Ying commented 2 months ago

@Liu-rj FYI

mfbalin commented 1 month ago

@pyynb when the mag240M dataset is downloaded, it extracts into datasets/opt/nvme/..../ogb-lsc-mag240m-seeds directory instead of datasets/ogb-lsc-mag240m-seeds. Could you check?

pyynb commented 1 month ago

@pyynb when the mag240M dataset is downloaded, it extracts into datasets/opt/nvme/..../ogb-lsc-mag240m-seeds directory instead of datasets/ogb-lsc-mag240m-seeds. Could you check?

I'm on it

mfbalin commented 1 month ago

@pyynb Is there an update on the mag240M dataset path issue? Should I test it again?