dmlc / dgl

Python package built to ease deep learning on graph, on top of existing DL frameworks.
http://dgl.ai
Apache License 2.0
13.54k stars 3.02k forks source link

[BugFIx] Support different dtype for `indptr`, `indices` and `data` in `COOToCSR` #7457

Open Skeleton003 opened 5 months ago

Skeleton003 commented 5 months ago

πŸ”¨Work Item

IMPORTANT:

Project tracker: https://github.com/orgs/dmlc/projects/2

Description

Background πŸ—ΊοΈ

We now have 4 cpp functions responsible for converting COO to CSR, they are SortedCOOToCSR, UnSortedSmallCOOToCSR, UnSortedSparseCOOToCSR and UnSortedDenseCOOToCSR. The selection of the appropriate COOToCSR function among them is based on a heuristic approach (https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L749).

Bug πŸ›

Currently, all 4 COOToCSR functions are defined with the template template <class IdType>. Let us take UnSortedSparseCOOToCSR as an example. https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L413-L414

In the task of converting COO to CSR, the data type IdType is designated in https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/array.cc#L809 , indicating that IdType is actually equal to the data type of coo.row.

And then, in the current implementation of these COOToCSR functions, the constructed ret_indptr, ret_indices and ret_data are all set to be of dtype IdType. https://github.com/dmlc/dgl/blob/f0213d2163245cd0f0a90fc8aa8e66e94fd3724c/src/array/cpu/spmat_op_impl_coo.cc#L426-L433

This is definitely not right because ret_indptr, ret_indices and ret_data do not necessarily have the same data type. Let's break them down in detail:

  1. ret_indices: Its dtype is the same as coo.row.dtype(IdType). This is the only correct part of the current implementation.
  2. ret_indptr: Its dtype depends on the number of non zero elements(NNZ). If IdType is int32 but NNZ exceeds INT32_MAX, then ret_indptr.dtype should be int64, not the same as ret_indices.
  3. ret_data: Its dtype should be exactly the same as coo.data. It could be int, float or even bool, not guaranteed to be the same as ret_indices.

Question ❓

Working Plan 🧭

  1. Remove template <class IdType>, making COOToCSR a non-template function.
  2. Dynamically determine the dtypes of ret_indptr, ret_indices and ret_data as follow.
      1. ret_indices.dtype <- coo.row.dtype,
      1. ret_indptr.dtype <- whether NNZ exceeds INT32_MAX (however, if coo.row is of int64, we set ret_indptr as int64 anyway),
      1. ret_data.dtype <- coo.data.dtype (if coo.data is null, set dtype the same as ret_indptr).

Reference πŸ”—

  1. 699 : This 5-year-old PR implemented the first COOToCSR function with template <DLDeviceType XPU, typename IdType, typename DType> .

  2. 1251 : This 4-year-old PR removed typename DType but kept typename IdType.

  3. 3326 : This 3-year-old PR extended COOToCSR to Sorted, UnSortedDense and UnSortedSparse versions, but still kept typename IdType.

None of these venerable PRs explained why we need IdType. πŸ€”

Acknowledgement πŸ‘

7364 for reporting this bug.

Skeleton003 commented 5 months ago

@frozenbugs @Rhett-Ying @peizhou001 @mfbalin @BarclayII I'd be grateful for your opinions and suggestions on this issue. Because this involves modifying the code written many years ago, I'm afraid I may not have thought it through enough.

Rhett-Ying commented 5 months ago

I think it's a historical issue that only IdType is used only that does not take more scenarios into consideration.

Dynamically determining the type of indptr, indices, data is the correct way. please go ahead with the changes required.

Before the change, is it possible to add a check for data type to throw exception in the scenario we hit in the issue?

Skeleton003 commented 5 months ago

I think it's a historical issue that only IdType is used only that does not take more scenarios into consideration.

Dynamically determining the type of indptr, indices, data is the correct way. please go ahead with the changes required.

Before the change, is it possible to add a check for data type to throw exception in the scenario we hit in the issue?

Yes, please help review #7459 .

Skeleton003 commented 4 months ago

After further investigation, I've discovered a possible reason why we need IdType. It's because the approaches of cuda implementation of COOToCSR are different for int32 and int64. See https://github.com/dmlc/dgl/blob/ed50c170dda9627730cb8ee4c7110205b6ea09de/src/array/cuda/coo2csr.cu#L25 and https://github.com/dmlc/dgl/blob/ed50c170dda9627730cb8ee4c7110205b6ea09de/src/array/cuda/coo2csr.cu#L100 .

If there is no way to merge these 2 approaches into an Integral one, we have to keep COOToCSR a template function and keep IdType parameter. But we still need to Dynamically determine the dtypes of ret_indptr, ret_indices and ret_data.