alexbovet / flow_stability

Flow stability framework for dynamic community detection in temporal networks
GNU Lesser General Public License v3.0
24 stars 4 forks source link

Index sorting in the `csr_` operations #47

Open j-i-l opened 1 month ago

j-i-l commented 1 month ago

Comparing the output of the csr_ methods with native implementations we get mismatches on .data and .indices (see e.g. https://github.com/alexbovet/flow_stability/actions/runs/10865864561/job/30152834847#step:6:102).

A smaller example illustrates this better (also comparing csr_add to a native addition):

# data
[0.16911086 0.26525225 0.04051438 0.09085281 0.55254878 0.82202173
 0.51055564 0.50439578 0.85289212 0.74480397]  # native
[0.16911086 0.26525225 0.04051438 0.09085281 0.55254878 0.82202173
 0.51055564 0.85289212 0.50439578 0.74480397]  # csr_add
# indicies
[0 1 1 2 3 3 4 5 7 9]  # native
[0 1 1 2 3 3 4 7 5 9]  # csr_add
# indptr
[ 0  0  1  2  2  4  5  5  7 10 10]  # native
[ 0  0  1  2  2  4  5  5  7 10 10]  # csr_add

Basically, the indices are not ordered within a indptr slice but data match the indices (at least in this example). Not sure if that leads to any trouble down the road.

It is expected that the indices may not be sorted after some operations. csr matrices have a has_sorted_indices attribute. If you do a sort_indices() and eliminate_zeros() before the np.testing.assert_allclose(), it should hopefully fix the problem.

_Originally posted by @alexbovet in https://github.com/alexbovet/flow_stability/issues/35#issuecomment-2353056512_

j-i-l commented 1 month ago

Regarding the tests, there is now a little helper function that allows to compare matrices with not necessarily sorted indices:

https://github.com/alexbovet/flow_stability/blob/b658f27a67921b5f69d192ebbe8c058c8cfea73d/tests/conftest.py#L75-L88


@alexbovet Regarding the csr_ operations, would it make sense to perform an in-place sort_indices() before returning the csr_matrix?

alexbovet commented 1 month ago

@alexbovet Regarding the csr_ operations, would it make sense to perform an in-place sort_indices() before returning the csr_matrix?

Is it necessary?

j-i-l commented 1 month ago

@alexbovet Regarding the csr_ operations, would it make sense to perform an in-place sort_indices() before returning the csr_matrix?

Is it necessary?

Unsure. It could matter, depending on how operations are implemented. If it is looped over the indices and some </> are used, then we might skip some indices. This will, in general, depend on the implementation, I guess.

Having just a quick look at our own csr_ operations we use something a bit along these lines in the cython_csr_csrT_matmul function:

https://github.com/alexbovet/flow_stability/blob/8648a8bd715e3b086ed13cb7453d41754c176398/src/flowstab/_cython_sparse_stoch.pyx#L331-L334

j-i-l commented 1 month ago

@alexbovet OK in our implementations it can matter!: If we take the example data from the comment:

import numpy as np
from scipy.sparse import csr_matrix

from flowstab.SparseStochMat import csr_csrT_matmul

n_data = np.array([0.16911086, 0.26525225, 0.04051438, 0.09085281, 0.55254878, 0.82202173,
 0.51055564, 0.50439578, 0.85289212, 0.74480397])  # native
data = np.array([0.16911086, 0.26525225, 0.04051438, 0.09085281, 0.55254878, 0.82202173,
 0.51055564, 0.85289212, 0.50439578, 0.74480397])  # csr_add
# indicies
n_indices = np.array([0, 1, 1, 2, 3, 3, 4, 5, 7, 9])  # native
indices = np.array([0, 1, 1, 2, 3, 3, 4, 7, 5, 9])  # csr_add
# indptr
n_indptr = np.array([ 0, 0,  1,  2,  2,  4,  5,  5,  7, 10, 10])  # native
indptr = np.array([ 0,  0,  1,  2,  2,  4,  5,  5,  7, 10, 10])  # csr_add

An = csr_matrix((n_data, n_indices, n_indptr))
A = csr_matrix((data, indices, indptr))
np.testing.assert_equal(An.toarray(), A.toarray())  # make sure it's the same array repr.

so we have A as unsorted output and An as sorted, and use csr_csrT_matmul:

In [22]: np.testing.assert_equal(csr_csrT_matmul(A, A).toarray(), csr_csrT_matmul(An, An).toarray())
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[22], line 1
----> 1 np.testing.assert_equal(csr_csrT_matmul(A, A).toarray(), csr_csrT_matmul(An, An).toarray())

   ...
AssertionError:
Arrays are not equal

Mismatched elements: 1 / 100 (1%)
Max absolute difference among violations: 0.2544151
Max relative difference among violations: 0.16557306
 ACTUAL: array([[0.      , 0.      , 0.      , 0.      , 0.      , 0.      ,
        0.      , 0.      , 0.      , 0.      ],
       [0.      , 0.028598, 0.      , 0.      , 0.      , 0.      ,...
 DESIRED: array([[0.      , 0.      , 0.      , 0.      , 0.      , 0.      ,
        0.      , 0.      , 0.      , 0.      ],
       [0.      , 0.028598, 0.      , 0.      , 0.      , 0.      ,...

If we sort the indices of A things are OK again:

In [23]: A.sort_indices()

In [23]: np.testing.assert_equal(csr_csrT_matmul(A, A).toarray(), csr_csrT_matmul(An, An).toarray())

In [24]:

Unless I'm doing something illogical here, it seems to me that we should sort the csr matrices before returning them.

Checking if the input matrices have sorted indices could be a good thing.

alexbovet commented 1 month ago

Okay, I see. It matters for the Cython code I wrote. But I guess the scipy functions take into account that indices may not be sorted. In the end, I don't think I used my Cython implementation of the matrices operations for stoch_mat. I always converted to scipy csr to multiply or add them as it was faster. But anyway, it would be a good idea to force a sorting of the indices before calling any cython functions.