RichieHakim / sparse_convolution

Sparse convolution in python
MIT License

Is there a matrix size limit? #1

Open GrandTheftWalrus opened 1 month ago

GrandTheftWalrus commented 1 month ago

Update: I just took a gander at the source code and found out that it is "ideal" with matrices smaller than (1000, 1000). Could you perhaps note this in the repository's README?

Original post:

It seems to fail silently on the last step shown here if I have a matrix that is too large:

    #  --------------- Preamble  ---------------
    # Extract the heatmap values into a 2D numpy array
    max_x = max(tile.x for tile in heatmap_tiles) + 1
    max_y = max(tile.y for tile in heatmap_tiles) + 1

    # Create a sparse matrix from heatmap_tiles
    rows = [tile.y for tile in heatmap_tiles]
    cols = [tile.x for tile in heatmap_tiles]
    data = np.ones(len(heatmap_tiles))
    heatmap_matrix = scipy.sparse.coo_matrix((data, (rows, cols)), shape=(max_y, max_x))

    # Generate the circular mask once
    circular_kernel = generate_circular_mask(radius)
    mask_area = np.sum(circular_kernel).astype(np.float32)  # Total number of tiles in the mask
    circular_kernel = circular_kernel / mask_area  # Normalize the kernel

    A = scipy.sparse.rand(100, 100, density=0.1) # for testing

    print("A.shape: ", A.shape) # (100, 100)
    print("heatmap_matrix.shape: ", heatmap_matrix.shape) # (10295, 8975)
    print("circular_kernel.shape: ", circular_kernel.shape) # (21, 21)

    # --------------- relevant part  ---------------

    # Prepare class (it crashes here)
    conv = sc.Toeplitz_convolution2d(
        x_shape=heatmap_matrix.shape, # <--- Le problem
        k=circular_kernel,
        mode='same',
        dtype=np.float32,
    )

Is this library only meant for small matrices?

RichieHakim commented 5 days ago

> I just took a gander at the source code and found out that it is "ideal" with matrices smaller than (1000, 1000). Could you perhaps note this in the repository's README? Is this library only meant for small matrices?

Yes and yes, unfortunately. It was an oversight to not put this in the README; I'll add it now. Thank you for your contribution.

This constraint is expected behavior, due to a limitation of the current approach used by the Toeplitz_convolution2d class. You can see that the class attribute self.dt ('double toeplitz') is a large sparse array with a huge amount of tiled / redundant data. See the image below for the case when x_shape=(10,10) and k.shape=(3,3).

[image]
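
To make the scaling concrete, here is a small illustration (assuming the package imports as `sparse_convolution`, as in the original post): the number of stored values in self.dt grows on the order of H * W * kH * kW, so it explodes well before the input itself is large.

    import numpy as np
    import sparse_convolution as sc  # assumed import name

    # Each of the H*W input elements touches kH*kW outputs, so the double
    # Toeplitz stores roughly H*W*kH*kW explicit values.
    for n in (100, 300, 1000):
        conv = sc.Toeplitz_convolution2d(
            x_shape=(n, n),
            k=np.ones((3, 3), dtype=np.float32),
            mode='same',
            dtype=np.float32,
        )
        print(n, conv.dt.nnz)  # roughly 9 * n**2 stored values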

A simple general solution would be to do it in batches, stitching the outputs together (see the sketch below). That would be a valuable contribution; a PR solving this would be very welcome from you or anyone else, so I will leave this issue open for a while. A more serious general solution would be to implement something like memory sharing of the non-zero values in the self.dt sparse array. That would be a near-optimal solution, but would require some C code or a tweak to how the matrix multiplication proceeds.
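
For concreteness, here is a minimal sketch of the row-batched (overlap-save) idea. It assumes an odd-sized kernel, mode='same', and a hypothetical helper `convolve_same(band, k)` that wraps Toeplitz_convolution2d on a single band and returns a scipy.sparse matrix:

    import scipy.sparse

    def conv2d_same_batched(X, k, convolve_same, batch_rows=500):
        # Overlap-save over rows: each band carries a halo of k.shape[0]//2
        # extra rows, so the 'same' outputs in the band interior are exact.
        H = X.shape[0]
        m = k.shape[0] // 2
        X = X.tocsr()
        out_bands = []
        for a in range(0, H, batch_rows):
            b = min(a + batch_rows, H)
            top, bot = max(a - m, 0), min(b + m, H)
            y = convolve_same(X[top:bot], k)  # hypothetical per-band wrapper
            out_bands.append(y.tocsr()[a - top : a - top + (b - a)])
        return scipy.sparse.vstack(out_bands)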

RichieHakim commented 3 days ago

I'm happy to put a $20-100 bounty on a solution to this problem.

Breakdown:

Link to bounty system: https://app.opire.dev/issues/01JCPC4RSRT5Q6A9PC3CTMFZJR

Payout upon acceptance of a PR or demonstration of effort. Others are welcome to contribute as well.

RS-labhub commented 2 days ago

The problem looked interesting until I saw that the codebase is Python, but I would still like to do some tinkering.

RS-labhub commented 2 days ago

For dense convolution, we could use Fourier-transform-based convolution instead of Toeplitz convolution. That reduces the time complexity from O(n^2) to O(n log n).
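
As a minimal sketch of that route (my framing, not the commenter's code): densify the input and use scipy.signal.fftconvolve. Note that this gives up sparsity entirely, so memory becomes dense O(H*W) regardless of the input's density (shapes here are illustrative; the thread's 10295x8975 matrix would also work given enough RAM):

    import numpy as np
    import scipy.signal
    import scipy.sparse

    X = scipy.sparse.random(4096, 4096, density=0.01, format='csr')
    k = np.random.randn(21, 21)

    # ~O(HW log HW) time, but the input is densified: ~134 MB of float64
    # here, before FFT work buffers.
    out = scipy.signal.fftconvolve(X.toarray(), k, mode='same')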

GrandTheftWalrus commented 1 day ago

I'll see if I can give it a go

RichieHakim commented 10 hours ago

Here's a sketch that works for padding = 'same'.

import math

import numpy as np
import scipy.sparse

shape_X = (30, 31)
sparsity_X = 0.05
shape_kernel = (5, 4)

X = scipy.sparse.random(m=shape_X[0], n=shape_X[1], density=sparsity_X, format='csr')
k = np.random.randn(shape_kernel[0], shape_kernel[1])

X_coo = X.tocoo()
k_coo = scipy.sparse.coo_matrix(k)

# Shift the kernel coordinates into centered offsets so that the output
# lines up with 'same' padding
k_coo.row = k_coo.row - int(math.ceil(k.shape[0] / 2)) + 1
k_coo.col = k_coo.col - int(math.ceil(k.shape[1] / 2)) + 1

# Empty COO container for the output; row/col/data are filled in below
out2 = scipy.sparse.coo_matrix((np.array(()), (np.array(()), np.array(()))), shape=X.shape)

# Pair every nonzero input element with every nonzero kernel element
# (broadcasted outer sums/products, flattened to length nnz_X * nnz_k)
out2.row = (k_coo.row[None, :] + X_coo.row[:, None]).reshape(-1)
out2.col = (k_coo.col[None, :] + X_coo.col[:, None]).reshape(-1)
out2.data = (k_coo.data[None, :] * X_coo.data[:, None]).reshape(-1)

# Keep only the contributions that land inside the 'same' output bounds
idx_valid = (out2.row >= 0) & (out2.row < X.shape[0]) & (out2.col >= 0) & (out2.col < X.shape[1])
out2.row = out2.row[idx_valid]
out2.col = out2.col[idx_valid]
out2.data = out2.data[idx_valid]

# Accumulate duplicate coordinates (tocsr() also sums internally)
out2 = out2.tocsr()
out2.sum_duplicates()
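
For what it's worth, a quick dense cross-check (my addition, assuming the snippet above has been run): the sketch should reproduce scipy's 'same' convolution.

    import scipy.signal

    # Dense reference for the 'same' padding case
    ref = scipy.signal.convolve2d(X.toarray(), k, mode='same')
    assert np.allclose(out2.toarray(), ref)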

It's not perfect:

- there is a large-ish intermediate state (one entry per (input nonzero, kernel nonzero) pair),
- the broadcasted addition step is not very performant,
- the sum_duplicates step (which also runs internally in the .tocsr() call) is not very performant, and
- the whole thing is a bit susceptible to changes in the scipy.sparse codebase.

I'd consider this sketch sufficient to claim the bounty once it's turned into a full function with padding options ('full', 'same', 'valid'), good tests, and maybe some benchmarks.

RichieHakim commented 6 hours ago

Here's a sketch using the Gather-GEMM-ScatterAdd algorithm. I believe it is the same as the algorithm used in the spconv v2 library.

import numpy as np
from scipy.sparse import coo_matrix, csr_matrix
from collections import defaultdict

def sparse_conv2d(input_sp_matrix, kernel, stride=1, padding=0):
    """
    Performs 2D convolution on a sparse input matrix using the
    Gather-GEMM-ScatterAdd algorithm.

    Args:
        input_sp_matrix (scipy.sparse matrix):
            Sparse input matrix.
        kernel (Union[numpy.ndarray, scipy.sparse matrix]):
            Convolution kernel.
        stride (int):
            Convolution stride.
        padding (int):
            Convolution padding.

    Returns:
        output_sp_matrix (scipy.sparse matrix):
            The convolution result.
    """
    # Ensure the kernel is a dense numpy array before flipping
    if isinstance(kernel, (csr_matrix, coo_matrix)):
        kernel = kernel.toarray()
    else:
        kernel = np.asarray(kernel)

    # Flip the kernel so the index arithmetic below computes convolution
    # rather than cross-correlation
    kernel = np.flipud(np.fliplr(kernel))

    # Get the input shape
    H, W = input_sp_matrix.shape
    kH, kW = kernel.shape

    # Compute output dimensions
    out_H = ((H + 2 * padding - kH) // stride) + 1
    out_W = ((W + 2 * padding - kW) // stride) + 1

    # Convert to COO format up front for direct access to row/col coordinates
    input_coo = input_sp_matrix.tocoo()

    # Pad the input if necessary by shifting the nonzero coordinates
    if padding > 0:
        padded_shape = (H + 2 * padding, W + 2 * padding)
        input_coo = coo_matrix(
            (input_coo.data,
             (input_coo.row + padding, input_coo.col + padding)),
            shape=padded_shape
        )

    nz_rows = input_coo.row
    nz_cols = input_coo.col
    nz_data = input_coo.data
    N = len(nz_data)

    # If no non-zero elements, return an empty sparse matrix
    if N == 0:
        return coo_matrix((out_H, out_W))

    # Get kernel indices and values
    k_indices = np.argwhere(kernel)  # (K, 2) array of nonzero kernel coordinates
    k_values = kernel[k_indices[:,0], k_indices[:,1]]
    K = len(k_values)

    if K == 0:
        return coo_matrix((out_H, out_W))

    # Expand input and kernel indices to compute all contributions:
    # each input element interacts with every non-zero kernel element,
    # giving flat arrays of length N * K
    input_rows_expanded = np.repeat(nz_rows, K)
    input_cols_expanded = np.repeat(nz_cols, K)
    input_data_expanded = np.repeat(nz_data, K)

    # Kernel offsets
    u_offsets = k_indices[:,0]
    v_offsets = k_indices[:,1]
    u_offsets = np.tile(u_offsets, N)
    v_offsets = np.tile(v_offsets, N)
    kernel_values_expanded = np.tile(k_values, N)

    # Compute output indices
    i = (input_rows_expanded - u_offsets) // stride
    j = (input_cols_expanded - v_offsets) // stride

    # Check for alignment with stride
    valid_i = (input_rows_expanded - u_offsets) % stride == 0
    valid_j = (input_cols_expanded - v_offsets) % stride == 0
    valid = valid_i & valid_j

    # Apply mask
    i = i[valid]
    j = j[valid]
    contrib = input_data_expanded[valid] * kernel_values_expanded[valid]

    # Filter out contributions that fall outside the output dimensions
    within_bounds = (i >= 0) & (i < out_H) & (j >= 0) & (j < out_W)
    i = i[within_bounds]
    j = j[within_bounds]
    contrib = contrib[within_bounds]

    # If no valid contributions, return an empty sparse matrix
    if len(contrib) == 0:
        return coo_matrix((out_H, out_W))

    # Create the output sparse matrix in COO format
    output_sp_matrix = coo_matrix((contrib, (i, j)), shape=(out_H, out_W))

    # Sum duplicate coordinate entries in place
    output_sp_matrix.sum_duplicates()

    return output_sp_matrix
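
As a quick sanity check (my addition, continuing from the code above): with stride=1, an odd-sized kernel, and padding=kH//2, the result should match scipy's 'same' convolution.

    import scipy.signal

    # Random ~10%-dense test input and an odd 5x5 kernel
    X = coo_matrix(np.random.randn(20, 23) * (np.random.rand(20, 23) < 0.1))
    k = np.random.randn(5, 5)

    out = sparse_conv2d(X, k, stride=1, padding=2)  # padding = kH // 2
    ref = scipy.signal.convolve2d(X.toarray(), k, mode='same')
    assert np.allclose(out.toarray(), ref)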

It is not optimized for speed yet. Cleaning it up and adding a JIT approach (numba, torch jit script, etc.) would be very valuable. And as above, extending it to other padding modes and adding tests, benchmarks, etc. would be much appreciated and worthy of the bounty.

GrandTheftWalrus commented 2 hours ago

I've been reading up on the Toeplitz matrix 2D convolution method to prepare for making these optimizations (I'm a lowly B.Sc student, but I'm giving it the old college try). I think the main cause of the space complexity (and therefore the time complexity, via memory allocation) is this: although the double Toeplitz matrix is very sparse, its size scales so badly with the input matrix that any non-trivial kernel (whose size scales the density linearly) still produces a huge number of explicitly stored nonzero elements, even though only a handful of unique values appear among them. With a 10,000x10,000 input matrix and a 3x3 kernel, the double Toeplitz holds 900 million explicitly stored nonzero elements, but they are all repeats of the same 9 values. A 32x32 kernel yields over 100 billion copies of the same 1024 values.

So, I'm trying to see if I can write a sparse matrix class that exploits the redundancy in Toeplitz/double Toeplitz matrices so that we only need to store the kernel values once. The matmul implementation will probably involve something like the Gather-GEMM-ScatterAdd you mentioned. The usage code would hardly need to be changed, too. How does this sound to you?