GrandTheftWalrus opened 1 month ago
Update: I just took a gander at the source code and found out that it is "ideal" with matrices smaller than (1000, 1000). Could you perhaps put this in the repository's README?
Original post:
It seems to fail silently on the last step shown here if I have a matrix that is too large. Is this library only meant for small matrices?
Yes and yes, unfortunately. It was an oversight to not put this in the README; I'll add it now. Thank you for your contribution.
This constraint is expected behavior due to a limitation of the current approach used by the Toeplitz_convolution2d class. You can see that the class attribute self.dt ('double toeplitz') is a large sparse array with a huge amount of tiled / redundant data. See the image below for the case when x_shape=(10,10) and k.shape=(3,3).
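To make the redundancy concrete, here is a toy construction of the double-Toeplitz operator by convolving unit impulses; this mirrors the idea, not the library's exact code:

import numpy as np
import scipy.sparse
import scipy.signal

H, W, kH, kW = 10, 10, 3, 3
k = np.arange(1, kH * kW + 1, dtype=float).reshape(kH, kW)

# Column j of the double-Toeplitz matrix is the full convolution of the
# j-th unit impulse with the kernel, flattened.
cols = []
for r in range(H):
    for c in range(W):
        e = np.zeros((H, W))
        e[r, c] = 1.0
        cols.append(scipy.signal.convolve2d(e, k, mode='full').ravel())
dt = scipy.sparse.csc_matrix(np.stack(cols, axis=1))

print(dt.shape)                  # (144, 100): ((H+kH-1)*(W+kW-1), H*W)
print(dt.nnz)                    # 900 explicitly stored entries (H*W*kH*kW)
print(np.unique(dt.data).size)   # ...all copies of just kH*kW = 9 values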
A simple general solution would be to do it in batches. This would require stitching the outputs back together.
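For illustration, a minimal overlap-add sketch of the batching idea, using dense blocks and scipy.signal.convolve2d as a stand-in for the Toeplitz path (the same stitching applies if each block goes through Toeplitz_convolution2d instead; the block size is arbitrary):

import numpy as np
import scipy.signal

def conv2d_batched(x, k, batch_rows=256):
    # Convolve row-blocks independently and add the overlapping borders
    # back together (overlap-add); valid because convolution is linear.
    H, W = x.shape
    kH, kW = k.shape
    out = np.zeros((H + kH - 1, W + kW - 1))
    for r in range(0, H, batch_rows):
        block = x[r:r + batch_rows]
        out[r:r + block.shape[0] + kH - 1] += scipy.signal.convolve2d(block, k, mode='full')
    return out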
This would be a valuable contribution. A PR solving this would be very welcome from you or anyone else. So, I will leave this issue open for a while.
A more serious general solution would be to implement something like memory sharing of the non-zero values in the self.dt sparse array. This would be a near optimal solution but would require some C code or a tweak to how the matrix multiplication proceeds.
Link to bounty system: https://app.opire.dev/issues/01JCPC4RSRT5Q6A9PC3CTMFZJR
Payout upon acceptance of a PR or demonstration of effort. Others are welcome to contribute as well.
The problem looks interesting, though I was less enthusiastic once I saw that the codebase is Python. I would still like to do some tinkering.
We could use Fourier-transform-based convolution instead of Toeplitz convolution for the dense case. It would reduce the time complexity from O(n^2) to O(n log n).
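For reference, SciPy already ships an FFT-based convolution, so the equivalence is easy to sanity-check (note the FFT route operates on dense arrays, so it does not exploit sparsity):

import numpy as np
import scipy.signal

x = np.random.randn(200, 201)
k = np.random.randn(7, 5)

out_fft = scipy.signal.fftconvolve(x, k, mode='same')    # O(n log n)
out_direct = scipy.signal.convolve2d(x, k, mode='same')  # O(n^2) direct sum
assert np.allclose(out_fft, out_direct)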
I'll see if I can give it a go.
Here's a sketch that works for padding='same'.
import math
import numpy as np
import scipy.sparse

shape_X = (30, 31)
sparsity_X = 0.05
shape_kernel = (5, 4)

X = scipy.sparse.random(m=shape_X[0], n=shape_X[1], density=sparsity_X, format='csr')
k = np.random.randn(shape_kernel[0], shape_kernel[1])

# Center the kernel coordinates so the output aligns with 'same' padding
X_coo = X.tocoo()
k_coo = scipy.sparse.coo_matrix(k)
k_coo.row = k_coo.row - int(math.ceil(k.shape[0] / 2)) + 1
k_coo.col = k_coo.col - int(math.ceil(k.shape[1] / 2)) + 1

# Broadcast every non-zero of X against every non-zero of k: each pair
# contributes x * k at output position (x_row + k_row, x_col + k_col).
# (Build an empty COO matrix, then overwrite its buffers directly.)
out2 = scipy.sparse.coo_matrix((np.array(()), (np.array(()), np.array(()))), shape=X.shape)
out2.row = (k_coo.row[None, :] + X_coo.row[:, None]).reshape(-1)
out2.col = (k_coo.col[None, :] + X_coo.col[:, None]).reshape(-1)
out2.data = (k_coo.data[None, :] * X_coo.data[:, None]).reshape(-1)

# Drop contributions that land outside the 'same'-shaped output
idx_valid = (out2.row >= 0) & (out2.row < X.shape[0]) & (out2.col >= 0) & (out2.col < X.shape[1])
out2.row = out2.row[idx_valid]
out2.col = out2.col[idx_valid]
out2.data = out2.data[idx_valid]

# Accumulate duplicate (row, col) entries into the final values
out2 = out2.tocsr()
out2.sum_duplicates()
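For what it's worth, the sketch agrees with a dense reference computed by scipy.signal.convolve2d:

import scipy.signal

ref = scipy.signal.convolve2d(X.toarray(), k, mode='same')
assert np.allclose(out2.toarray(), ref)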
It's not perfect in that there is a large-ish intermediate state, the broadcasted addition step is not very performant, the sum_duplicates step (which is also internal to the .tocsr() call) is not very performant, and the whole thing is a bit susceptible to changes in the scipy.sparse codebase.
I'm happy to call completing this sketch sufficient to claim the bounty: turning it into a full function with padding options ('full', 'same', 'valid'), good tests, and maybe some benchmarks.
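For anyone picking this up, the three padding modes differ only in a coordinate shift and an output shape, so the sketch above generalizes with something like the following (output_geometry is a hypothetical helper, not existing library code; 'same' reproduces the centering used above):

import math

def output_geometry(x_shape, k_shape, mode):
    # Shift added to (x_row + k_row, x_col + k_col), plus the output shape
    if mode == 'full':
        shift = (0, 0)
        out_shape = (x_shape[0] + k_shape[0] - 1, x_shape[1] + k_shape[1] - 1)
    elif mode == 'same':
        shift = (-(math.ceil(k_shape[0] / 2) - 1), -(math.ceil(k_shape[1] / 2) - 1))
        out_shape = x_shape
    elif mode == 'valid':
        shift = (-(k_shape[0] - 1), -(k_shape[1] - 1))
        out_shape = (x_shape[0] - k_shape[0] + 1, x_shape[1] - k_shape[1] + 1)
    return shift, out_shape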
Here's a sketch using the Gather-GEMM-ScatterAdd algorithm. I believe it is the same as the algorithm used in the spconv v2 library here.
import numpy as np
from scipy.sparse import coo_matrix, issparse

def sparse_conv2d(input_sp_matrix, kernel, stride=1, padding=0):
    """
    Perform 2D convolution on a sparse input matrix using the
    Gather-GEMM-ScatterAdd algorithm.

    Parameters
    ----------
    input_sp_matrix : scipy.sparse matrix
        Sparse input matrix.
    kernel : numpy.ndarray or scipy.sparse matrix
        Convolution kernel.
    stride : int
        Convolution stride.
    padding : int
        Zero-padding added to both sides of each input dimension.

    Returns
    -------
    output_sp_matrix : scipy.sparse.coo_matrix
        Convolution result.
    """
    # Ensure the kernel is a dense numpy array *before* flipping it for
    # convolution (np.flipud/np.fliplr are not reliable on sparse matrices)
    if issparse(kernel):
        kernel = kernel.toarray()
    else:
        kernel = np.asarray(kernel)
    kernel = np.flipud(np.fliplr(kernel))

    # Input and output shapes
    H, W = input_sp_matrix.shape
    kH, kW = kernel.shape
    out_H = ((H + 2 * padding - kH) // stride) + 1
    out_W = ((W + 2 * padding - kW) // stride) + 1

    # Convert to COO format for efficient access to (row, col, data) triplets
    input_coo = input_sp_matrix.tocoo()

    # Pad the input if necessary (padding just shifts the coordinates)
    if padding > 0:
        input_coo = coo_matrix(
            (input_coo.data,
             (input_coo.row + padding, input_coo.col + padding)),
            shape=(H + 2 * padding, W + 2 * padding)
        )

    nz_rows = input_coo.row
    nz_cols = input_coo.col
    nz_data = input_coo.data
    N = len(nz_data)

    # If there are no non-zero input elements, return an empty sparse matrix
    if N == 0:
        return coo_matrix((out_H, out_W))

    # Gather the non-zero kernel indices and values
    k_rows, k_cols = np.nonzero(kernel)
    k_values = kernel[k_rows, k_cols]
    K = len(k_values)
    if K == 0:
        return coo_matrix((out_H, out_W))

    # Expand input and kernel indices to enumerate all N*K contributions:
    # each non-zero input element interacts with every non-zero kernel element
    input_rows_expanded = np.repeat(nz_rows, K)
    input_cols_expanded = np.repeat(nz_cols, K)
    input_data_expanded = np.repeat(nz_data, K)
    u_offsets = np.tile(k_rows, N)
    v_offsets = np.tile(k_cols, N)
    kernel_values_expanded = np.tile(k_values, N)

    # Compute the output index each contribution scatters to
    i = (input_rows_expanded - u_offsets) // stride
    j = (input_cols_expanded - v_offsets) // stride

    # Keep only contributions aligned with the stride
    valid = (((input_rows_expanded - u_offsets) % stride == 0) &
             ((input_cols_expanded - v_offsets) % stride == 0))
    i = i[valid]
    j = j[valid]
    contrib = input_data_expanded[valid] * kernel_values_expanded[valid]

    # Filter out contributions that fall outside the output dimensions
    within_bounds = (i >= 0) & (i < out_H) & (j >= 0) & (j < out_W)
    i = i[within_bounds]
    j = j[within_bounds]
    contrib = contrib[within_bounds]

    # If no valid contributions remain, return an empty sparse matrix
    if len(contrib) == 0:
        return coo_matrix((out_H, out_W))

    # Scatter-add: build the COO output and merge duplicate (i, j) entries
    output_sp_matrix = coo_matrix((contrib, (i, j)), shape=(out_H, out_W))
    output_sp_matrix.sum_duplicates()
    return output_sp_matrix
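Example usage and a sanity check against a dense reference (with a 3x3 kernel, stride 1, and padding 1, the result should match scipy's 'same' convolution):

import scipy.sparse
import scipy.signal

X = scipy.sparse.random(50, 60, density=0.05, format='csr')
k = np.random.randn(3, 3)

out = sparse_conv2d(X, k, stride=1, padding=1)
ref = scipy.signal.convolve2d(X.toarray(), k, mode='same')
assert np.allclose(out.toarray(), ref)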
It is not optimized for speed yet. Cleaning it up and adding a JIT approach (numba, torch jit script, etc.) would be very valuable. And as above, extending to other padding modes, adding tests, benchmarks, etc. would be much appreciated and worthy of the bounty.
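As one possibility for the JIT route, the scatter-add at the end could be jitted with numba (a minimal sketch, assuming numba is installed and i/j are integer arrays; accumulating into a dense buffer only makes sense when the output is not too large):

import numba
import numpy as np

@numba.njit
def scatter_add_dense(i, j, contrib, out_H, out_W):
    # Accumulate contributions directly into a dense output buffer,
    # avoiding the big intermediate COO arrays and sum_duplicates()
    out = np.zeros((out_H, out_W))
    for n in range(len(contrib)):
        out[i[n], j[n]] += contrib[n]
    return out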
I've been reading up on the Toeplitz matrix 2D convolution method to prepare for making these optimizations (I'm a lowly B.Sc student but I'm giving it the old college try). I think the main cause of the space complexity (and therefore the time complexity, due to memory allocation) is that, while the double Toeplitz matrix is very sparse, it scales so badly with the size of the input matrix that any non-trivial kernel size (which scales its density linearly) still produces a huge number of nonzero elements that must be explicitly stored, even though there are only a very small number of unique values. With a 10,000x10,000 input matrix and a 3x3 kernel, there will be 900 million nonzero (and therefore explicitly stored) elements in the double Toeplitz, but they're all repeats of the same 9 values. A 32x32 kernel results in over 100 billion copies of the same 1024 values.
So, I'm trying to see if I can write a sparse matrix class that exploits the redundancy in Toeplitz/double Toeplitz matrices so that we only need to store the kernel values once. The matmul implementation will probably involve something like the Gather-GEMM-ScatterAdd you mentioned. The usage code would hardly need to change, either. How does this sound to you?
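Here is a minimal sketch of the storage side of that idea, using scipy.sparse.linalg.LinearOperator as a stand-in for the class I have in mind: it behaves like the double-Toeplitz matrix without ever materializing it (names are hypothetical, and the matvec shown here works on a dense vector rather than exploiting input sparsity):

import numpy as np
import scipy.signal
from scipy.sparse.linalg import LinearOperator

def implicit_double_toeplitz(x_shape, k):
    # Acts like the (out_size, in_size) double-Toeplitz matrix dt,
    # but only the kernel k is ever stored
    kH, kW = k.shape
    out_shape = (x_shape[0] + kH - 1, x_shape[1] + kW - 1)

    def matvec(v):
        # dt @ vec(x) == vec(full 2D convolution of x with k)
        x = v.reshape(x_shape)
        return scipy.signal.convolve2d(x, k, mode='full').ravel()

    return LinearOperator(
        (out_shape[0] * out_shape[1], x_shape[0] * x_shape[1]),
        matvec=matvec,
    )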