Oblique Splitting API convergence - Projection Matrix vs Sampling + Transformation

adam2392 commented 3 years ago

I'm trying to write out the GaborMorfSplitter and I'm wondering if we might converge on a consistent API naming scheme for purposes of generalizing from a Base Oblique Splitter:

We start off with our sample data, X (n_samples, n_dims). Our goal is to generate a new matrix, X' that is (n_samples, proj_dims). where proj_dims dictates the how many projections we'll consider. In my view, there are two distinct "steps" done to get our projected data:

1. sampling a patch: (in the paper, this is called a projection matrix), but I feel like a patch makes more semantic sense. The patch doesn't need to even be contiguous! A patch is just defined as a sub-collection of data points (possibly random, or structured). 2. applying a transformation on said patch: In SPORF, this is linear combination, in MORF this is summation.

This then results in "one" number per data sample, per projected dimension, which can be fed into a splitting criterion across samples (i.e. Gini)

My Questions

instead of calling things a projection matrix, can we split this into "sample patch" and "apply_transform" across the board? Right now in SPORF these are implicitly combined into 1 step (probably for speed)

This will then allow any MORF to inherit from a ORF easily.

For your reference if you want some more details

In SPORF

In SPORF, one is interested in i) sampling a "random patch" and then ii) computing a "transformation" on this patch.

The patch is simply dictated by the sparsity of the projection matrix and is generated randomly. The transformation is then a linear combination of the selected patch.

In naive-MORF

One also generates a patch and then computes a transformation on said patch.

The patch now is dictated by some contiguous patch on the image, and the size/location is generated randomly. The transformation is then the summation of the selected patch.

In Gabor-MORF

Same thing, where I currently:

sample a random patch (I use the patch selection procedure in naive-MORF)
apply a "transformation": convolve the patch with random Gabor kernel, and take the summation

adam2392 commented 3 years ago

cc: @parthgvora @ChesterHuynh

adam2392 commented 3 years ago

cc: @jovo @rflperry in case you guys had strong opinions.

rflperry commented 3 years ago

To me, the patch is the set of sampled features in the structured space, not applicable to non-structured SPORF. The projection matrix is the n_projections x n_features matrix for the candidate feature calculation, whose rows are the flattened patches.

As long as a summation is taken, this strategy should work as you can weight the projection_matrix (1's for MORF, +-1 for SPORF, kernel weights for Gabor, potentially any user-specified filter).

The benefit in my mind for explicit construction of the projection matrix is for computational reasons as I believe it to be the fastest method. Once it's made, you can just do matrix multiplication instead of looping over samples and applying transforms to each. In high dimensional settings, I can imagine the projection matrix being quite sparse and thus maybe it isn't as efficient though, but I'd imagine there to be an efficient solution to that.

adam2392 commented 3 years ago

The issue that arises in Gabor is that the way it's applied to a 2D array is through convolution generally and the parameters of the Gabor Kernel changes the size of the kernel itself.

E.g.

from skimage.filters import gabor_kernel

frequency = 1.0
kernel = gabor_kernel(frequency=frequency, n_stds=1, bandwidth=1)
print(kernel.shape)
>>> (3, 3)

frequency = 0.5
kernel = gabor_kernel(frequency=frequency, n_stds=1, bandwidth=1)
print(kernel.shape)
>>> (5,5)

frequency = 0.2
kernel = gabor_kernel(frequency=frequency, n_stds=3, bandwidth=1)
print(kernel.shape)
>>> (19, 19)

If we wanted to set the selected patch with the weights according to a Gabor kernel, I'm realizing that that's not well defined.

Rather instead, one needs to do i) convolution and then ii) summarize up the result into one number similar to how SPORF works. Using np.convolve(patch_of_data, kernel, mode='same'), will output now a 2D array with the same size as the patch_of_data that's the kernel convolved with this patch of data (maybe the patch in this case is the entire image). Then we can take the summation of this.

I agree that the explicit construction of the projection_matrix in SPORF is definitely fastest (since it's matrix mult), and under-the-hood this should still be done more or less, but rn I don't see a clean way to generalize without breaking this up into a patch and transform step in code/description. Because in Gabor-morf, you can't just sample a patch and set the weights according to a Gabor kernel to do this matrix multiplication. You need to sample a Gabor kernel and convolve that with your selected patch and then sum it up. (The convolution itself can of course be vectorized to be a matrix multiplication).

Lmk if I'm misinterpreting somewhere.

rflperry commented 3 years ago

I'm not too familiar with gabor filters and skimage, but I would think that if the way you've framed it as sum(filter \conv patch) then there exists (not saying it's easy to compute) a vector of feature weights such that the summation is equivalent vector(patch) \matmul weights which is what the projection matrix is. Since convolution is just a weighted sum of the features?

See sparse convolutional coding by Jeremias https://arxiv.org/abs/1607.02005

You can represent the convolution as a matrix multiplication

adam2392 commented 3 years ago

Let me know if this makes sense:

X (n, height, width)
dimension: d = height * width
projected_dim: d' << d
patch: (height', width')
kernel: (k, k)

If we want say the Gabor kernel as our "weights", the kernel is defined irrespective of image height/width. You just feed in params, and you get a kernel. To make what you're describing functionally, I think the issue is:

select a patch, say (5, 4) region on the image. You can't just weight a Gabor kernel in this region to my knowledge. Cuz if you parametrize the kernel and choose a random one, you might get a say (10, 10) kernel. What do you do now?

I think... this might only be an issue for Gabor filters, because I can imagine if we use a exponential kernel, we pre-select our patch, and we just get an exponential fall-off in weights starting at 1 in the patch. (i.e. the kernel is parametrized by the patch).

rflperry commented 3 years ago

Yep fair point. So the issue is more that a patch after convolution with the gabor filter (and potentially plenty of other filters) incorporates information from features outside of that patch (since the kernel centers at a pixel but uses the neighborhood).

Is it possible that MORF only wants a filter at a given pixel, rather than the sum of a filter o patch of pixels?
Even if features outside of the patch are used, it's still a weighted combination of features, just more features than the immediate patch.

So pseudocode.

Sample patch (.e. 5,4)
sample kernel (i.e. 10, 10)
Create padded patch using patch + kernel dimensions
Obtain sum of kernel weights at each pixel in padded patch (each pixel in the original patch will be weighted by the sum of (10, 10) kernel values. Each pixel from the padded region will be weighted by a subset of the kernel values.)
Create projection matrix row with each feature getting its sum of weights from above.

adam2392 commented 3 years ago

Hmm okay yeah that makes sense with the jank version I have rn. I guess unfortunately for some of these "spatial" filters, there's no current way that we can come up w/ that can bunch the convolution step + summation into a single matrix multiply step...

rflperry commented 3 years ago

Well, give a feature vector x \in R^(1 x d) for which a patch of height H and width W is being sampled post convolution and summed, that's the new feature = sum(x \conv filter)

We can represent our convolution as a matrix C \in R^(d x (HW)) and summation of HW features is just a vector 1 \in (HW x 1). So our sum of convolution patch is x * C * 1 and thus C * 1 \in (d x 1) is a row of our projection matrix. The hard part is just computing the convolution matrix elegantly, but I think Jeremias's paper does it nicely.

adam2392 commented 3 years ago

I think I figured it out!

My original issue was that I only had a way of obtaining the convolution was a "full" discrete convolution, which changed the shape of the output, so hard to pick the patch. Now, I think I have some code that can do "same" discrete convolution, where padding is done smartly to make the output of x \ conv filter to be same shape as x was. The matrix multiplication operations as a result are slightly different, and it's a bit of a juggle to get it generic.

Thus now, one can do a patch pick, so the projection matrix is: P = K_conv_matrix * S, where S is the vector doing patch-selection (1's at patch location, and 0 elsewhere). It's unfortunately a lot of super detailed Toeplitz operations, so I think the good thing is we can actually abstract this entirely into pytorch if we're willing to make that dependency! :).

https://pytorch.org/docs/stable/nn.functional.html


# sub-sampled data points
input = X.reshape(sample_inds, 2, image_height, image_width)

# kernels we're interested in (note we can generalize this to many kernels I think 
# simultaneously! just modify input and output # of channels.
# here 2 channels (1 for real component and 1 for imaginary)
weight = kernel.reshape(2, 2, kernel_height, kernel_width)

# output is now a 4D tensor (batch, channels, height, width)
output_tensor = conv2d(input, weight, padding=same_padding).numpy()

# perform patch_selection
patch_weights = output_tensor[:, :, patch_idxs]

# set projection matrix with the patch weights now at the correct location
proj_mat = np.zeros((n_features, proj_dims))
proj_mat[patch_idxs, idx] = patch_weights

Otherwise we have toeplitz matrix multiplication code that's pretty complex and a pain to unit-test. In addition, if we can figure out how to smartly "parallelize" this, then we can easily perform all the convolution steps on GPU, which basically makes this step trivial computationally. The only thing to be cautious of is overhead in passing back/forth data from CPU<->GPU.

Some notes on the above topic, while I'm thinking about it:

ideally we pass a copy of the tensorized data into GPU at beginning of forest building (keep it there?)
if we can just pass in sample indices to GPU, and we can slice the tensorized data on GPU without having to slice it on CPU side and pass to GPU, then this is desirable
the only thing left to be passed to GPU at each node in trees are the kernels themselves to apply the convolution operation.
I imagine that passing to GPU is only worth it if i) the kernel is big enough and ii) the original data dimensionality is big enough, otw probably slower then just doing pure CPU.

NeuroDataDesign / manifold_random_forests