davisking / dlib

A toolkit for making real world machine learning and data analysis applications in C++
http://dlib.net
Boost Software License 1.0

Add multm_prev_ layer and enhance gemm() function for PLANE_WISE operations #3020

Open Cydral opened 1 month ago

Cydral commented 1 month ago

This pull request introduces a new layer, multm_prev_, and enhances the gemm() function to support PLANE_WISE operations. These changes aim to improve the flexibility and performance of matrix multiplications in deep learning models, particularly for attention mechanisms.

New layer: multm_prev_

The multm_prev_ layer performs a matrix multiplication between the current layer's input and the previous layer's output. This new layer is particularly useful for implementing attention mechanisms and other operations that require direct interactions between intermediate tensors.
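
For orientation, here is a minimal sketch of how such a layer might be wired into a dlib network, assuming it follows the same tag-based pattern as the existing add_prev/mult_prev layers; the multm_prev alias and its exact template form are assumptions based on this PR description, not the merged API:

```cpp
#include <dlib/dnn.h>

using namespace dlib;

// Assumed convention: multm_prev<tag, SUBNET> multiplies the current subnet's
// output with the tensor tagged earlier in the network, e.g. attention
// weights (current output) times the values tensor tagged as tag1.
template <typename SUBNET>
using weighted_values = multm_prev<tag1, SUBNET>;
```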

Key features of multm_prev_:

Enhancement to gemm() function:

The gemm() function has been updated to support two modes of operation: CHANNEL_WISE (default) and PLANE_WISE. This modification allows for more efficient and flexible matrix multiplications, especially when dealing with 4D tensors.

Key changes to gemm():

  1. Added a new parameter g_mode to specify the operation mode (0 for CHANNEL_WISE, 1 for PLANE_WISE)
  2. Implemented PLANE_WISE mode, which performs matrix multiplication for each corresponding 2D plane across all samples and channels
  3. Updated documentation to reflect the new functionality and requirements for both modes
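
To make the PLANE_WISE mode concrete, here is a hedged sketch based only on the description above: dlib's existing tt::gemm takes (beta, dest, alpha, lhs, trans_lhs, rhs, trans_rhs), and this PR is described as appending a mode flag (0 = CHANNEL_WISE, 1 = PLANE_WISE); the actual parameter name and type in the final code may differ.

```cpp
#include <dlib/dnn.h>

using namespace dlib;

// Q and K hold one (nr x nc) matrix per (sample, channel) pair; in PLANE_WISE
// mode each plane of Q is multiplied with the transposed matching plane of K,
// which is exactly the Q*K^T step of an attention head.
void attention_scores(resizable_tensor& scores, const tensor& q, const tensor& k)
{
    // Requires q.num_samples() == k.num_samples(), q.k() == k.k(), q.nc() == k.nc().
    scores.set_size(q.num_samples(), q.k(), q.nr(), k.nr());
    tt::gemm(0, scores, 1, q, false, k, true, /*mode=*/1);  // 1 = PLANE_WISE (assumed flag)
}
```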

These changes provide greater flexibility in implementing complex neural network architectures, particularly those involving attention mechanisms or other operations requiring direct interactions between intermediate tensors.

A new test function, test_multm_prev(), has been added to verify the correct functionality of the multm_prev layer and the enhanced gemm() function.
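
This is not the actual test from the PR, but a hedged illustration of the kind of consistency check such a test could perform, comparing PLANE_WISE gemm output against an explicit per-plane matrix product (image_plane and tensor_rand are existing dlib utilities; the trailing mode argument is assumed from the description above):

```cpp
#include <dlib/dnn.h>

using namespace dlib;

bool plane_wise_gemm_matches_reference()
{
    resizable_tensor a(2, 3, 4, 5), b(2, 3, 5, 6), out(2, 3, 4, 6);
    tt::tensor_rand rnd;
    rnd.fill_uniform(a);
    rnd.fill_uniform(b);

    tt::gemm(0, out, 1, a, false, b, false, /*mode=*/1);  // PLANE_WISE (assumed flag)

    // Reference: multiply each 4x5 plane of a with the matching 5x6 plane of b.
    for (long n = 0; n < a.num_samples(); ++n)
        for (long k = 0; k < a.k(); ++k)
        {
            const matrix<float> pa = image_plane(a, n, k);
            const matrix<float> pb = image_plane(b, n, k);
            const matrix<float> po = image_plane(out, n, k);
            if (max(abs(po - pa*pb)) > 1e-4) return false;
        }
    return true;
}
```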

arrufat commented 1 month ago

Nice, I was just wondering if matmul_prev wouldn't be a better name.

Cydral commented 1 month ago

We can change the name without any problem. I am already dealing with compilation issues, likely due to static uses of multm_prev in the template part, and we will decide on the name to keep afterward.

Cydral commented 1 month ago

On the other hand, I was thinking of using the same naming convention for the transformation applied to softmax, and thus having a special layer named softmaxm. Or should we go with mat_softmax, or perhaps better, msoftmax?

arrufat commented 1 month ago

Would it be too difficult to have just an attention_ layer? I know that would mean doing the backpropagation by hand inside that layer, just like loss_barlowtwins does (but that one is just a bn_con).

Cydral commented 1 month ago

It would be simpler for some people to use, but we would lose the flexibility to build attention in a potentially specific way (even though it currently follows fairly standard and structured steps). For instance, we can decide whether or not to mask, whether to apply an additional filter that removes pad tokens before applying softmax, and so on. I was thinking more of providing, as you did for ResNet, an external definition file that gives a particular definition of the network...

arrufat commented 1 month ago

Yes, we would lose flexibility, or maybe that layer could be initialized with a struct of options that control the behavior/features of the attention layer. But yes, it would still be less flexible.

pfeatherstone commented 1 month ago

It would be harder to implement something like flash attention without an explicit attention_ layer.

Cydral commented 1 month ago

Indeed, I can add a high-level declaration in the layer definition file, similar to what was done for the inception layer, like:

```cpp
template <int embedding_dim, int nb_heads, typename SUBNET>
using attention_ = (...)
```

davisking commented 1 month ago

Sorry, I'm just catching up on these threads. Seems like this PR is still being worked on? There are conflicts with master in any case. Let me know when I should look it over :)

Cydral commented 1 month ago

@davis, no you can do the merging. I think the conflicts with master come from the fact that I created several branches from my own dlib fork to be able to work on several layers in parallel. The new layers being added here are finished and can be integrated, please. Technically, I still have a new layer to release but I'm going to wait until all the changes have been merged into the master branch to avoid any further conflicts... let me know if that's OK with you.

davisking commented 1 month ago

> @davis, no you can do the merging.

You please merge them :)

I'll review them once there aren't merge conflicts.

> Technically, I still have a new layer to release but I'm going to wait until all the changes have been merged into the master branch to avoid any further conflicts... let me know if that's OK with you.

Yeah that's fine :D

Cydral commented 2 weeks ago

@Davis, could you please review this PR?