CNugteren / CLBlast

Tuned OpenCL BLAS
Apache License 2.0

New routine for a stride of 0 for C in CLBlastSgemmStridedBatched() is needed #347

Open TaihuLight opened 5 years ago

TaihuLight commented 5 years ago

Following up on https://github.com/CNugteren/CLBlast/issues/346: a stride of 0 for C in CLBlastSgemmStridedBatched() would also be useful in deep learning applications, such as the backward pass of a convolution layer. It can be implemented in two steps: (1) use CLBlastSgemmStridedBatched() to compute the batched matrices C; (2) add a new routine (e.g., named StridedBatchedAddMatrix) that sums the batches of C computed in step (1) element-wise, i.e., reduces the results of CLBlastSgemmStridedBatched(). That is, the new routine would compute the sum of the batched matrices as follows:

for (int i = 0; i < batch_count; i++) {
    SUM = SUM + β * (C + i * c_stride);
}

where c_stride is the stride between two batches of the C matrix, and the batched C matrices are computed with CLBlastSgemmStridedBatched(). The new routine for adding matrices would therefore be similar to an xAXPYStridedBatched routine: a StridedBatched version of AXPY for adding vectors.
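For reference, the same reduction can already be expressed today as a loop of CLBlastSaxpy calls, treating each batch of C as a flat vector of c_stride elements. Below is a minimal sketch only; the wrapper function and the buffer/variable names (c, sum, c_stride, batch_count, beta) are illustrative placeholders, not an existing or proposed CLBlast routine.

// Sketch: sum the batches of C into a separate accumulator buffer by calling
// CLBlastSaxpy once per batch. Assumes `sum` holds c_stride floats and has
// been initialised (e.g. to zero) beforehand.
#include <clblast_c.h>

CLBlastStatusCode sum_batched_c(cl_mem c,              // batched C from CLBlastSgemmStridedBatched()
                                cl_mem sum,            // accumulator of c_stride floats
                                const size_t c_stride, // elements between consecutive batches
                                const size_t batch_count,
                                const float beta,
                                cl_command_queue* queue) {
  CLBlastStatusCode status = CLBlastSuccess;
  for (size_t i = 0; i < batch_count; ++i) {
    // sum += beta * C[i], with the i-th batch of C viewed as a flat vector
    status = CLBlastSaxpy(c_stride, beta,
                          c, i * c_stride, 1,  // x = i-th batch of C
                          sum, 0, 1,           // y = running sum
                          queue, NULL);
    if (status != CLBlastSuccess) { return status; }
  }
  return status;
}

This launches one kernel per batch, which is exactly the overhead a dedicated StridedBatched reduction routine would avoid.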

CNugteren commented 5 years ago

Sorry for my late reply.

I don't think 'batched' is the right wording here. That is typically used to indicate an operation that is repeated multiple times but on independent data. In your case the SUM variable is shared, right? So the iterations of the 'batched' loop are not actually independent of each other.

I think what you are looking for is perhaps something like the XSUM routine from CLBlast, but going from 3D (batches of 2D matrices) to 2D (a single 2D matrix) rather than from 1D (a vector) to 0D (a scalar). Perhaps if you view your 2D matrices as a flat vector and re-organize your data, something like #349 could fit your needs?

CNugteren commented 5 years ago

Could you have a look at the latest reply in #349 regarding a solution with GEMV? I think this solves your issue as well: you can just use GEMV, set the x vector to all 1's with a size equal to the number of values you want to sum, and use the other dimension (either m or n, depending on how the data is currently laid out in memory) as the size of the matrix. For example, set a_transposed = true, m = num_batches (the number of sums you want to do), and n = height_of_C * width_of_C (the matrix C flattened).
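As an illustration only, a sketch of that GEMV call with the CLBlast C API, assuming the batched C buffer is laid out as num_batches contiguous slices of height_of_C * width_of_C floats; the wrapper function and names (c, ones, sum, num_batches, hw) are placeholders, not part of the library:

// sum = 1.0 * A^T * ones + 0.0 * sum, where A is the num_batches-by-hw matrix
// formed by the flattened batches of C; each output element is the sum of the
// corresponding element over all batches.
#include <clblast_c.h>

CLBlastStatusCode sum_batches_with_gemv(cl_mem c,        // batched C, num_batches x (H*W) floats
                                        cl_mem ones,     // vector of num_batches 1.0f values
                                        cl_mem sum,      // output, H*W floats
                                        const size_t num_batches,
                                        const size_t hw, // height_of_C * width_of_C
                                        cl_command_queue* queue) {
  return CLBlastSgemv(CLBlastLayoutRowMajor, CLBlastTransposeYes,
                      num_batches, hw,  // m = num_batches, n = H*W
                      1.0f,
                      c, 0, hw,         // A with leading dimension H*W
                      ones, 0, 1,       // x = all-ones vector of length m
                      0.0f,
                      sum, 0, 1,        // y = the summed C matrix, flattened
                      queue, NULL);
}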

Could you let me know if this works for you?