Open TaihuLight opened 5 years ago
Sorry for my late reply.
I don't think 'batched' is the right wording here. That is typically used to indicate an operation that is repeated multiple times but on independent data. In your case the `SUM` variable is shared, right? So the iterations of the 'batched' loop are not actually independent of each other.
I think what you are looking for is perhaps something like the XSUM routine from CLBlast, but then from 3D (batches of 2D matrices) to 2D (a single 2D matrix) rather than from 1D (a vector) to 0D (a scalar). Perhaps if you view your 2D matrices as flat vectors and reorganize your data, something like in #349 could fit your need?
Could you have a look at the latest reply in #349 regarding a solution with GEMV? I think this solves your issue as well, since you can just use GEMV, set the `x` vector to all 1's with a size equal to the number of values you want to sum, and use the other dimension (either `m` or `n`, depending on how the data is currently laid out in memory) as the size of the matrix. For example, set `a_transposed = true`, `m = num_batches` (the number of values in each sum), and `n = height_of_C * width_of_C` (the matrix C flattened).
Could you let me know if this works for you?
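To make the GEMV suggestion concrete, here is a minimal CPU reference sketch of what that call would compute. It does not call CLBlast; it only mirrors the math `y = A^T * x` with `x` set to all ones. The function name `sum_batches_as_gemv` is hypothetical, and the sketch assumes the batches are stored back to back in row-major order, so that row `i` of `A` is batch `i` flattened.

```c
#include <assert.h>
#include <stddef.h>

/* Reference (CPU) version of y = A^T * x with x = all ones.
 * Assumption: A is row-major with m rows (one per batch) and
 * n columns (one per element of the flattened matrix C). */
void sum_batches_as_gemv(size_t m, size_t n, const float *a, float *y) {
    assert(a != NULL && y != NULL);

    /* Start from zero, as GEMV with beta = 0 would. */
    for (size_t j = 0; j < n; ++j) {
        y[j] = 0.0f;
    }

    for (size_t i = 0; i < m; ++i) {        /* i: batch index, x[i] == 1 */
        for (size_t j = 0; j < n; ++j) {    /* j: element of flattened C */
            y[j] += 1.0f * a[i * n + j];
        }
    }
}
```

With `m = num_batches` and `n = height_of_C * width_of_C`, the output `y` holds the element-wise sum of all batches, which can be read back as a single `height_of_C` by `width_of_C` matrix.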
https://github.com/CNugteren/CLBlast/issues/346
However, a stride of 0 for `C` in `CLBlastSgemmStridedBatched()` is also useful in deep learning applications, such as implementations of the backward pass of a convolution layer. It can be implemented in two steps: (1) use `CLBlastSgemmStridedBatched()` to compute the batched matrices `C`; (2) add a new routine (e.g., named `StridedBatchedAddMatrix`) to compute the sum over the batches of the matrix `C` computed in step (1). Thus, the new routine adds all batched matrices element-wise to reduce the results of `CLBlastSgemmStridedBatched()`. That is, it computes the sum of the batched matrices, where `c_stride` is the stride between two batches of the `C` matrix and the batched `C` matrices are computed with `CLBlastSgemmStridedBatched()`.
Therefore, the new routine for adding matrices is similar to a routine for adding vectors, for instance `xAXPYStridedBATCHED`: a StridedBatched version of AXPY.
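As a rough illustration of step (2), the following CPU reference sketch shows what such a routine could compute. `StridedBatchedAddMatrix` is only a proposed name in this issue and not an existing CLBlast routine; the function name, signature, and argument order below are assumptions, and each matrix is assumed to be stored contiguously (`rows * cols` floats) starting at offset `b * c_stride`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference for the proposed routine: element-wise sum of
 * batch_count matrices of size rows x cols, where batch b starts at
 * c + b * c_stride. Not part of CLBlast. */
void strided_batched_add_matrix_ref(size_t rows, size_t cols,
                                    const float *c, size_t c_stride,
                                    size_t batch_count, float *c_sum) {
    assert(c != NULL && c_sum != NULL);
    const size_t count = rows * cols;   /* elements per matrix */

    for (size_t i = 0; i < count; ++i) {
        c_sum[i] = 0.0f;
    }

    for (size_t b = 0; b < batch_count; ++b) {
        const float *c_b = c + b * c_stride;   /* start of batch b */
        for (size_t i = 0; i < count; ++i) {
            c_sum[i] += c_b[i];
        }
    }
}
```

When `c_stride` equals `rows * cols`, this is the same reduction as a GEMV with an all-ones vector over the flattened batches; a larger `c_stride` simply skips padding between batches.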