Tangesion opened this issue 4 months ago (Open)
I have encountered the same issue. How can it be solved?
I'm performing a batched 4-bit matrix multiply and I'm getting the error "cudaMemcpy result = misaligned address". I followed the example below. As an example, matrix C can be seen as
(0,0,0) | (0,0,1) | (0,0,2) | (1,0,0) | (1,0,1) | (1,0,2) |
(0,1,0) | (0,1,1) | (0,1,2) | (1,1,0) | (1,1,1) | (1,1,2) |
(0,2,0) | (0,2,1) | (0,2,2) | (1,2,0) | (1,2,1) | (1,2,2) |
(0,3,0) | (0,3,1) | (0,3,2) | (1,3,0) | (1,3,1) | (1,3,2) |
(0,4,0) | (0,4,1) | (0,4,2) | (1,4,0) | (1,4,1) | (1,4,2) |
(0,5,0) | (0,5,1) | (0,5,2) | (1,5,0) | (1,5,1) | (1,5,2) |
where we denote each element with (batch_idx, row_idx, column_idx). In this example, batch size is 2, M is 6 and N is 3. The stride (batch_stride_C) between the first element of two batches is ldc * n.
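To make the layout of C concrete, here is a minimal sketch of how a (batch_idx, row_idx, column_idx) triple could map to a linear element offset. It assumes column-major storage with ldc equal to M (no padding), which the example implies but does not state outright. The helper name c_offset is hypothetical and is not part of CUTLASS.

```cpp
#include <cassert>
#include <cstdint>

// Linear element offset of C(batch_idx, row_idx, column_idx) for the
// column-major layout pictured above, where the batches sit side by side.
// Assumption: ldc is the leading dimension (equal to M when there is no padding).
std::int64_t c_offset(std::int64_t batch_idx, std::int64_t row_idx,
                      std::int64_t column_idx, std::int64_t ldc,
                      std::int64_t n) {
  assert(row_idx >= 0 && row_idx < ldc);
  assert(column_idx >= 0 && column_idx < n);
  // Distance, in elements, between the first elements of consecutive batches.
  const std::int64_t batch_stride_C = ldc * n;
  return batch_idx * batch_stride_C + column_idx * ldc + row_idx;
}
```

With M = 6 and N = 3 as in the example, c_offset(1, 0, 0, 6, 3) gives 18, which is ldc * n.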
matrix A can be seen as
(0,0,0) | (0,0,1) | (1,0,0) | (1,0,1) |
(0,1,0) | (0,1,1) | (1,1,0) | (1,1,1) |
(0,2,0) | (0,2,1) | (1,2,0) | (1,2,1) |
(0,3,0) | (0,3,1) | (1,3,0) | (1,3,1) |
(0,4,0) | (0,4,1) | (1,4,0) | (1,4,1) |
(0,5,0) | (0,5,1) | (1,5,0) | (1,5,1) |
, where batch size is 2, M is 6 and K is 2. The stride (batch_stride_A) between the first element of two batches is lda * k.
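Because the question involves 4-bit elements, one thing that may be worth checking is whether a batch stride, counted in elements, lands each batch on a suitably aligned byte address. The sketch below assumes a 16-byte alignment requirement purely for illustration. The actual requirement depends on the kernel configuration, which the post does not state. The helper stride_is_aligned is hypothetical and is not a CUTLASS function.

```cpp
#include <cassert>
#include <cstdint>

// Bits per element for a 4-bit operand type.
constexpr std::int64_t kBitsPerElement = 4;

// Hypothetical helper: returns true when a stride counted in 4-bit elements
// is a whole multiple of alignment_bytes once converted to bits.
bool stride_is_aligned(std::int64_t stride_elements,
                       std::int64_t alignment_bytes) {
  assert(stride_elements >= 0 && alignment_bytes > 0);
  const std::int64_t stride_bits = stride_elements * kBitsPerElement;
  return stride_bits % (alignment_bytes * 8) == 0;
}
```

For the sizes in this example (lda = 6, k = 2), batch_stride_A is 12 elements, or 6 bytes for 4-bit data, so stride_is_aligned(12, 16) returns false. A stride of 32 elements (16 bytes) would pass the same check. A row-major layout would give a different stride, so the numbers here only illustrate the calculation.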
matrix B can be seen as
(0,0,0) | (0,0,1) | (0,0,2) |
----------------------------- batch 0
(0,1,0) | (0,1,1) | (0,1,2) |
-----------------------------
(1,0,0) | (1,0,1) | (1,0,2) |
----------------------------- batch 1
(1,1,0) | (1,1,1) | (1,1,2) |
, where the batch size is 2, N is 3 and K is 2. The stride (batch_stride_B) between the first element of two batches is k.
This is the format I use for my matrix multiplication, but unlike this example, the A and C matrices are changed to row-major to accommodate the 4-bit matrix multiplication. Here is my code: