Closed alex4321 closed 7 months ago
Hi! I'll take a more detailed look tomorrow.
In the meantime, what GPU are you using? We've observed that the 2x8 kernel might fail on older GPUs that don't have enough cache. We haven't properly determined which GPUs are affected, though.
2080Ti
And I commented out code to figure out that the first place the error occurs is:
cudaFuncSetAttribute(
  Code2x8MatVec, cudaFuncAttributeMaxDynamicSharedMemorySize, shared
);
@BlackSamorez
So
void code2x8_matvec_cuda(
  const void* __restrict__ A,
  const void* __restrict__ B,
  void* __restrict__ C,
  const void* __restrict__ codebook,
  int prob_m,
  int prob_k
) {
  int sms;
  cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
  int waves = 0;
  int thread_m;
  do {
    waves++;
    thread_m = ceildiv(prob_m, waves * sms);
  } while (thread_m > THREAD_M);
  int blocks = ceildiv(prob_m, thread_m);
  int threads = 32 * thread_m;
  int shared = 16 * (2 * 256 * 8 + 32 * 9);
  /*
  cudaFuncSetAttribute(
    Code2x8MatVec, cudaFuncAttributeMaxDynamicSharedMemorySize, shared
  );
  Code2x8MatVec<<<blocks, threads, shared>>>(
    (const int4*) A,
    (const int4*) B,
    (int4*) C,
    (const int4*) codebook,
    prob_m,
    prob_k
  );
  */
}
outputs garbage, yet works, while
void code2x8_matvec_cuda(
  const void* __restrict__ A,
  const void* __restrict__ B,
  void* __restrict__ C,
  const void* __restrict__ codebook,
  int prob_m,
  int prob_k
) {
  int sms;
  cudaDeviceGetAttribute(&sms, cudaDevAttrMultiProcessorCount, 0);
  int waves = 0;
  int thread_m;
  do {
    waves++;
    thread_m = ceildiv(prob_m, waves * sms);
  } while (thread_m > THREAD_M);
  int blocks = ceildiv(prob_m, thread_m);
  int threads = 32 * thread_m;
  int shared = 16 * (2 * 256 * 8 + 32 * 9);
  cudaFuncSetAttribute(
    Code2x8MatVec, cudaFuncAttributeMaxDynamicSharedMemorySize, shared
  );
  /*
  Code2x8MatVec<<<blocks, threads, shared>>>(
    (const int4*) A,
    (const int4*) B,
    (int4*) C,
    (const int4*) codebook,
    prob_m,
    prob_k
  );
  */
}
does not.
Well, it seems it is probably my issue.
According to the CUDA C Programming Guide, compute capability 7.x devices on Turing allow a single thread block to dynamically allocate up to 64 KB of shared memory, and the 2080 Ti is compute capability 7.5.
So my maximum dynamic shared memory is 64 KB, i.e. 65536 bytes, while the kernel:
int shared = 16 * (2 * 256 * 8 + 32 * 9);
cudaFuncSetAttribute(
  Code2x8MatVec, cudaFuncAttributeMaxDynamicSharedMemorySize, shared
);
tries to set a shared memory size of 16 * (2 * 256 * 8 + 32 * 9) = 70144 bytes.
Looks like there isn't much we can do about it then. The speedup is mostly there due to the codebooks fitting into shared memory.
Concerning your second question about the axes order: it is indeed different for the Numba kernels, because they benefit from a different memory layout than all the other kernels. You can see the tensors being transposed once during inference. The code is a mess, and we're hoping to improve both the speed and readability by implementing a proper one-time kernel selector in the near future.
Thanks for pointing out the source of the issue, anyway.
I have an Ubuntu 23.10 system.
I installed CUDA toolkit 12.1 using https://developer.nvidia.com/cuda-12-1-0-download-archive?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_local
(since it needs headers and such, I can't just install CUDA through conda).
The rest of my environment:
AQLM is installed from the latest GitHub state.
Now if I try to run some code:
it tells me
which I guess is more or less fine.
But:
gives me
p.s. By the way, it's kinda offtopic, but I don't get it: CUDA_KERNEL.code2x8_matmat and numba_gemm_lut consume the codes argument as-is, as well as codebooks