Closed: harryhan618 closed this issue 11 months ago
Thanks for your question!

1. `bgmv` is used when `bs * seq_len` is small (in the decode stage, where each request only has one token). When `bs * seq_len` is large, we essentially want something like CUTLASS's Grouped GEMM kernel, but it is non-trivial to modify it to support non-contiguous single-adapter weights that align with our memory pool, so we tried Triton. For now it is a temporary workaround: we benchmarked it, and the Triton kernel currently cannot outperform CUTLASS's Grouped GEMM. Improving this is on our TODO list.
2. Yes, since `feat_in` is relatively small (e.g., 16 and 64 for llama-7b), we believe one warp is enough.
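As an illustration of the warp-level reduction being discussed, here is a minimal sketch assuming fp16 inputs and a `vec_size` of 8. The names `warp_dot`, `warp_dot_kernel`, and `VEC_SIZE` are made up for this example and are not taken from the repository's kernel; the point is only that `__shfl_down_sync` reduces across the 32 lanes of a single warp, which is why `feat_in` is expected to fit within `32 * vec_size` elements.

```cuda
#include <cuda_fp16.h>

constexpr int WARP_SIZE = 32;
constexpr int VEC_SIZE  = 8;   // 8 halves = one 16-byte vectorized load per thread

// One warp computes dot(x, w) over feat_in elements.
// Each lane owns a VEC_SIZE-wide slice, so a single warp covers at most
// WARP_SIZE * VEC_SIZE = 256 inputs; that is the "feat_in <= 32 * vec_size"
// bound mentioned in the question.
__device__ float warp_dot(const half* x, const half* w, int feat_in) {
  const int lane = threadIdx.x % WARP_SIZE;
  float partial = 0.f;
  #pragma unroll
  for (int v = 0; v < VEC_SIZE; ++v) {
    const int i = lane * VEC_SIZE + v;
    if (i < feat_in) {
      partial += __half2float(x[i]) * __half2float(w[i]);
    }
  }
  // Tree reduction with __shfl_down_sync: after log2(32) = 5 steps, lane 0
  // holds the sum of all 32 partials. This works only because every
  // participating thread is in the same warp.
  for (int offset = WARP_SIZE / 2; offset > 0; offset >>= 1) {
    partial += __shfl_down_sync(0xffffffff, partial, offset);
  }
  return partial;  // meaningful on lane 0
}

// Tiny wrapper so the sketch is launchable: <<<1, 32>>> computes y[0].
__global__ void warp_dot_kernel(const half* x, const half* w, float* y,
                                int feat_in) {
  const float s = warp_dot(x, w, feat_in);
  if (threadIdx.x == 0) {
    y[0] = s;
  }
}
```

Launched as `warp_dot_kernel<<<1, 32>>>(x, w, y, feat_in)`, lane 0 writes the reduced sum; anything beyond `32 * VEC_SIZE` input elements would need either a per-lane loop or a cross-warp reduction through shared memory.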
@Ying1123 @caoshiyi Thanks for your explanation. Another question: does `bgmv` outperform the `triton` kernel in the decoding phase? (Since I saw that during decoding the only kernel used is `bgmv`.) Thanks.
Hello! Thanks for open-sourcing the repository! I'm learning to write CUDA code, so I think I learned a lot here. I have two questions.

1. In Section 5.3 of your paper, you mention that the CUDA kernels are different between the prefill stage and the decode stage. I wonder why this is done? It seems that `bgmv` would still do the work if the shape `[bs, seqlen, dim]` becomes `[bs * seqlen, dim]` (see the small indexing sketch after this list).
2. This question is more detailed and about the CUDA kernel implementation. In `bgmv_multi_lora_rank_expand_kernel`, line #202, the `feat_in` dimension is reduced through `__shfl_down_sync`. I think `__shfl_down_sync` only works within one warp. Is this because `feat_in` is relatively small and will not exceed one warp (i.e., 32 threads)? That would also mean `feat_in` should not exceed `32 * vec_size` (for fp16, `vec_size` is 8)?
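Regarding question 1, here is a minimal sketch, assuming a contiguous row-major layout, of why flattening `[bs, seqlen, dim]` into `[bs * seqlen, dim]` is only a re-indexing of the same buffer: a kernel that loops over rows cannot distinguish a prefill (batch, position) pair from a single decode token. The helper names below are hypothetical and not from the repository's code.

```cuda
#include <cuda_fp16.h>

// x viewed as [bs, seqlen, dim], row-major and contiguous.
__device__ __forceinline__ const half* row_ptr_3d(const half* x, int b, int s,
                                                  int seqlen, int dim) {
  return x + (static_cast<long long>(b) * seqlen + s) * dim;
}

// The same buffer viewed as [bs * seqlen, dim]: row == b * seqlen + s
// points at exactly the same memory as row_ptr_3d(x, b, s, seqlen, dim),
// so no data movement is needed to switch between the two views.
__device__ __forceinline__ const half* row_ptr_2d(const half* x, int row,
                                                  int dim) {
  return x + static_cast<long long>(row) * dim;
}
```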