chenzhuofu opened this issue 2 months ago
For the second question, it seems `arg_topk` can also return `SsmInferenceResult`. We will use `arg_topk` as the last operator of the SSM.

For question 1, I prefer keeping the single-op structure for multi-head attention. What do you think? @zwang86 @zikun-li
To implement this, I need to add a `current_phase` field to `BatchConfig`, so that the op can choose which kernel to execute (PROMPT or GENERATION).
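A minimal sketch of what this could look like (the enum and field names here are assumptions for illustration, not FlexFlow's actual API):

```cpp
#include <cassert>
#include <cstring>

// Hypothetical sketch of the proposed current_phase field; names are
// illustrative, not the real BatchConfig layout.
enum class InferencePhase { PROMPT, GENERATION };

struct BatchConfig {
  InferencePhase current_phase = InferencePhase::PROMPT;
};

// Each operator can then dispatch on the phase internally, instead of the
// graph containing two separate attention operators.
const char *select_attention_kernel(const BatchConfig &bc) {
  return bc.current_phase == InferencePhase::PROMPT
             ? "prefill_kernel"  // batched prompt-token path
             : "decode_kernel";  // single-token generation path
}
```

With this, the operator DAG stays unchanged between stages; only the kernel launched inside the op differs.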
I believe there are different CUDA kernels for prefilling and decoding tokens already implemented by @xinhaoc, but those kernels can still live in the same operator. As discussed with @jiazhihao, we want to use those kernels because they are optimized for their use case. Hi @xinhaoc, do you have any thoughts?
For speculative decoding, we currently have `current_depth` in `TreeSearchBatchConfig` to indicate whether to use the prompt kernel (i.e. if `current_depth == 0`, we should run the prompt kernel).
> I believe there are different CUDA kernels for prefilling and decoding tokens already implemented by @xinhaoc, but those kernels can still live in the same operator. As discussed with @jiazhihao, we want to use those kernels because they are optimized for their use case. Hi @xinhaoc, do you have any thoughts?
Yes, in CUDA we have different kernels within one `multihead_attention` operator. What I meant is: should we split them into two operators, as the higher level does? (Now I think there seems to be no need :P)
> For speculative decoding, we currently have `current_depth` in `TreeSearchBatchConfig` to indicate whether to use the prompt kernel (i.e. if `current_depth == 0`, we should run the prompt kernel).
For tree verification, do we have a similar method to figure out whether we are in the prompt phase?
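One possible answer, purely as a sketch: derive the phase from per-request token counts rather than storing a separate flag. The struct and field names below are hypothetical, not the actual verification batch config:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: infer prompt vs. generation from token counts.
// RequestInfo and its fields are illustrative names only.
struct RequestInfo {
  int prompt_length = 0;     // tokens in the original prompt
  int tokens_processed = 0;  // tokens already run through the model
};

bool in_prompt_phase(const RequestInfo &req) {
  return req.tokens_processed < req.prompt_length;
}

// The batch is still prefilling if any request has unprocessed prompt tokens.
bool batch_in_prompt_phase(const std::vector<RequestInfo> &reqs) {
  for (const auto &r : reqs) {
    if (in_prompt_phase(r)) {
      return true;
    }
  }
  return false;
}
```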
Related issues
#1364 #1361 #1333
Description
We proposed the inference implementation refactoring, which mainly involves Pipeline Split and Struct Simplification, and this raises some issues to discuss around operator (kernel) changes. I will list them here; if I missed something, please feel free to correct me~

1. For splitting the prefilling and decoding stages
Previously we mixed the `prompt` phase and `generation` phase of the calculation in one inference kernel (`spec_inc_multihead_self_attention` or `tree_inc_multihead_self_attention`). To support split stages, we should also split the mixed calculation.

But here's a problem: should we provide prompt and generation as two distinct inference kernel ops, or keep one op and do a conditional branch inside it for the different stage calculations? The former approach would force changes to the operator DAG, so I think it is not good.
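To make the trade-off concrete, here is a sketch of the "one op, internal branch" option (class and method names are illustrative assumptions): the attention op stays a single node in the DAG, and only the kernel chosen inside `forward()` differs per stage.

```cpp
#include <cassert>
#include <string>

// Illustrative sketch: a single attention operator that branches on the
// stage internally, so the operator graph is identical for prefill and
// decode steps. Names are assumptions, not the real FlexFlow classes.
class IncMultiHeadSelfAttention {
public:
  // Returns the name of the kernel that would be launched.
  std::string forward(bool is_prompt_phase) const {
    return is_prompt_phase ? launch_prefill() : launch_decode();
  }

private:
  std::string launch_prefill() const { return "prefill"; }
  std::string launch_decode() const { return "decode"; }
};
```

The two-op alternative would instead put `launch_prefill` and `launch_decode` in separate operators, which requires rewriting the DAG whenever a request moves from prefill to decode.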
2. For simplifying the `BatchConfig` structure

Trivial changes are adopted. But I haven't fully figured out how we switch from `BeamSearchBC` to `TreeSearchBC`.

In the BeamSearch version, the last layer of the SSM is `beam_topk`, and its output is stored in `BeamInferenceResult` (using `download_tensor`). In the TreeSearch version, `SsmInferenceResult` is the same as `BeamInferenceResult`, so I guess we will still use `beam_topk`.

But `beam_topk` uses some fields, like `sub_requests` and `beamRequestsInfo::probs`, which were removed from the updated `TreeSearchBC`. Maybe we can discuss how to adapt it.
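As one possible adaptation (a sketch only, assuming plain per-request logits in and token ids out): an arg-top-k that needs neither `sub_requests` nor per-beam probs, just `k` and the logits.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sketch of a beam_topk replacement that does not rely on sub_requests
// or beamRequestsInfo::probs: a plain arg-top-k over one request's
// logits, returning the k highest-scoring token ids, best first.
std::vector<int> arg_topk(const std::vector<float> &logits, int k) {
  std::vector<int> idx(logits.size());
  std::iota(idx.begin(), idx.end(), 0);  // 0, 1, ..., n-1
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return logits[a] > logits[b]; });
  idx.resize(k);
  return idx;
}
```

If `SsmInferenceResult` only stores token ids (no probabilities), something in this shape could serve as the last SSM operator without the removed beam-search fields.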