zikun-li opened this issue 2 months ago
The new version of `BatchConfig`:
```cpp
class BatchConfig {
public:
  static int const MAX_NUM_REQUESTS = 64;
  static int const MAX_NUM_TOKENS = 1024;
  static int const MAX_SPEC_TREE_TOKEN_NUM = 64;

  int num_tokens;
  int num_available_requests; // Added

  struct PerRequestInfo {
    int first_token_index_in_request;
    int first_token_offset_in_batch;
    int num_tokens_in_batch;
  };
  struct PerTokenInfo {
    int abs_index_in_request;
    int request_index;
    TokenId token_id;
  };

  BitMask causalMask[MAX_NUM_REQUESTS];
  PerRequestInfo requestsInfo[MAX_NUM_REQUESTS];
  PerTokenInfo tokensInfo[MAX_NUM_TOKENS];
  bool request_available[MAX_NUM_REQUESTS]; // Name changed
};
```
Fields to remove:

- `num_generation_tokens`
- `PerRequestInfo.max_sequence_length`
- `PerRequestInfo.batch_config_request_id`
- `PerRequestInfo.prompt_phase`
- `PerRequestInfo.request_guid`
- `request_running`

Name change:

- `request_complete` -> `request_available`

Field added to `BatchConfig`:

- `num_available_requests`
The new version of `TreeSearchBatchConfig`:
```cpp
class TreeSearchBatchConfig : public BatchConfig {
public:
  inline static int const MAX_SPECULATIVE_TREE_BRANCHES = 3;
  inline static int const MAX_TREE_DEPTH = 16;

  int current_depth = 0;
  int model_id;
};
```
Fields to remove:

- `tree_requests_info`
- `speculative_request_num`: because we decided to save this in `BatchConfig`
I'll take on the kernel-side development.

I have a question: I found some usages of `batch_config_request_id` in the .cu code, and it is now replaced by `RequestManager::guid_of_requests`. Should I assume that the request manager will take on the duty of copying the `guid_of_requests` field into `IncMultiHeadSelfAttentionMeta` (just like `token_infos` and `request_infos`)?
@zikun-li

`batch_config_request_id` is a compact list of the indices of the available requests. In the new `BatchConfig` class we have an equivalent field, `request_available`, which is an indicator array of whether each request slot is available. For example, if we have a batch with 4 slots where slots 0, 2 and 3 hold available requests, `batch_config_request_id` will be the compact list `[0, 2, 3]`, and `request_available` will be `[true, false, true, true]`. Another difference is that in the previous `BatchConfig`, `batch_config_request_id` is scattered across the different `PerRequestInfo` entries in `requestsInfo`: in the above example, `requestsInfo[0].batch_config_request_id` would be `0`, and similarly `requestsInfo[1].batch_config_request_id = 2` and `requestsInfo[2].batch_config_request_id = 3`. I believe we can adapt the implementation to use `request_available` instead of `batch_config_request_id`, since they contain virtually the same information. @aetiurf
Yes, we can indeed use `request_available` to calculate the $i$-th available request index. However, my concern is here: https://github.com/flexflow/FlexFlow/blob/inference/src/ops/inc_multihead_self_attention.cu#L83

If we switch to `request_available`, we would obtain the index as:

```cpp
// Find the slot index of the request_idx-th available request.
int index = 0;
while (!request_available[index]) {
  ++index;
}
while (request_idx--) {
  ++index;
  while (!request_available[index]) {
    ++index;
  }
}
```

And this would result in an $O(N)$ calculation in each kernel thread (which was previously $O(1)$).
I agree with the idea of keeping `batch_config_request_id`. Here we launch `num_heads * num_requests` threadblocks (which equals `num_generation_tokens` in the decoding phase); if we switch to `request_available` without the lookup loop Zhuofu mentioned above, we would need to launch `num_heads * num_max_requests` threadblocks, which results in a larger kernel.
That's a valid concern, but I don't think we need to launch `num_heads * num_max_requests` threadblocks, because in the new version of `BatchConfig` we have a field `BatchConfig.num_available_requests` that indicates how many available requests there are in the batch. Does this solve the issue? @xinhaoc
@aetiurf For this part, it seems we don't need to worry too much, since `N` is typically very small. We could also maintain a list like `batch_config_request_id`, but I think that might be a little redundant. What do you think? @zwang86
Ah, I mixed it up with `MAX_NUM_REQUESTS = 1024`; for 64 there should not be much performance degradation. Then I have no problem here.
Actually it could be even smaller (e.g. 8 or 16).
> If we switch to `request_available` we will obtain the index as: `int index = 0; while (request_idx) { while (!request_available[index]) { ++index; } --request_idx; }`
>
> And this would result in an $O(N)$ calculation in each kernel thread (which was previously $O(1)$).

One annotation: the code snippet above makes the GPU threads totally divergent, so remember to add a trailing `__syncthreads()`.
Is `speculative_request_num` now `BatchConfig::num_available_requests`? And is `BeamSearchPerRequestInfo::sub_request_num` now the constant `MAX_SPECULATIVE_TREE_BRANCHES`? Please help me verify this :P

In the prefilling phase, `BatchConfig::num_available_requests` should be 1, while in the decoding phase it should be the same as `speculative_request_num`.
As we plan to move some states from the `BatchConfig` to the `RequestManager`, some fields in `BatchConfig` are rendered redundant. Of the data members of the current `BatchConfig`, the following fields seem redundant:

- `num_generation_tokens`: in previous versions, the prefilling and decoding phases of the small model were mixed, and this field recorded the number of tokens generated by the small model in the decoding phase. Now that we separate the prefilling and decoding phases of the small model, all tokens are generated in the decoding phase, and the number can be found in `num_tokens`.
- `PerRequestInfo.max_sequence_length`: I think this should be a field of `RequestManager`.
- `PerRequestInfo.batch_config_request_id`: we can store a mapping from the index of a request in the batch to the guid of the request.
- `request_completed`: this is stored in `RequestManager`.
- `request_running`: this is stored in `RequestManager`.

There are also some redundancies in the current `TreeSearchBatchConfig`. Since the base class `BatchConfig` already has a field in `PerRequestInfo` called `num_tokens_in_batch`, we don't need the struct `TreeSearchPerRequestInfo`: the only field in it, `num_tokens_at_depth`, is equivalent to `num_tokens_in_batch`.

Please let me know if any of the fields listed above is not redundant. Otherwise, let's remove them. We can also discuss moving other fields into `RequestManager` if you have other suggestions.