Open lhr-30 opened 1 month ago
The spec_infer binary works well for batch sizes 1, 2, 4, 8, and 16, but when I change the batch size to 32, it crashes with "stack smashing detected":

```
+ ngpus=1
+ fsize=30000
+ zsize=60000
+ max_sequence_length=256
+ max_tokens_per_batch=512
+ llm_model_name=huggyllama/llama-7b
+ ssm_model_name=JackFram/llama-68m
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 16 -ll:util 16 -ll:gpu 1 -ll:fsize 30000 -ll:zsize 60000 -llm-model huggyllama/llama-7b -ssm-model JackFram/llama-68m -prompt ./FlexFlow/inference/prompt/chatgpt_32.json --verbose --max-requests-per-batch 32 --max-sequence-length 256 --max-tokens-per-batch 512 -tensor-parallelism-degree 1 --fusion -output-file ./FlexFlow/inference/output/server_small-32_batchsize-tree_specinfer_tree_16core.txt
Applying fusion optimizations during compilation...
424 operators before fusion...
198 operators after fusion...
Applying fusion optimizations during compilation...
35 operators before fusion...
18 operators after fusion...
*** stack smashing detected ***: terminated
./server_gpu_experiments.sh: line 31: 1088568 Aborted (core dumped) ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --verbose --max-requests-per-batch $bs --max-sequence-length $max_sequence_length --max-tokens-per-batch $max_tokens_per_batch -tensor-parallelism-degree $ngpus --fusion -output-file ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.txt > ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.ou
```
When I set the number of CPU cores to 1, it gets stuck, probably here, at ./FlexFlow/src/runtime/request_manager.cc:283:

```cpp
if (get_num_ssms() == 0) {
  xxx
} else {
  std::cout << "Num of SSMs: " << get_num_ssms() << std::endl;
  for (int i = 0; i < get_num_ssms(); i++) {
    BeamTree beam_tree = BeamTree{};
    request.beam_trees.push_back(beam_tree);
  }
}
pending_request_queue.push(request);
all_requests[request.guid] = request;
{
  const std::lock_guard<std::mutex> lock(request_to_promise_mutex);
  request_to_promise[request.guid] = new std::promise<void>();
}
{
  std::string output = "New request tokens:";
  output = "[" + std::to_string(request.guid) + "]" + output;
  for (int i = 0; i < request.tokens.size(); i++) {
    output = output + " " + std::to_string(request.tokens[i]);
  }
  log_req_mgr.print("%s", output.c_str());
}
```
Below is the log:

```
[0 - 7efdb03fc000] 1.025782 {3}{RequestManager}: [1011486]New request tokens: 1 14350 263 26228 21256 1048 7535 17770 363 596 10462 29889 [0]14350 [1]263 [2]26228 [3]21256 [4]1048 [5]7535 [6]17770 [7]363 [8]596 [9]10462 [10]29889 Num of SSMs: 1
```
It gets stuck at the last prompt: "Write a short re-engagement email for a newsletter that's about tips for starting an online business. Use a friendly tone."
I am also stuck on this issue. Have you found any solution?