FlexFlow Serve: Low-Latency, High-Performance LLM Serving
https://flexflow.readthedocs.io
Apache License 2.0

Error when I use larger batch size for spec-infer #1491

Open lhr-30 opened 1 month ago

lhr-30 commented 1 month ago

Spec-infer works well for batch sizes 1, 2, 4, 8, and 16, but when I change the batch size to 32, it aborts with "stack smashing detected":

+ ngpus=1
+ fsize=30000
+ zsize=60000
+ max_sequence_length=256
+ max_tokens_per_batch=512
+ llm_model_name=huggyllama/llama-7b
+ ssm_model_name=JackFram/llama-68m
+ for bs in "${batch_sizes[@]}"
+ ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu 16 -ll:util 16 -ll:gpu 1 -ll:fsize 30000 -ll:zsize 60000 -llm-model huggyllama/llama-7b -ssm-model JackFram/llama-68m -prompt ./FlexFlow/inference/prompt/chatgpt_32.json --verbose --max-requests-per-batch 32 --max-sequence-length 256 --max-tokens-per-batch 512 -tensor-parallelism-degree 1 --fusion -output-file ./FlexFlow/inference/output/server_small-32_batchsize-tree_specinfer_tree_16core.txt
Applying fusion optimizations during compilation...
424 operators before fusion...
198 operators after fusion...
Applying fusion optimizations during compilation...
35 operators before fusion...
18 operators after fusion...
*** stack smashing detected ***: terminated
./server_gpu_experiments.sh: line 31: 1088568 Aborted                 (core dumped) ./FlexFlow/build/inference/spec_infer/spec_infer -ll:cpu $ncpus -ll:util $ncpus -ll:gpu $ngpus -ll:fsize $fsize -ll:zsize $zsize -llm-model $llm_model_name -ssm-model $ssm_model_name -prompt ./FlexFlow/inference/prompt/chatgpt_$bs.json --verbose --max-requests-per-batch $bs --max-sequence-length $max_sequence_length --max-tokens-per-batch $max_tokens_per_batch -tensor-parallelism-degree $ngpus --fusion -output-file ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.txt > ./FlexFlow/inference/output/server_small-${bs}_batchsize-tree_specinfer_tree_16core.ou
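
For context, "stack smashing detected" is raised by the compiler's stack protector: some function wrote past the end of a fixed-size stack buffer and corrupted the guard canary, so the process aborted. A plausible cause is a buffer somewhere that is sized for fewer than 32 requests per batch (or for fewer tokens than 32 requests generate). Below is a minimal sketch of that failure mode only; `MAX_REQUESTS` and `fill_batch` are hypothetical names standing in for whatever fixed capacity the crashing code assumes, not actual FlexFlow code:

```cpp
#include <cstdio>

// Hypothetical compile-time capacity, standing in for whatever
// fixed per-batch limit the crashing code assumes.
constexpr int MAX_REQUESTS = 16;

// Fills one slot per request in a fixed-size stack array. Once
// num_requests exceeds MAX_REQUESTS, the loop writes past the end
// of the array and clobbers the stack canary, so the function
// aborts on return with "*** stack smashing detected ***".
void fill_batch(int num_requests) {
  int slots[MAX_REQUESTS];
  for (int i = 0; i < num_requests; i++) {
    slots[i] = i; // out of bounds once i >= MAX_REQUESTS
  }
  std::printf("filled %d requests\n", num_requests);
}

int main() {
  fill_batch(32); // mirrors --max-requests-per-batch 32
  return 0;
}
```

If something like this is the cause, raising the runtime --max-requests-per-batch flag alone cannot help; the compile-time capacity would also have to grow, which would explain why 16 works and 32 does not.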

When I set the number of CPU cores to 1, it gets stuck instead, probably around here in ./FlexFlow/src/runtime/request_manager.cc:283:

  if (get_num_ssms() == 0) {
    // ... (incremental-decoding path, elided in the original report)
  } else {
    // Speculative path: allocate one empty BeamTree per SSM.
    std::cout << "Num of SSMs: " << get_num_ssms() << std::endl;
    for (int i = 0; i < get_num_ssms(); i++) {
      BeamTree beam_tree = BeamTree{};
      request.beam_trees.push_back(beam_tree);
    }
  }

  // Enqueue the request and register it in the global table.
  pending_request_queue.push(request);
  all_requests[request.guid] = request;
  {
    // Create the promise the submitting thread will later wait on.
    const std::lock_guard<std::mutex> lock(request_to_promise_mutex);
    request_to_promise[request.guid] = new std::promise<void>();
  }

  {
    // Log the prompt tokens of the new request.
    std::string output = "New request tokens:";
    output = "[" + std::to_string(request.guid) + "]" + output;
    for (int i = 0; i < request.tokens.size(); i++) {
      output = output + " " + std::to_string(request.tokens[i]);
    }
    log_req_mgr.print("%s", output.c_str());
  }
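
For context on why a single CPU core could hang here: after the snippet above enqueues the request, the submitting thread blocks on the std::promise registered in request_to_promise until a background serving task fulfills it. If the runtime is given too few CPU/utility threads, the task that would call set_value() may never be scheduled, and the wait blocks forever. A minimal sketch of that handoff pattern, with illustrative names rather than FlexFlow's actual threading setup:

```cpp
#include <future>
#include <iostream>
#include <thread>

int main() {
  // One promise per request; the submitter blocks on the matching
  // future until the request is marked complete.
  std::promise<void> done;
  std::future<void> fut = done.get_future();

  // Background "serving" thread, standing in for the runtime task
  // that processes the request. If the runtime never schedules it
  // (e.g. all cores are already occupied), wait() below blocks
  // forever: a hang rather than a crash.
  std::thread server([&done] { done.set_value(); });

  fut.wait(); // submitting thread parks here until set_value() runs
  std::cout << "request completed" << std::endl;
  server.join();
  return 0;
}
```

If that is what happens with -ll:cpu 1, the single-core hang would be a thread-starvation problem, separate from the stack corruption at batch size 32.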

Below is the log:

[0 - 7efdb03fc000]    1.025782 {3}{RequestManager}: [1011486]New request tokens: 1 14350 263 26228 21256 1048 7535 17770 363 596 10462 29889
[0]14350
[1]263
[2]26228
[3]21256
[4]1048
[5]7535
[6]17770
[7]363
[8]596
[9]10462
[10]29889
Num of SSMs: 1

It gets stuck at the last prompt: "Write a short re-engagement email for a newsletter that's about tips for starting an online business. Use a friendly tone."

SeungjaeLim commented 1 week ago

I am also stuck on this issue. Have you found any solution?