InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[side-effect]Fix param `--cache-max-entry-count` is not taking effect (#1758) #1778

Closed · QwertyJack closed 3 months ago

QwertyJack commented 3 months ago

Motivation

The `--cache-max-entry-count` parameter does not take effect, so GPU RAM usage exceeds the limit the user configured.
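
For context, `--cache-max-entry-count` controls the fraction of GPU memory reserved for the KV cache. A minimal illustration (not LMDeploy's actual code; `kv_cache_budget` and its arguments are hypothetical) of why a silently ignored value matters:

```python
def kv_cache_budget(free_mem: int, cache_max_entry_count: float) -> int:
    """Return the number of bytes reserved for the KV cache,
    given free GPU memory in bytes and the configured fraction."""
    return int(free_mem * cache_max_entry_count)

# If a user-supplied 0.2 is dropped and the 0.8 default is used instead,
# the cache claims four times the intended memory on an 80 GiB card:
free = 80 * 2**30
assert kv_cache_budget(free, 0.8) == 4 * kv_cache_budget(free, 0.2)
```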

Modification

Add an initialization step immediately after creating the default `TurbomindModelConfig` object, so that the values from the engine config are applied instead of being silently dropped.
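
The shape of the fix can be sketched as follows. This is a hypothetical stand-in, not LMDeploy's real code: `EngineConfig` and `ModelConfig` play the roles of `TurbomindEngineConfig` and `TurbomindModelConfig`, and `from_engine_config` mimics the initialization step described above:

```python
from dataclasses import dataclass, fields

@dataclass
class EngineConfig:
    cache_max_entry_count: float = 0.8
    session_len: int = 4096

@dataclass
class ModelConfig:
    cache_max_entry_count: float = 0.8
    session_len: int = 2048
    weight_type: str = 'fp16'

def from_engine_config(engine: EngineConfig) -> ModelConfig:
    cfg = ModelConfig()  # default config object
    # The fix: initialize from the engine config immediately after
    # creating the defaults, so user-supplied values take effect.
    for f in fields(cfg):
        if hasattr(engine, f.name):
            setattr(cfg, f.name, getattr(engine, f.name))
    return cfg
```

Without the copy loop, `ModelConfig()` would keep its default `cache_max_entry_count` of 0.8 regardless of what the user passed on the command line, which matches the reported symptom.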

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.
lvhan028 commented 3 months ago

Sorry, this issue was introduced by my PR #1702.

lvhan028 commented 3 months ago

The following test case failed:

# Import paths as of lmdeploy v0.4.x:
from lmdeploy.messages import TurbomindEngineConfig
from lmdeploy.turbomind.deploy.converter import get_output_model_registered_name_and_config
from lmdeploy.turbomind.deploy.target_model.base import TurbomindModelConfig


def test_turbomind_from_hf():
    model_path = 'internlm/internlm2-chat-7b'
    engine_config = TurbomindEngineConfig(model_format='hf',
                                          tp=2,
                                          session_len=4000,
                                          max_batch_size=100,
                                          cache_max_entry_count=0.5,
                                          quant_policy=8,
                                          rope_scaling_factor=3.0,
                                          use_logn_attn=True,
                                          max_prefill_iters=64,
                                          num_tokens_per_iter=256)

    output_model_name, cfg = get_output_model_registered_name_and_config(model_path, model_format='hf', group_size=0)
    config = TurbomindModelConfig.from_engine_config(engine_config)
    config.update(cfg)

    assert config.tensor_para_size == engine_config.tp
    assert config.session_len == engine_config.session_len
    assert config.max_batch_size == engine_config.max_batch_size
    assert config.cache_max_entry_count == engine_config.cache_max_entry_count
    assert config.quant_policy == engine_config.quant_policy
    assert config.rope_scaling_factor == engine_config.rope_scaling_factor
    assert config.use_logn_attn == engine_config.use_logn_attn
    assert config.max_prefill_iters == engine_config.max_prefill_iters
    assert config.num_tokens_per_iter == engine_config.num_tokens_per_iter