HabanaAI / vllm-fork

A high-throughput and memory-efficient inference and serving engine for LLMs

[CI/BUILD] Spec decode ci #524

Open xuechendi opened 6 days ago

xuechendi commented 6 days ago

Add spec decode CI
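
As a minimal sketch, the added CI step presumably boils down to a shell wrapper along these lines (the script name and CI wiring are hypothetical; only the pytest targets and the VLLM_SKIP_WARMUP=True setting are taken from the timing runs quoted below):

#!/bin/bash
# Hypothetical CI wrapper for the new spec decode tests; only the pytest
# targets and VLLM_SKIP_WARMUP come from this PR's discussion.
set -euo pipefail

# Skipping vLLM's warmup phase keeps each suite to roughly a minute of
# wall time (see the timings reported below).
export VLLM_SKIP_WARMUP=True

# MLP-speculator and Medusa end-to-end greedy-correctness suites.
pytest -v tests/spec_decode/e2e/test_mlp_correctness.py::test_mlp_e2e_greedy_correctness
pytest -v tests/spec_decode/e2e/test_medusa_correctness.py::test_medusa_e2e_greedy_correctness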

xuechendi commented 4 days ago

> Hi, what is the total time of added tests?

tests/spec_decode/e2e/test_mlp_correctness.py::test_mlp_e2e_greedy_correctness[1-1-128-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] PASSED [ 50%]
tests/spec_decode/e2e/test_mlp_correctness.py::test_mlp_e2e_greedy_correctness[1-32-128-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] PASSED [100%]

===================================================================== warnings summary =====================================================================
../../../usr/lib/python3.10/inspect.py:288
  /usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
    return isinstance(object, types.FunctionType)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================== 2 passed, 1 warning in 57.45s ===============================================================

real    1m2.861s
user    2m55.088s
sys     0m49.803s

time VLLM_SKIP_WARMUP=True pytest -v tests/spec_decode/e2e/test_medusa_correctness.py::test_medusa_e2e_greedy_correctness
=================================================================== test session starts ====================================================================
platform linux -- Python 3.10.12, pytest-8.3.3, pluggy-1.5.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/vllm/vllm
configfile: pyproject.toml
plugins: anyio-4.6.2.post1
collected 2 items

tests/spec_decode/e2e/test_medusa_correctness.py::test_medusa_e2e_greedy_correctness[1-1-128-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] PASSED [ 50%]
tests/spec_decode/e2e/test_medusa_correctness.py::test_medusa_e2e_greedy_correctness[1-32-128-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0] PASSED [100%]

===================================================================== warnings summary =====================================================================
../../../usr/lib/python3.10/inspect.py:288
  /usr/lib/python3.10/inspect.py:288: FutureWarning: `torch.distributed.reduce_op` is deprecated, please use `torch.distributed.ReduceOp` instead
    return isinstance(object, types.FunctionType)

tests/spec_decode/e2e/test_medusa_correctness.py::test_medusa_e2e_greedy_correctness[1-1-128-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]
tests/spec_decode/e2e/test_medusa_correctness.py::test_medusa_e2e_greedy_correctness[1-32-128-test_llm_kwargs0-baseline_llm_kwargs0-per_test_common_llm_kwargs0-common_llm_kwargs0]
  /workspace/vllm/vllm/vllm/model_executor/model_loader/weight_utils.py:425: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
    state = torch.load(bin_file, map_location="cpu")

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================= 2 passed, 3 warnings in 77.72s (0:01:17) =========================================================

real    1m23.139s
user    3m59.330s
sys     0m57.539s
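
In total, with VLLM_SKIP_WARMUP=True the two added suites take roughly 2.5 minutes of wall time: about 1m03s for the MLP correctness tests and about 1m23s for the Medusa correctness tests. The torch.load FutureWarning in the Medusa run originates in vllm/model_executor/model_loader/weight_utils.py:425 and could presumably be silenced upstream by passing weights_only=True, as the warning itself recommends; it does not affect the test results.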