OpenBMB / XAgent

An Autonomous LLM Agent for Complex Task Solving
https://blog.x-agent.net/blog/xagent/
Apache License 2.0
7.82k stars 795 forks

Problem running XAgentGen: RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted #321

Closed Turingforce closed 6 months ago

Turingforce commented 6 months ago

Issue Description

A problem encountered when running XAgentGen:

RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

Steps to Reproduce

docker run -it -p 13520:13520 --network tool-server-network -v /mnt/XAgentLlama-7B-preview:/model:rw --gpus all --ipc=host xagentteam/xagentgen:latest python app.py --model-path /model --port 13520

Environment

Error Screenshots or Logs

Full log:

(xagent) root@ubuntu20:~/XAgent# docker run -it -p 13520:13520 --network tool-server-network -v /mnt/XAgentLlama-7B-preview:/model:rw --gpus all --ipc=host xagentteam/xagentgen:latest python app.py --model-path /model --port 13520

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

INFO 12-07 08:34:43 llm_engine.py:72] Initializing an LLM engine with config: model='/model', tokenizer='/model', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, seed=42)
Traceback (most recent call last):
  File "/app/app.py", line 58, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_configs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 486, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 269, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 305, in _init_engine
    return engine_class(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self._init_workers(distributed_init_method)
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 142, in _init_workers
    self._run_workers(
  File "/opt/conda/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 700, in _run_workers
    output = executor(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/vllm/worker/worker.py", line 70, in init_model
    self.model = get_model(self.model_config)
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/model_loader.py", line 98, in get_model
    model.load_weights(model_config.model, model_config.download_dir,
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 315, in load_weights
    for name, loaded_weight in hf_model_weights_iterator(
  File "/opt/conda/lib/python3.10/site-packages/vllm/model_executor/weight_utils.py", line 250, in hf_model_weights_iterator
    state = torch.load(bin_file, map_location="cpu")
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 993, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/opt/conda/lib/python3.10/site-packages/torch/serialization.py", line 447, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: invalid header or archive is corrupted

Cppowboy commented 6 months ago

Please make sure the checkpoint has been downloaded successfully.
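One way to check this before starting the container is to inspect the weight files on the host. The traceback shows `torch.load` failing while opening a `.bin` file as a zip archive, so a rough sanity check is to confirm that each `*.bin` file in the model directory is a readable zip archive and is not an empty file or a Git LFS pointer stub. The following is only a sketch: the function names are made up for illustration, the directory is whatever you mounted as `/model` (for example `/mnt/XAgentLlama-7B-preview` from the command above), and it assumes the weights are stored as `*.bin` files in the zip-based `torch.save` format.

```python
import sys
import zipfile
from pathlib import Path

# First bytes of a Git LFS pointer file (a small text stub, not real weights).
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def check_checkpoint_file(path: Path) -> str:
    """Classify one weight file as 'ok', 'empty', 'lfs-pointer', or 'not-a-zip'.

    Hypothetical helper for illustration; not part of XAgent or vLLM.
    """
    if path.stat().st_size == 0:
        return "empty"
    with path.open("rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
    if head == LFS_POINTER_PREFIX:
        return "lfs-pointer"
    # is_zipfile looks for the zip end-of-central-directory record, which is
    # normally missing when a download was cut off partway through.
    if not zipfile.is_zipfile(path):
        return "not-a-zip"
    return "ok"


def check_model_dir(model_dir: str) -> dict[str, str]:
    """Check every *.bin file directly inside model_dir (assumed layout)."""
    return {
        p.name: check_checkpoint_file(p)
        for p in sorted(Path(model_dir).glob("*.bin"))
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    results = check_model_dir(sys.argv[1])
    if not results:
        print("no *.bin files found")
    for name, status in results.items():
        print(f"{status:12s} {name}")
```

Saved under a name of your choice (say `check_ckpt.py`), it could be run as `python check_ckpt.py /mnt/XAgentLlama-7B-preview`. Any file not reported as `ok` should be downloaded again. Passing this check does not prove the file is intact; if the hosting page publishes checksums for the weight files, comparing against them is the more reliable test.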

Turingforce commented 6 months ago

> Please make sure the checkpoint has been downloaded successfully.

OK, I will try

Turingforce commented 6 months ago

> Please make sure the checkpoint has been downloaded successfully.

I finally got it working. The download kept failing because of the poor network conditions at my workplace. Thank you.
