Alpha-VLLM / LLaMA2-Accessory

An Open-source Toolkit for LLM Development
https://llama2-accessory.readthedocs.io/

SPHINX-MoE-1k weights seem to be corrupted #136

Closed FangGet closed 9 months ago

FangGet commented 9 months ago

The SPHINX-MoE-1k weights you uploaded to Hugging Face cannot be loaded correctly. This is the output log:

File "/data/mllm/project/LLaMA2-Accessory/accessory/demos/multi_turn_mm.py", line 66, in model_worker
    model = MetaModel.from_pretrained(args.pretrained_path, args.llama_type, args.llama_config, args.tokenizer_path,
  File "/data/mllm/project/LLaMA2-Accessory/accessory/model/meta.py", line 192, in from_pretrained
    load_result = tensor_parallel.load_tensor_parallel_model_list(model, pretrained_path)
  File "/data/mllm/project/LLaMA2-Accessory/accessory/util/tensor_parallel.py", line 466, in load_tensor_parallel_model_list
    state_dict = load_tensor_parallel_model_state_dict(
  File "/data/mllm/project/LLaMA2-Accessory/accessory/util/tensor_parallel.py", line 281, in load_tensor_parallel_model_state_dict
    local_state_dict = _load_checkpoint_and_merge_ranks(
  File "/data/mllm/project/LLaMA2-Accessory/accessory/util/tensor_parallel.py", line 101, in _load_checkpoint_and_merge_ranks
    load_tensor_parallel_shard_state_dict(
  File "/data/mllm/project/LLaMA2-Accessory/accessory/util/tensor_parallel.py", line 219, in load_tensor_parallel_shard_state_dict
    shard = torch.load(shard_path, map_location="cpu")
  File "/opt/conda/envs/accessory/lib/python3.10/site-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/envs/accessory/lib/python3.10/site-packages/torch/serialization.py", line 1018, in _legacy_load
    return legacy_load(f)
  File "/opt/conda/envs/accessory/lib/python3.10/site-packages/torch/serialization.py", line 904, in legacy_load
    tar.extract('storages', path=tmpdir)
  File "/opt/conda/envs/accessory/lib/python3.10/tarfile.py", line 2295, in extract
    tarinfo = self._get_extract_tarinfo(member, filter_function, path)
  File "/opt/conda/envs/accessory/lib/python3.10/tarfile.py", line 2302, in _get_extract_tarinfo
    tarinfo = self.getmember(member)
  File "/opt/conda/envs/accessory/lib/python3.10/tarfile.py", line 1985, in getmember
    raise KeyError("filename %r not found" % name)
KeyError: "filename 'storages' not found"

Could you upload them again? Thanks.

ChrisLiu6 commented 9 months ago

Thank you for the reminder. We are re-uploading the checkpoints and will notify you when the upload is finished.

FangGet commented 9 months ago

OK. It seems that only one of the eight shards is corrupted, either the fourth or the fifth; I'm not sure which. By the way, the SPHINX-MoE model is also corrupted, and the model name in the meta.json config should be changed from mistral to mixtral.

ChrisLiu6 commented 9 months ago

Sorry for the mistake; it has now been fixed. For SPHINX-MoE, please re-download consolidated.07-of-08.model.pth. For SPHINX-MoE-1k, we had previously uploaded intermediate (half-way) checkpoints and have now replaced them with fully trained checkpoints, so please re-download all of them.
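
For anyone who hits the same traceback and wants to find the damaged shard before re-downloading everything, here is a minimal sketch of a local check. It is not part of LLaMA2-Accessory: the function name `find_suspect_shards` and the glob pattern are hypothetical, with the pattern modeled on the shard name `consolidated.07-of-08.model.pth` mentioned above. It also assumes the shards were saved in PyTorch's zip-based checkpoint format, so a file that is not a valid zip archive is treated as suspect. This is only a heuristic and does not prove that a shard is intact.

```python
import hashlib
import zipfile
from pathlib import Path


def find_suspect_shards(ckpt_dir, pattern="consolidated.*-of-*.model.pth"):
    """Return a list of (file name, sha256 hex digest, looks_ok) per shard.

    `pattern` is an assumption based on the shard naming seen in this thread.
    """
    report = []
    for path in sorted(Path(ckpt_dir).glob(pattern)):
        # Hash the file in 1 MiB chunks so large shards are not read into RAM at once.
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                digest.update(chunk)
        # Assumption: checkpoints use PyTorch's zip-based format, so a truncated
        # or garbled download will usually fail this zip-structure check.
        looks_ok = zipfile.is_zipfile(path)
        report.append((path.name, digest.hexdigest(), looks_ok))
    return report


# Example usage (the directory name is a placeholder):
# for name, sha256, ok in find_suspect_shards("path/to/SPHINX-MoE"):
#     print(f"{name}  ok={ok}  sha256={sha256}")
```

Shards reported with `ok=False` are the ones to re-download first. The printed SHA-256 values can also be compared against the checksums listed for the files on the hosting page, where those are available.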

FangGet commented 9 months ago

OK, thanks so much!!