Jittor / JittorLLMs

Jittor large language model inference library, featuring high performance, low hardware requirements, good Chinese language support, and portability
Apache License 2.0

could not load the checkpoint, moving the folder still doesn't work #21

Open GabrielXie opened 1 year ago

GabrielXie commented 1 year ago
(jittor) PS F:\test-code\JittorLLMs> python cli_demo.py pangualpha
WARNING: APEX is not installed, multi_tensor_applier will not be available.
WARNING: APEX is not installed, using torch.nn.LayerNorm instead of apex.normalization.FusedLayerNorm!
F:\test-code\JittorLLMs\models\pangualpha
using world size: 1 and model-parallel size: 1
using torch.float32 for parameters ...
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:GPT2BPETokenizer
-------------------- arguments --------------------
  adlr_autoresume ................. False
  adlr_autoresume_interval ........ 1000
  apply_query_key_layer_scaling ... False
  apply_residual_connection_post_layernorm  False
  attention_dropout ............... 0.1
  attention_softmax_in_fp32 ....... False
  batch_size ...................... 1
  bert_load ....................... None
  bias_dropout_fusion ............. False
  bias_gelu_fusion ................ False
  block_data_path ................. None
  checkpoint_activations .......... False
  checkpoint_num_layers ........... 1
  clip_grad ....................... 1.0
  data_impl ....................... infer
  data_path ....................... None
  DDP_impl ........................ local
  distribute_checkpointed_activations  False
  distributed_backend ............. nccl
  dynamic_loss_scale .............. True
  eod_mask_loss ................... False
  eval_interval ................... 1000
  eval_iters ...................... 100
  exit_interval ................... None
  faiss_use_gpu ................... False
  finetune ........................ True
  fp16 ............................ False
  fp16_lm_cross_entropy ........... False
  fp32_allreduce .................. False
  genfile ......................... None
  greedy .......................... False
  hidden_dropout .................. 0.1
  hidden_size ..................... 2560
  hysteresis ...................... 2
  ict_head_size ................... None
  ict_load ........................ None
  indexer_batch_size .............. 128
  indexer_log_interval ............ 1000
  init_method_std ................. 0.02
  layernorm_epsilon ............... 1e-05
  lazy_mpu_init ................... None
  load ............................ C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt
  local_rank ...................... None
  log_interval .................... 100
  loss_scale ...................... None
  loss_scale_window ............... 1000
  lr .............................. None
  lr_decay_iters .................. None
  lr_decay_style .................. linear
  make_vocab_size_divisible_by .... 1
  mask_prob ....................... 0.15
  max_position_embeddings ......... 1024
  merge_file ...................... None
  min_lr .......................... 0.0
  min_scale ....................... 1
  mmap_warmup ..................... False
  model_parallel_size ............. 1
  no_load_optim ................... False
  no_load_rng ..................... False
  no_save_optim ................... False
  no_save_rng ..................... False
  num_attention_heads ............. 32
  num_layers ...................... 31
  num_samples ..................... 0
  num_unique_layers ............... None
  num_workers ..................... 2
  onnx_safe ....................... None
  openai_gelu ..................... False
  out_seq_length .................. 50
  override_lr_scheduler ........... False
  param_sharing_style ............. grouped
  params_dtype .................... torch.float32
  query_in_block_prob ............. 0.1
  rank ............................ 0
  recompute ....................... False
  report_topk_accuracies .......... []
  reset_attention_mask ............ False
  reset_position_ids .............. False
  sample_input_file ............... None
  sample_output_file .............. None
  save ............................ None
  save_interval ................... None
  scaled_upper_triang_masked_softmax_fusion  False
  seed ............................ 1234
  seq_length ...................... 1024
  short_seq_prob .................. 0.1
  split ........................... 969, 30, 1
  temperature ..................... 1.0
  tensorboard_dir ................. None
  titles_data_path ................ None
  tokenizer_type .................. GPT2BPETokenizer
  top_k ........................... 2
  top_p ........................... 0.0
  train_iters ..................... None
  use_checkpoint_lr_scheduler ..... False
  use_cpu_initialization .......... False
  use_one_sent_docs ............... False
  vocab_file ...................... models/pangualpha/megatron/tokenizer/bpe_4w_pcl/vocab
  warmup .......................... 0.01
  weight_decay .................... 0.01
  world_size ...................... 1
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 40000) with 0 dummy tokens (new size: 40000)
torch distributed is already initialized, skipping initialization ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
 > number of parameters on model parallel rank 0: 2625295360
global rank 0 is loading checkpoint C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth
could not load the checkpoint

[image] I moved it over from outside the folder, but it still doesn't work.

cjld commented 1 year ago

Did your checkpoint download fail? Please try deleting it and re-downloading.
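Before re-downloading, a quick way to check whether the file already on disk is complete is to try opening it directly. A minimal sketch, assuming the load path printed in the log above (the path is machine-specific; adjust it to whatever your log prints):

```python
import os
import torch  # the Pangu-alpha checkpoint is a standard PyTorch .pth file

# Path copied from the log above; replace with the path your own log prints.
ckpt = r"C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth"

print("exists:", os.path.exists(ckpt))
if os.path.exists(ckpt):
    # A partially downloaded file will be far smaller than the full checkpoint (several GB).
    print("size (GB): %.2f" % (os.path.getsize(ckpt) / 1024**3))
    try:
        state = torch.load(ckpt, map_location="cpu")
        print("loaded OK, top-level keys:", list(state.keys()))
    except Exception as e:
        print("load failed, file is likely truncated or corrupted:", e)
```

If the file is missing or `torch.load` fails, the download was incomplete and re-downloading is the right next step.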

GabrielXie commented 1 year ago

Did your checkpoint download fail? Please try deleting it and re-downloading.

Still not working. I deleted the pangu folder under the checkpoints directory, but it still fails. Here is the log:

(jittor) PS F:\test-code\JittorLLMs> python web_demo.py  pangualpha
Downloading https://cg.cs.tsinghua.edu.cn/jittor/pangu/assets/build/checkpoints/model_optim_rng.pth to C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu/model_optim_rng.pth
4.89GB [01:47, 49.0MB/s]
WARNING: APEX is not installed, multi_tensor_applier will not be available.
WARNING: APEX is not installed, using torch.nn.LayerNorm instead of apex.normalization.FusedLayerNorm!
F:\test-code\JittorLLMs\models\pangualpha
using world size: 1 and model-parallel size: 1
using torch.float32 for parameters ...
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:GPT2BPETokenizer
-------------------- arguments --------------------
  adlr_autoresume ................. False
  adlr_autoresume_interval ........ 1000
  apply_query_key_layer_scaling ... False
  apply_residual_connection_post_layernorm  False
  attention_dropout ............... 0.1
  attention_softmax_in_fp32 ....... False
  batch_size ...................... 1
  bert_load ....................... None
  bias_dropout_fusion ............. False
  bias_gelu_fusion ................ False
  block_data_path ................. None
  checkpoint_activations .......... False
  checkpoint_num_layers ........... 1
  clip_grad ....................... 1.0
  data_impl ....................... infer
  data_path ....................... None
  DDP_impl ........................ local
  distribute_checkpointed_activations  False
  distributed_backend ............. nccl
  dynamic_loss_scale .............. True
  eod_mask_loss ................... False
  eval_interval ................... 1000
  eval_iters ...................... 100
  exit_interval ................... None
  faiss_use_gpu ................... False
  finetune ........................ True
  fp16 ............................ False
  fp16_lm_cross_entropy ........... False
  fp32_allreduce .................. False
  genfile ......................... None
  greedy .......................... False
  hidden_dropout .................. 0.1
  hidden_size ..................... 2560
  hysteresis ...................... 2
  ict_head_size ................... None
  ict_load ........................ None
  indexer_batch_size .............. 128
  indexer_log_interval ............ 1000
  init_method_std ................. 0.02
  layernorm_epsilon ............... 1e-05
  lazy_mpu_init ................... None
  load ............................ C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt
  local_rank ...................... None
  log_interval .................... 100
  loss_scale ...................... None
  loss_scale_window ............... 1000
  lr .............................. None
  lr_decay_iters .................. None
  lr_decay_style .................. linear
  make_vocab_size_divisible_by .... 1
  mask_prob ....................... 0.15
  max_position_embeddings ......... 1024
  merge_file ...................... None
  min_lr .......................... 0.0
  min_scale ....................... 1
  mmap_warmup ..................... False
  model_parallel_size ............. 1
  no_load_optim ................... False
  no_load_rng ..................... False
  no_save_optim ................... False
  no_save_rng ..................... False
  num_attention_heads ............. 32
  num_layers ...................... 31
  num_samples ..................... 0
  num_unique_layers ............... None
  num_workers ..................... 2
  onnx_safe ....................... None
  openai_gelu ..................... False
  out_seq_length .................. 50
  override_lr_scheduler ........... False
  param_sharing_style ............. grouped
  params_dtype .................... torch.float32
  query_in_block_prob ............. 0.1
  rank ............................ 0
  recompute ....................... False
  report_topk_accuracies .......... []
  reset_attention_mask ............ False
  reset_position_ids .............. False
  sample_input_file ............... None
  sample_output_file .............. None
  save ............................ None
  save_interval ................... None
  scaled_upper_triang_masked_softmax_fusion  False
  seed ............................ 1234
  seq_length ...................... 1024
  short_seq_prob .................. 0.1
  split ........................... 969, 30, 1
  temperature ..................... 1.0
  tensorboard_dir ................. None
  titles_data_path ................ None
  tokenizer_type .................. GPT2BPETokenizer
  top_k ........................... 2
  top_p ........................... 0.0
  train_iters ..................... None
  use_checkpoint_lr_scheduler ..... False
  use_cpu_initialization .......... False
  use_one_sent_docs ............... False
  vocab_file ...................... models/pangualpha/megatron/tokenizer/bpe_4w_pcl/vocab
  warmup .......................... 0.01
  weight_decay .................... 0.01
  world_size ...................... 1
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 40000) with 0 dummy tokens (new size: 40000)
torch distributed is already initialized, skipping initialization ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
 > number of parameters on model parallel rank 0: 2625295360
global rank 0 is loading checkpoint C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth
could not load the checkpoint
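One thing visible in the log: the download step saves the file directly to `...\checkpoints\pangu/model_optim_rng.pth`, while the loader later looks for it under `...\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth`. A hedged sketch that places the downloaded file into the layout the loader expects follows; whether a plain copy is the intended fix (rather than an extraction or rename step done by the repo itself) is an assumption.

```python
import os
import shutil

# Both paths are copied from the log above and are machine-specific.
cache = r"C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu"
downloaded = os.path.join(cache, "model_optim_rng.pth")           # where web_demo.py saved the file
expected = os.path.join(cache, "Pangu-alpha_2.6B_fp16_mgt",       # where the loader looks for it
                        "iter_0001000", "mp_rank_00",
                        "model_optim_rng.pth")

# Assumption: copying the file into the expected directory layout is sufficient.
if os.path.exists(downloaded) and not os.path.exists(expected):
    os.makedirs(os.path.dirname(expected), exist_ok=True)
    shutil.copy2(downloaded, expected)
    print("copied", downloaded, "->", expected)
else:
    print("downloaded exists:", os.path.exists(downloaded))
    print("expected exists:  ", os.path.exists(expected))
```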