Jittor / JittorLLMs

Jittor large language model inference library, featuring high performance, low hardware requirements, good Chinese language support, and portability
Apache License 2.0

could not load the checkpoint, moving the folder still doesn't work #21

Open GabrielXie opened 1 year ago

GabrielXie commented 1 year ago
(jittor) PS F:\test-code\JittorLLMs> python cli_demo.py pangualpha
WARNING: APEX is not installed, multi_tensor_applier will not be available.
WARNING: APEX is not installed, using torch.nn.LayerNorm instead of apex.normalization.FusedLayerNorm!
F:\test-code\JittorLLMs\models\pangualpha
using world size: 1 and model-parallel size: 1
using torch.float32 for parameters ...
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:GPT2BPETokenizer
-------------------- arguments --------------------
  adlr_autoresume ................. False
  adlr_autoresume_interval ........ 1000
  apply_query_key_layer_scaling ... False
  apply_residual_connection_post_layernorm  False
  attention_dropout ............... 0.1
  attention_softmax_in_fp32 ....... False
  batch_size ...................... 1
  bert_load ....................... None
  bias_dropout_fusion ............. False
  bias_gelu_fusion ................ False
  block_data_path ................. None
  checkpoint_activations .......... False
  checkpoint_num_layers ........... 1
  clip_grad ....................... 1.0
  data_impl ....................... infer
  data_path ....................... None
  DDP_impl ........................ local
  distribute_checkpointed_activations  False
  distributed_backend ............. nccl
  dynamic_loss_scale .............. True
  eod_mask_loss ................... False
  eval_interval ................... 1000
  eval_iters ...................... 100
  exit_interval ................... None
  faiss_use_gpu ................... False
  finetune ........................ True
  fp16 ............................ False
  fp16_lm_cross_entropy ........... False
  fp32_allreduce .................. False
  genfile ......................... None
  greedy .......................... False
  hidden_dropout .................. 0.1
  hidden_size ..................... 2560
  hysteresis ...................... 2
  ict_head_size ................... None
  ict_load ........................ None
  indexer_batch_size .............. 128
  indexer_log_interval ............ 1000
  init_method_std ................. 0.02
  layernorm_epsilon ............... 1e-05
  lazy_mpu_init ................... None
  load ............................ C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt
  local_rank ...................... None
  log_interval .................... 100
  loss_scale ...................... None
  loss_scale_window ............... 1000
  lr .............................. None
  lr_decay_iters .................. None
  lr_decay_style .................. linear
  make_vocab_size_divisible_by .... 1
  mask_prob ....................... 0.15
  max_position_embeddings ......... 1024
  merge_file ...................... None
  min_lr .......................... 0.0
  min_scale ....................... 1
  mmap_warmup ..................... False
  model_parallel_size ............. 1
  no_load_optim ................... False
  no_load_rng ..................... False
  no_save_optim ................... False
  no_save_rng ..................... False
  num_attention_heads ............. 32
  num_layers ...................... 31
  num_samples ..................... 0
  num_unique_layers ............... None
  num_workers ..................... 2
  onnx_safe ....................... None
  openai_gelu ..................... False
  out_seq_length .................. 50
  override_lr_scheduler ........... False
  param_sharing_style ............. grouped
  params_dtype .................... torch.float32
  query_in_block_prob ............. 0.1
  rank ............................ 0
  recompute ....................... False
  report_topk_accuracies .......... []
  reset_attention_mask ............ False
  reset_position_ids .............. False
  sample_input_file ............... None
  sample_output_file .............. None
  save ............................ None
  save_interval ................... None
  scaled_upper_triang_masked_softmax_fusion  False
  seed ............................ 1234
  seq_length ...................... 1024
  short_seq_prob .................. 0.1
  split ........................... 969, 30, 1
  temperature ..................... 1.0
  tensorboard_dir ................. None
  titles_data_path ................ None
  tokenizer_type .................. GPT2BPETokenizer
  top_k ........................... 2
  top_p ........................... 0.0
  train_iters ..................... None
  use_checkpoint_lr_scheduler ..... False
  use_cpu_initialization .......... False
  use_one_sent_docs ............... False
  vocab_file ...................... models/pangualpha/megatron/tokenizer/bpe_4w_pcl/vocab
  warmup .......................... 0.01
  weight_decay .................... 0.01
  world_size ...................... 1
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 40000) with 0 dummy tokens (new size: 40000)
torch distributed is already initialized, skipping initialization ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
 > number of parameters on model parallel rank 0: 2625295360
global rank 0 is loading checkpoint C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth
could not load the checkpoint

[image] I moved it over from outside the folder, but it still doesn't work.

cjld commented 1 year ago

Did your checkpoint download fail? Please try deleting it and re-downloading.
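Before re-downloading, a quick way to check whether the file already on disk is complete is to try opening it directly. A minimal sketch, assuming the load path printed in the log above (the path is machine-specific; adjust it to whatever your log prints):

```python
import os
import torch  # the Pangu-alpha checkpoint is a standard PyTorch .pth file

# Path copied from the log above; replace with the path your own log prints.
ckpt = r"C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth"

print("exists:", os.path.exists(ckpt))
if os.path.exists(ckpt):
    # A partially downloaded file will be far smaller than the full checkpoint (several GB).
    print("size (GB): %.2f" % (os.path.getsize(ckpt) / 1024**3))
    try:
        state = torch.load(ckpt, map_location="cpu")
        print("loaded OK, top-level keys:", list(state.keys()))
    except Exception as e:
        print("load failed, file is likely truncated or corrupted:", e)
```

If the file is missing or `torch.load` fails, the download was incomplete and re-downloading is the right next step.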

GabrielXie commented 1 year ago

Did your checkpoint download fail? Please try deleting it and re-downloading.

Still not working. I deleted the pangu folder under the checkpoints directory, but it still fails. Here is the log:

(jittor) PS F:\test-code\JittorLLMs> python web_demo.py  pangualpha
Downloading https://cg.cs.tsinghua.edu.cn/jittor/pangu/assets/build/checkpoints/model_optim_rng.pth to C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu/model_optim_rng.pth
4.89GB [01:47, 49.0MB/s]
WARNING: APEX is not installed, multi_tensor_applier will not be available.
WARNING: APEX is not installed, using torch.nn.LayerNorm instead of apex.normalization.FusedLayerNorm!
F:\test-code\JittorLLMs\models\pangualpha
using world size: 1 and model-parallel size: 1
using torch.float32 for parameters ...
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer                        with tokenizer_type:GPT2BPETokenizer
-------------------- arguments --------------------
  adlr_autoresume ................. False
  adlr_autoresume_interval ........ 1000
  apply_query_key_layer_scaling ... False
  apply_residual_connection_post_layernorm  False
  attention_dropout ............... 0.1
  attention_softmax_in_fp32 ....... False
  batch_size ...................... 1
  bert_load ....................... None
  bias_dropout_fusion ............. False
  bias_gelu_fusion ................ False
  block_data_path ................. None
  checkpoint_activations .......... False
  checkpoint_num_layers ........... 1
  clip_grad ....................... 1.0
  data_impl ....................... infer
  data_path ....................... None
  DDP_impl ........................ local
  distribute_checkpointed_activations  False
  distributed_backend ............. nccl
  dynamic_loss_scale .............. True
  eod_mask_loss ................... False
  eval_interval ................... 1000
  eval_iters ...................... 100
  exit_interval ................... None
  faiss_use_gpu ................... False
  finetune ........................ True
  fp16 ............................ False
  fp16_lm_cross_entropy ........... False
  fp32_allreduce .................. False
  genfile ......................... None
  greedy .......................... False
  hidden_dropout .................. 0.1
  hidden_size ..................... 2560
  hysteresis ...................... 2
  ict_head_size ................... None
  ict_load ........................ None
  indexer_batch_size .............. 128
  indexer_log_interval ............ 1000
  init_method_std ................. 0.02
  layernorm_epsilon ............... 1e-05
  lazy_mpu_init ................... None
  load ............................ C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt
  local_rank ...................... None
  log_interval .................... 100
  loss_scale ...................... None
  loss_scale_window ............... 1000
  lr .............................. None
  lr_decay_iters .................. None
  lr_decay_style .................. linear
  make_vocab_size_divisible_by .... 1
  mask_prob ....................... 0.15
  max_position_embeddings ......... 1024
  merge_file ...................... None
  min_lr .......................... 0.0
  min_scale ....................... 1
  mmap_warmup ..................... False
  model_parallel_size ............. 1
  no_load_optim ................... False
  no_load_rng ..................... False
  no_save_optim ................... False
  no_save_rng ..................... False
  num_attention_heads ............. 32
  num_layers ...................... 31
  num_samples ..................... 0
  num_unique_layers ............... None
  num_workers ..................... 2
  onnx_safe ....................... None
  openai_gelu ..................... False
  out_seq_length .................. 50
  override_lr_scheduler ........... False
  param_sharing_style ............. grouped
  params_dtype .................... torch.float32
  query_in_block_prob ............. 0.1
  rank ............................ 0
  recompute ....................... False
  report_topk_accuracies .......... []
  reset_attention_mask ............ False
  reset_position_ids .............. False
  sample_input_file ............... None
  sample_output_file .............. None
  save ............................ None
  save_interval ................... None
  scaled_upper_triang_masked_softmax_fusion  False
  seed ............................ 1234
  seq_length ...................... 1024
  short_seq_prob .................. 0.1
  split ........................... 969, 30, 1
  temperature ..................... 1.0
  tensorboard_dir ................. None
  titles_data_path ................ None
  tokenizer_type .................. GPT2BPETokenizer
  top_k ........................... 2
  top_p ........................... 0.0
  train_iters ..................... None
  use_checkpoint_lr_scheduler ..... False
  use_cpu_initialization .......... False
  use_one_sent_docs ............... False
  vocab_file ...................... models/pangualpha/megatron/tokenizer/bpe_4w_pcl/vocab
  warmup .......................... 0.01
  weight_decay .................... 0.01
  world_size ...................... 1
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
 > padded vocab (size: 40000) with 0 dummy tokens (new size: 40000)
torch distributed is already initialized, skipping initialization ...
> initializing model parallel with size 1
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
building GPT2 model ...
 > number of parameters on model parallel rank 0: 2625295360
global rank 0 is loading checkpoint C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth
could not load the checkpoint
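One thing visible in the log: the download step saves the file directly to `...\checkpoints\pangu/model_optim_rng.pth`, while the loader later looks for it under `...\checkpoints\pangu\Pangu-alpha_2.6B_fp16_mgt\iter_0001000\mp_rank_00\model_optim_rng.pth`. A hedged sketch that places the downloaded file into the layout the loader expects follows; whether a plain copy is the intended fix (rather than an extraction or rename step done by the repo itself) is an assumption.

```python
import os
import shutil

# Both paths are copied from the log above and are machine-specific.
cache = r"C:\Users\xgp\.cache\jittor\jt1.3.7\cl\py3.8.16\Windows-10-10.x52\AMDRyzen75800Xxc8\default\cu11.2.67\checkpoints\pangu"
downloaded = os.path.join(cache, "model_optim_rng.pth")           # where web_demo.py saved the file
expected = os.path.join(cache, "Pangu-alpha_2.6B_fp16_mgt",       # where the loader looks for it
                        "iter_0001000", "mp_rank_00",
                        "model_optim_rng.pth")

# Assumption: copying the file into the expected directory layout is sufficient.
if os.path.exists(downloaded) and not os.path.exists(expected):
    os.makedirs(os.path.dirname(expected), exist_ok=True)
    shutil.copy2(downloaded, expected)
    print("copied", downloaded, "->", expected)
else:
    print("downloaded exists:", os.path.exists(downloaded))
    print("expected exists:  ", os.path.exists(expected))
```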