RuntimeError: Failed to load model file , please make sure model file is saved with the following APIs: save_params, save_persistables, save_vars

zjykzj commented 4 years ago

Hi PaddleOCR, I want to fine-tune the pretrained model using rec_chinese_common_train.yml, but an error occurs.

environment

I downloaded the latest repo from gitee.com and the pretrained model, then installed the requirements:

$ python -m pip install -r requirements.txt
# If your machine has CUDA 10 installed, run the following command to install
$ python3 -m pip install paddlepaddle-gpu==1.7.2.post107 -i https://pypi.tuna.tsinghua.edu.cn/simple

reproduce

Put the pretrained model into the PaddleOCR directory, like this:

$ tree pretrain_models/
pretrain_models/
├── ch_rec_r34_vd_crnn
│   ├── best_accuracy.pdmodel
│   ├── best_accuracy.pdopt
│   ├── best_accuracy.pdparams
│   ├── model
│   └── params
├── ch_rec_r34_vd_crnn_infer.tar
└── ch_rec_r34_vd_crnn.tar

1 directory, 7 files

I used tar xf ... to extract it.

Then I ran the command like this:

$ CUDA_VISIBLE_DEVICES=1 python3 tools/train.py -c configs/rec/rec_chinese_common_train_zhonglian.yml -o Global.pretrain_weights=./pretrain_models/ch_rec_r34_vd_crnn/
2020-08-17 11:09:31,142-INFO: {'Global': {'debug': False, 'algorithm': 'CRNN', 'use_gpu': True, 'epoch_num': 3000, 'log_smooth_window': 20, 'print_batch_step': 10, 'save_model_dir': './output/rec_CRNN_zhonglian', 'save_epoch_step': 3, 'eval_batch_step': 2000, 'train_batch_size_per_card': 128, 'test_batch_size_per_card': 128, 'image_shape': [3, 32, 320], 'max_text_length': 25, 'character_type': 'ch', 'character_dict_path': './ppocr/utils/ppocr_keys_v1.txt', 'loss_type': 'ctc', 'reader_yml': './configs/rec/rec_chinese_reader.yml', 'pretrain_weights': './pretrain_models/ch_rec_r34_vd_crnn/', 'checkpoints': None, 'save_inference_dir': None, 'infer_img': None, 'distort': True}, 'Architecture': {'function': 'ppocr.modeling.architectures.rec_model,RecModel'}, 'Backbone': {'function': 'ppocr.modeling.backbones.rec_resnet_vd,ResNet', 'layers': 34}, 'Head': {'function': 'ppocr.modeling.heads.rec_ctc_head,CTCPredict', 'encoder_type': 'rnn', 'SeqRNN': {'hidden_size': 256}}, 'Loss': {'function': 'ppocr.modeling.losses.rec_ctc_loss,CTCLoss'}, 'Optimizer': {'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.0005, 'beta1': 0.9, 'beta2': 0.999}, 'TrainReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'num_workers': 8, 'img_set_dir': './train_data', 'label_file_path': './train_data/rec_gt_train.txt'}, 'EvalReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'img_set_dir': './train_data', 'label_file_path': './train_data/rec_gt_test.txt'}, 'TestReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader'}}
2020-08-17 11:09:33,025-INFO: places would be ommited when DataLoader is not iterable
W0817 11:09:34.215936 25802 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0817 11:09:34.220605 25802 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-08-17 11:09:35,960-INFO: Loading parameters from ./pretrain_models/ch_rec_r34_vd_crnn/...
2020-08-17 11:09:35,960-WARNING: ./pretrain_models/ch_rec_r34_vd_crnn/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-08-17 11:09:35,960-WARNING: ./pretrain_models/ch_rec_r34_vd_crnn/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py:789: UserWarning: The following exception is not an EOF exception.
  "The following exception is not an EOF exception.")
Traceback (most recent call last):
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 1865, in load_program_state
    filename=file_name)
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 793, in load_vars
    executor.run(load_prog)
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 790, in run
    six.reraise(*sys.exc_info())
  File "/home/zj/anaconda3/lib/python3.7/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 785, in run
    use_program_cache=use_program_cache)
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 838, in _run_impl
    use_program_cache=use_program_cache)
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/executor.py", line 912, in _run_program
    fetch_var_name)
paddle.fluid.core_avx.EnforceNotMet: 

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0   std::string paddle::platform::GetTraceBackString<std::string const&>(std::string const&, char const*, int)
1   paddle::platform::EnforceNotMet::EnforceNotMet(std::string const&, char const*, int)
2   paddle::framework::DeserializeFromStream(std::istream&, paddle::framework::LoDTensor*, paddle::platform::DeviceContext const&)
3   paddle::operators::LoadOpKernel<paddle::platform::CPUDeviceContext, float>::LoadLodTensor(std::istream&, paddle::platform::Place const&, paddle::framework::Variable*, paddle::framework::ExecutionContext const&) const
4   paddle::operators::LoadOpKernel<paddle::platform::CPUDeviceContext, float>::Compute(paddle::framework::ExecutionContext const&) const
5   std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CPUPlace, false, 0ul, paddle::operators::LoadOpKernel<paddle::platform::CPUDeviceContext, float>, paddle::operators::LoadOpKernel<paddle::platform::CPUDeviceContext, double>, paddle::operators::LoadOpKernel<paddle::platform::CPUDeviceContext, int>, paddle::operators::LoadOpKernel<paddle::platform::CPUDeviceContext, signed char>, paddle::operators::LoadOpKernel<paddle::platform::CPUDeviceContext, long> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
6   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&, paddle::framework::RuntimeContext*) const
7   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, paddle::platform::Place const&) const
8   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, paddle::platform::Place const&)
9   paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
10  paddle::framework::Executor::Run(paddle::framework::ProgramDesc const&, paddle::framework::Scope*, int, bool, bool, std::vector<std::string, std::allocator<std::string> > const&, bool, bool)

------------------------------------------
Python Call Stacks (More useful to users):
------------------------------------------
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2525, in append_op
    attrs=kwargs.get("attrs", None))
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 773, in load_vars
    attrs={'file_path': os.path.join(dirname, new_var.name)})
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 1865, in load_program_state
    filename=file_name)
  File "tools/../ppocr/utils/save_load.py", line 56, in _load_state
    state = fluid.io.load_program_state(path)
  File "tools/../ppocr/utils/save_load.py", line 78, in load_params
    state = _load_state(path)
  File "tools/../ppocr/utils/save_load.py", line 124, in init_model
    load_params(exe, program, path)
  File "tools/train.py", line 82, in main
    init_model(config, train_program, exe)
  File "tools/train.py", line 121, in <module>
    main()

----------------------
Error Message Summary:
----------------------
InvalidArgumentError: tensor version 118883584 is not supported, Only version 0 is supported
  [Hint: Expected version == 0U, but received version:118883584 != 0U:0.] at (/paddle/paddle/fluid/framework/lod_tensor.cc:287)
  [operator < load > error]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tools/train.py", line 121, in <module>
    main()
  File "tools/train.py", line 82, in main
    init_model(config, train_program, exe)
  File "tools/../ppocr/utils/save_load.py", line 124, in init_model
    load_params(exe, program, path)
  File "tools/../ppocr/utils/save_load.py", line 78, in load_params
    state = _load_state(path)
  File "tools/../ppocr/utils/save_load.py", line 56, in _load_state
    state = fluid.io.load_program_state(path)
  File "/home/zj/anaconda3/lib/python3.7/site-packages/paddle/fluid/io.py", line 1868, in load_program_state
    "Failed to load model file , please make sure model file is saved with the "
RuntimeError: Failed to load model file , please make sure model file is saved with the following APIs: save_params, save_persistables, save_vars

I don't know what's wrong or how to fix it. Looking forward to your help.
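The warning in the log (`./pretrain_models/ch_rec_r34_vd_crnn/.pdparams not found`) suggests that the loader appends `.pdparams` directly to the configured `pretrain_weights` value. The sketch below only mimics that string handling to show why a directory path with a trailing slash resolves to a file that does not exist. `candidate_params_file` is a hypothetical helper, not a PaddleOCR or Paddle function, and the assumption that the suffix is simply concatenated is inferred from the log message alone.

```python
def candidate_params_file(pretrain_weights: str) -> str:
    """Hypothetical helper: mimic a loader that appends '.pdparams' to the given path.

    Assumption (inferred from the warning in the log): the suffix is
    concatenated to the configured value without any directory handling.
    """
    return pretrain_weights + ".pdparams"


# Value passed on the command line in this report: a directory with a trailing slash.
dir_style = "./pretrain_models/ch_rec_r34_vd_crnn/"
# Alternative: the checkpoint name without its extension.
prefix_style = "./pretrain_models/ch_rec_r34_vd_crnn/best_accuracy"

print(candidate_params_file(dir_style))
print(candidate_params_file(prefix_style))
```

Under this assumption, only the second form points at a file that is actually present in the extracted archive (`best_accuracy.pdparams`).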

zjykzj commented 4 years ago

It looks like I have probably solved the problem myself! I tried something different:

$ mv best_accuracy.pdparams .pdparams
$ mv best_accuracy.pdopt .pdopt
$ mv best_accuracy.pdmodel .pdmodel
$ ls -al
total 576188
drwxrwxr-x 2 zj zj      4096 Aug 17 11:22 .
drwxrwxr-x 3 zj zj      4096 Aug 17 11:08 ..
-rw-r--r-- 1 zj zj       120 Jun 10 14:10 ._.DS_Store
-rw-r--r-- 1 zj zj      6148 Jun 10 14:10 .DS_Store
-rw-r--r-- 1 zj zj    309172 May 21 15:47 model
-rw-r--r-- 1 zj zj 109418179 May 21 15:47 params
-rw-r--r-- 1 zj zj   1002504 May 20 19:47 .pdmodel
-rw-r--r-- 1 zj zj 314376719 May 20 19:47 .pdopt
-rw-r--r-- 1 zj zj 164864866 May 20 19:47 .pdparams

Then everything is fine and the program works:

$ CUDA_VISIBLE_DEVICES=1 python3 tools/train.py -c configs/rec/rec_chinese_common_train_zhonglian.yml -o Global.pretrain_weights=./pretrain_models/ch_rec_r34_vd_crnn/
2020-08-17 11:23:08,667-INFO: {'Global': {'debug': False, 'algorithm': 'CRNN', 'use_gpu': True, 'epoch_num': 3000, 'log_smooth_window': 20, 'print_batch_step': 10, 'save_model_dir': './output/rec_CRNN_zhonglian', 'save_epoch_step': 3, 'eval_batch_step': 2000, 'train_batch_size_per_card': 128, 'test_batch_size_per_card': 128, 'image_shape': [3, 32, 320], 'max_text_length': 25, 'character_type': 'ch', 'character_dict_path': './ppocr/utils/ppocr_keys_v1.txt', 'loss_type': 'ctc', 'reader_yml': './configs/rec/rec_chinese_reader.yml', 'pretrain_weights': './pretrain_models/ch_rec_r34_vd_crnn/', 'checkpoints': None, 'save_inference_dir': None, 'infer_img': None, 'distort': True}, 'Architecture': {'function': 'ppocr.modeling.architectures.rec_model,RecModel'}, 'Backbone': {'function': 'ppocr.modeling.backbones.rec_resnet_vd,ResNet', 'layers': 34}, 'Head': {'function': 'ppocr.modeling.heads.rec_ctc_head,CTCPredict', 'encoder_type': 'rnn', 'SeqRNN': {'hidden_size': 256}}, 'Loss': {'function': 'ppocr.modeling.losses.rec_ctc_loss,CTCLoss'}, 'Optimizer': {'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.0005, 'beta1': 0.9, 'beta2': 0.999}, 'TrainReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'num_workers': 8, 'img_set_dir': './train_data', 'label_file_path': './train_data/rec_gt_train.txt'}, 'EvalReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'img_set_dir': './train_data', 'label_file_path': './train_data/rec_gt_test.txt'}, 'TestReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader'}}
2020-08-17 11:23:10,885-INFO: places would be ommited when DataLoader is not iterable
W0817 11:23:11.969848 13241 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0817 11:23:11.975134 13241 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-08-17 11:23:14,579-INFO: Loading parameters from ./pretrain_models/ch_rec_r34_vd_crnn/...
2020-08-17 11:23:16,366-INFO: Finish initing model from ./pretrain_models/ch_rec_r34_vd_crnn/
I0817 11:23:16.393851 13241 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 1 cards are used, so 1 programs are executed in parallel.
I0817 11:23:16.425503 13241 build_strategy.cc:365] SeqOnlyAllReduceOps:0, num_trainers:1
I0817 11:23:16.490049 13241 parallel_executor.cc:307] Inplace strategy is enabled, when build_strategy.enable_inplace = True
I0817 11:23:16.526840 13241 parallel_executor.cc:375] Garbage collection strategy is enabled, when FLAGS_eager_delete_tensor_gb = 0
2020-08-17 11:23:32,039-INFO: epoch: 1, iter: 10, lr: 0.000500, 'loss': 153.48695, 'acc': 0.78125, time: 0.983
2020-08-17 11:23:44,815-INFO: epoch: 2, iter: 20, lr: 0.000500, 'loss': 18.81405, 'acc': 0.960938, time: 0.976
2020-08-17 11:23:57,764-INFO: epoch: 3, iter: 30, lr: 0.000500, 'loss': 6.24672, 'acc': 0.992188, time: 0.982
2020-08-17 11:24:09,334-INFO: Already save model in ./output/rec_CRNN_zhonglian/iter_epoch_3
2020-08-17 11:24:25,794-INFO: epoch: 5, iter: 40, lr: 0.000500, 'loss': 2.217144, 'acc': 0.992188, time: 4.912
2020-08-17 11:24:38,837-INFO: epoch: 6, iter: 50, lr: 0.000500, 'loss': 2.046174, 'acc': 0.992188, time: 0.986
2020-08-17 11:24:49,385-INFO: Already save model in ./output/rec_CRNN_zhonglian/iter_epoch_6
2020-08-17 11:24:57,087-INFO: epoch: 7, iter: 60, lr: 0.000500, 'loss': 1.977265, 'acc': 0.996094, time: 1.010
2020-08-17 11:25:10,721-INFO: epoch: 8, iter: 70, lr: 0.000500, 'loss': 1.150095, 'acc': 1.0, time: 0.981
...
...
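Renaming the files to `.pdparams`, `.pdopt`, and `.pdmodel` makes the directory-style path resolve, but it leaves hidden files behind. A less invasive option, assuming the loader only appends the extension to whatever path it is given, is to pass the checkpoint prefix instead of the directory. The sketch below finds that prefix; `find_checkpoint_prefix` is a hypothetical helper written for illustration, not part of PaddleOCR, and the demo uses a temporary directory with empty placeholder files.

```python
import os
import tempfile

PARAMS_EXT = ".pdparams"


def find_checkpoint_prefix(model_dir):
    """Hypothetical helper: return '<model_dir>/<name>' for the first
    '<name>.pdparams' file found, or None if there is no such file."""
    for name in sorted(os.listdir(model_dir)):
        stem, ext = os.path.splitext(name)
        # splitext('.pdparams') yields ('.pdparams', ''), so a bare hidden
        # '.pdparams' file is skipped here and only named checkpoints match.
        if ext == PARAMS_EXT and stem:
            return os.path.join(model_dir, stem)
    return None


# Demo on a throwaway directory that mirrors the layout shown in the report.
with tempfile.TemporaryDirectory() as tmp:
    for fname in ("best_accuracy.pdparams", "best_accuracy.pdopt", "model", "params"):
        open(os.path.join(tmp, fname), "w").close()
    prefix = find_checkpoint_prefix(tmp)
    print(os.path.basename(prefix))
```

The returned prefix could then be passed as `-o Global.pretrain_weights=<prefix>`, provided the assumption about how the extension is appended holds for the installed version.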

zjykzj commented 4 years ago

I found another way to use it: specify the model's name (the path without the file extension) instead of the directory:

-o Global.checkpoints=output/rec_CRNN_zhonglian/best_accuracy

The complete command and its output are as follows:

$ CUDA_VISIBLE_DEVICES=1 python3 tools/eval.py -c configs/rec/rec_chinese_common_train_zhonglian.yml -o Global.checkpoints=output/rec_CRNN_zhonglian/best_accuracy
2020-08-17 13:49:35,256-INFO: {'Global': {'debug': False, 'algorithm': 'CRNN', 'use_gpu': True, 'epoch_num': 1, 'log_smooth_window': 20, 'print_batch_step': 10, 'save_model_dir': './output/rec_CRNN_zhonglian', 'save_epoch_step': 5, 'eval_batch_step': 100, 'train_batch_size_per_card': 128, 'test_batch_size_per_card': 128, 'image_shape': [3, 32, 320], 'max_text_length': 25, 'character_type': 'ch', 'character_dict_path': './ppocr/utils/ppocr_keys_v1.txt', 'loss_type': 'ctc', 'reader_yml': './configs/rec/rec_chinese_reader.yml', 'pretrain_weights': None, 'checkpoints': 'output/rec_CRNN_zhonglian/best_accuracy', 'save_inference_dir': None, 'infer_img': None, 'distort': True}, 'Architecture': {'function': 'ppocr.modeling.architectures.rec_model,RecModel'}, 'Backbone': {'function': 'ppocr.modeling.backbones.rec_resnet_vd,ResNet', 'layers': 34}, 'Head': {'function': 'ppocr.modeling.heads.rec_ctc_head,CTCPredict', 'encoder_type': 'rnn', 'SeqRNN': {'hidden_size': 256}}, 'Loss': {'function': 'ppocr.modeling.losses.rec_ctc_loss,CTCLoss'}, 'Optimizer': {'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.0005, 'beta1': 0.9, 'beta2': 0.999}, 'TrainReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'num_workers': 8, 'img_set_dir': './train_data', 'label_file_path': './train_data/rec_gt_train.txt'}, 'EvalReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader', 'img_set_dir': './train_data', 'label_file_path': './train_data/rec_gt_test.txt'}, 'TestReader': {'reader_function': 'ppocr.data.rec.dataset_traversal,SimpleReader'}}
W0817 13:49:36.792258 25075 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0817 13:49:36.796207 25075 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-08-17 13:49:40,461-INFO: Finish initing model from output/rec_CRNN_zhonglian/best_accuracy
2020-08-17 13:49:41,414-INFO: eval batch id: 0, acc: 1.0
2020-08-17 13:49:42,281-INFO: eval batch id: 1, acc: 1.0
2020-08-17 13:49:42,843-INFO: eval batch id: 2, acc: 1.0
2020-08-17 13:49:43,435-INFO: eval batch id: 3, acc: 1.0
2020-08-17 13:49:44,064-INFO: eval batch id: 4, acc: 1.0
2020-08-17 13:49:44,807-INFO: eval batch id: 5, acc: 1.0
2020-08-17 13:49:45,470-INFO: eval batch id: 6, acc: 1.0
2020-08-17 13:49:46,074-INFO: eval batch id: 7, acc: 1.0
2020-08-17 13:49:46,075-INFO: Eval result: {'avg_acc': 1.0, 'total_acc_num': 1000, 'total_sample_num': 1000}
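As a quick sanity check on the evaluation log, the numbers can be cross-checked with a few lines of arithmetic. The sketch assumes that `avg_acc` is `total_acc_num / total_sample_num` and that the last partial batch is evaluated rather than dropped; both are assumptions about the evaluation script, not confirmed behavior.

```python
import math

# Values copied from the config dump and the "Eval result" line in the log.
total_sample_num = 1000
total_acc_num = 1000
test_batch_size_per_card = 128

# Number of batches needed to cover all samples, counting a final partial batch.
expected_batches = math.ceil(total_sample_num / test_batch_size_per_card)

# Assumed definition of the reported average accuracy.
avg_acc = total_acc_num / total_sample_num

print(expected_batches)
print(avg_acc)
```

Eight batches matches the batch ids 0 through 7 in the log, and the ratio matches the reported `avg_acc` of 1.0.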

PaddlePaddle / PaddleOCR

RuntimeError: Failed to load model file , please make sure model file is saved with the following APIs: save_params, save_persistables, save_vars #543
