PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Paddle cannot iteratively train a model within a single program; the same model cannot be loaded repeatedly #52990

Open MozerWang opened 1 year ago

MozerWang commented 1 year ago

Describe the Bug

Problem: I am using rocketqa for self-training, so the same model needs to be loaded multiple times across iterations. I first run inference to obtain pseudo labels and then fine-tune on those pseudo labels. The way my wrapper logic is written, this process loads the same model twice, but the Paddle framework apparently does not support this, so an error is raised. The code is as follows:

import rocketqa

ce_model = "zh_dureader_ce_v2"
ce_conf = {
    "model": ce_model,
    "use_cuda": True,
    "device_id": 0,
    "batch_size": 32
}
# Loading the same model twice raises the error below
cross_encoder = rocketqa.load_model(**ce_conf)
cross_encoder = rocketqa.load_model(**ce_conf)

The error is as follows:

Traceback (most recent call last):
  File "/u01/bankQA/self_training/test_rkqa.py", line 297, in <module>
    cross_encoder = rocketqa.load_model(**ce_conf)
  File "/u01/miniconda3/envs/bankqa/lib/python3.8/site-packages/rocketqa/rocketqa.py", line 122, in load_model
    encoder = CrossEncoder(**encoder_conf)
  File "/u01/miniconda3/envs/bankqa/lib/python3.8/site-packages/rocketqa/encoder/cross_encoder.py", line 90, in __init__
    self.test_pyreader, self.graph_vars = create_predict_model(
  File "/u01/miniconda3/envs/bankqa/lib/python3.8/site-packages/rocketqa/model/cross_encoder_predict.py", line 39, in create_predict_model
    pyreader = fluid.layers.py_reader(
  File "/u01/miniconda3/envs/bankqa/lib/python3.8/site-packages/paddle/fluid/layers/io.py", line 723, in py_reader
    return _py_reader(
  File "/u01/miniconda3/envs/bankqa/lib/python3.8/site-packages/paddle/fluid/layers/io.py", line 440, in _py_reader
    feed_queue = core.init_lod_tensor_blocking_queue(var, capacity, False)
RuntimeError: (AlreadyExists) LoDTensorBlockingQueueHolder::InitOnce() can only be called once
  [Hint: Expected queue_ == nullptr, but received queue_ != nullptr.] (at /paddle/paddle/fluid/operators/reader/lod_tensor_blocking_queue.h:207)
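For reference, the failure appears to come from fluid.layers.py_reader: when it is given a fixed name (which rocketqa seems to do internally), the blocking-queue variable it creates lives in Paddle's global scope, so a second call finds the queue already initialized and InitOnce() refuses to run again. A minimal sketch that should reproduce the same error outside of rocketqa; the reader name and shapes here are illustrative assumptions, not taken from rocketqa:

import paddle
import paddle.fluid as fluid

paddle.enable_static()

# Calling py_reader twice with the same name reuses the same queue-holder
# variable in the global scope; the second call then fails with
# "LoDTensorBlockingQueueHolder::InitOnce() can only be called once".
for _ in range(2):
    reader = fluid.layers.py_reader(
        capacity=16,
        shapes=[[-1, 128]],
        dtypes=['int64'],
        name='test_reader',  # fixed name -> fixed queue variable (assumption)
        use_double_buffer=True)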

Additional Supplementary Information

No response

FlyingQianMM commented 1 year ago

You don't need to construct (new) two models at the same time; you can construct one model and then load two different sets of parameters into it.
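rocketqa does not appear to expose a public "reload parameters" call, so as I read this suggestion it means keeping a single predict program and executor and loading different checkpoint directories into it. A rough sketch of that idea in raw fluid; build_predict_program and the checkpoint paths are hypothetical, and it assumes the checkpoints were saved with save_persistables:

import paddle
import paddle.fluid as fluid

paddle.enable_static()
exe = fluid.Executor(fluid.CUDAPlace(0))

# Build the network (and its py_reader) exactly once ...
main_prog, startup_prog = build_predict_program()  # hypothetical helper
exe.run(startup_prog)

# ... then swap parameter sets into the same program as often as needed.
for ckpt_dir in ["./checkpoint_teacher", "./checkpoint_iter1"]:  # illustrative paths
    fluid.io.load_persistables(exe, ckpt_dir, main_program=main_prog)
    # run inference / evaluation with main_prog here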

MozerWang commented 1 year ago

You don't need to construct (new) two models at the same time; you can construct one model and then load two different sets of parameters into it.

OK, the above was just a simple example. The problem I actually run into is that the error is raised when training the model iteratively inside a loop. Below is pseudocode that shows how the model is run:

a = rocketqa.load(teacher_model)
for iter_number in range(0, 3):
    newdataset = inference(a, unlabelset)      # pseudo-label the unlabeled set
    model_config = model.train(a, newdataset)  # fine-tune on the pseudo labels
    a = rocketqa.load(model_config)            # reload the newly trained model
    evaluate(a)

The teacher model is loaded outside the loop, and then the loop starts (obtain pseudo labels -> train a new model -> load the new model -> evaluate -> obtain pseudo labels). In the first iteration (the first round of training) there is no error, but in the second iteration the following error is raised:

File "/u01/miniconda3/envs/bankqa/lib/python3.8/site-packages/paddle/fluid/layers/io.py", line 440, in _py_reader
    feed_queue = core.init_lod_tensor_blocking_queue(var, capacity, False)
RuntimeError: (AlreadyExists) LoDTensorBlockingQueueHolder::InitOnce() can only be called once
  [Hint: Expected queue_ == nullptr, but received queue_ != nullptr.] (at /paddle/paddle/fluid/operators/reader/lod_tensor_blocking_queue.h:207)

Below is my real code; the idea is the same as in the pseudocode above:

# Load the models
dual_encoder = load_retriever_model(de_model, device_id, batch_size)
cross_encoder = load_retriever_model(ce_model, device_id, batch_size)

# Evaluate the teacher model
logging.info(f"Evaluating base zero-shot model: {de_model} performance on test set")
prediction = get_zero_shot_predictions(dual_encoder, cross_encoder, data_file=data_file,
                                       index_file='testindex', topk=20, input_data=test_data)
evaluation = evaluate_retriever_performance(prediction)

for iter_number in range(1, num_iterations + 1):
    # Run inference on the unlabeled data
    logging.info(f"Inferring with {de_model} on unlabeled elements: {unlabel_data_file}")
    prediction = get_zero_shot_predictions(dual_encoder, cross_encoder, data_file=unlabel_data_file,
                                           index_file=f'{iter_number}_unlabelindex', topk=100, input_data=train_data)
    logging.info("Done inferring zero-shot model on unlabeled elements")
    # Obtain the pseudo labels
    self_training_set = get_selftraining_dataset(prediction, unlabel_data_file, data_path, iter_number)
    logging.info(f"Done collecting pseudo-labeled elements for self-training iteration {iter_number}. "
                 f"The pseudo-labeled texts are saved in {self_training_set}")
    # Train on the pseudo labels
    # We use the updated pseudo-labeled set from this iteration to fine-tune the *base* entailment model
    logging.info(f"Fine-tuning model: {de_model} on pseudo-labeled texts")
    finetuned_model_path = finetune_entailment_model(dual_encoder, self_training_set, iter_number,
                                                     learning_rate=1e-5, save_steps=5000, num_epochs=20)
    logging.info(f"Done fine-tuning. Model for self-training iteration {iter_number} "
                 f"saved to {finetuned_model_path}.")
    # Load and evaluate the fine-tuned model
    de_model = os.path.join(finetuned_model_path, "config.json")
    dual_encoder = load_retriever_model(de_model, device_id, batch_size)
    logging.info(f"Iteration {iter_number}: evaluating model {de_model} performance on test set")
    test_preds = get_zero_shot_predictions(dual_encoder, cross_encoder, data_file=data_file,
                                           index_file=f'{iter_number}_testindex', topk=20, input_data=test_data)
    evaluation = evaluate_retriever_performance(test_preds)
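
One workaround for loops like this (my own suggestion, not confirmed by the maintainers): run each iteration's loading, inference, and fine-tuning in a separate process, so that Paddle's global state, including the blocking-queue variable, is created from scratch every time. A sketch with multiprocessing, reusing the helper names and variables from the code above:

import multiprocessing as mp

def run_iteration(iter_number, de_model, queue):
    # All Paddle / rocketqa state lives only inside this child process, so
    # load_retriever_model can be called again safely on the next iteration.
    dual_encoder = load_retriever_model(de_model, device_id, batch_size)
    cross_encoder = load_retriever_model(ce_model, device_id, batch_size)
    prediction = get_zero_shot_predictions(dual_encoder, cross_encoder, data_file=unlabel_data_file,
                                           index_file=f'{iter_number}_unlabelindex', topk=100,
                                           input_data=train_data)
    self_training_set = get_selftraining_dataset(prediction, unlabel_data_file, data_path, iter_number)
    finetuned_model_path = finetune_entailment_model(dual_encoder, self_training_set, iter_number,
                                                     learning_rate=1e-5, save_steps=5000, num_epochs=20)
    queue.put(finetuned_model_path)

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # a fresh interpreter (and fresh Paddle state) per iteration
    for iter_number in range(1, num_iterations + 1):
        result = ctx.Queue()
        worker = ctx.Process(target=run_iteration, args=(iter_number, de_model, result))
        worker.start()
        worker.join()
        # the next iteration loads the newly fine-tuned checkpoint
        de_model = os.path.join(result.get(), "config.json")

The spawn start method matters here: a forked child would inherit the parent's already-initialized CUDA and Paddle state, which is exactly what this workaround is trying to avoid.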