PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

[Multithreading] RuntimeError: (AlreadyExists) Accumulators are not empty before preparing it for backward network execution. #46056

Closed kk19990709 closed 2 years ago

kk19990709 commented 2 years ago

Describe the Bug

The code below reproduces PaddleNLP/examples/text_classification/rnn and can be run as-is. With the thread count set to 1 it runs fine, and running it multiple times sequentially also works.

Code

from threading import Thread
from functools import partial
import argparse
import random

import numpy as np
import paddle
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
from paddlenlp.datasets import load_dataset

from model import BoWModel, BiLSTMAttentionModel, CNNModel, LSTMModel, GRUModel, RNNModel, SelfInteractiveAttention
from utils import convert_example, build_vocab

# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--epochs", type=int, default=15, help="Number of epoches for training.")
parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu', 'npu'], default="gpu", help="Select which device to train model, defaults to gpu.")
parser.add_argument("--lr", type=float, default=5e-5, help="Learning rate used to train.")
parser.add_argument("--save_dir", type=str, default='checkpoints/', help="Directory to save model checkpoint")
parser.add_argument("--batch_size", type=int, default=64, help="Total examples' number of a batch for training.")
parser.add_argument("--vocab_path", type=str, default="./vocab.json", help="The file path to save vocabulary.")
parser.add_argument('--network', choices=['bow', 'lstm', 'bilstm', 'gru', 'bigru', 'rnn', 'birnn', 'bilstm_attn', 'cnn'],
    default="bilstm", help="Select which network to train, defaults to bilstm.")
parser.add_argument("--init_from_ckpt", type=str, default=None, help="The path of checkpoint to be loaded.")
args = parser.parse_args()
# yapf: enable

def set_seed(seed=1000):
    """sets random seed"""
    random.seed(seed)
    np.random.seed(seed)
    paddle.seed(seed)

def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      batchify_fn=None):
    """
    Creats dataloader.

    Args:
        dataset(obj:`paddle.io.Dataset`): Dataset instance.
        trans_fn(obj:`callable`, optional, defaults to `None`): function to convert a data sample to input ids, etc.
        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
        batchify_fn(obj:`callable`, optional, defaults to `None`): function to generate mini-batch data by merging
            the sample list, None for only stack each fields of sample in axis
            0(same as :attr::`np.stack(..., axis=0)`).

    Returns:
        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
    """
    if trans_fn:
        dataset = dataset.map(trans_fn)

    shuffle = mode == 'train'
    if mode == "train":
        sampler = paddle.io.DistributedBatchSampler(dataset=dataset,
                                                    batch_size=batch_size,
                                                    shuffle=shuffle)
    else:
        sampler = paddle.io.BatchSampler(dataset=dataset,
                                         batch_size=batch_size,
                                         shuffle=shuffle)
    dataloader = paddle.io.DataLoader(dataset,
                                      batch_sampler=sampler,
                                      collate_fn=batchify_fn)
    return dataloader

def main():
    paddle.set_device(args.device)
    set_seed(1000)

    # Loads dataset.
    train_ds, dev_ds = load_dataset("chnsenticorp", splits=["train", "dev"])
    texts = []
    for data in train_ds:
        texts.append(data["text"])
    for data in dev_ds:
        texts.append(data["text"])

    # Reads stop words.
    # Stopwords are just for example.
    # It should be updated according to the corpus.
    stopwords = set(["的", "吗", "吧", "呀", "呜", "呢", "呗"])
    # Builds vocab.
    word2idx = build_vocab(texts,
                           stopwords,
                           min_freq=5,
                           unk_token="[UNK]",
                           pad_token="[PAD]")
    vocab = Vocab.from_dict(word2idx, unk_token="[UNK]", pad_token="[PAD]")
    # Saves vocab.
    vocab.to_json(args.vocab_path)

    # Constructs the network.
    network = args.network.lower()
    vocab_size = len(vocab)
    num_classes = len(train_ds.label_list)
    pad_token_id = vocab.to_indices('[PAD]')
    if network == 'bow':
        model = BoWModel(vocab_size, num_classes, padding_idx=pad_token_id)
    elif network == 'bigru':
        model = GRUModel(vocab_size,
                         num_classes,
                         direction='bidirect',
                         padding_idx=pad_token_id)
    elif network == 'bilstm':
        model = LSTMModel(vocab_size,
                          num_classes,
                          direction='bidirect',
                          padding_idx=pad_token_id)
    elif network == 'bilstm_attn':
        lstm_hidden_size = 196
        attention = SelfInteractiveAttention(hidden_size=2 * lstm_hidden_size)
        model = BiLSTMAttentionModel(attention_layer=attention,
                                     vocab_size=vocab_size,
                                     lstm_hidden_size=lstm_hidden_size,
                                     num_classes=num_classes,
                                     padding_idx=pad_token_id)
    elif network == 'birnn':
        model = RNNModel(vocab_size,
                         num_classes,
                         direction='bidirect',
                         padding_idx=pad_token_id)
    elif network == 'cnn':
        model = CNNModel(vocab_size, num_classes, padding_idx=pad_token_id)
    elif network == 'gru':
        model = GRUModel(vocab_size,
                         num_classes,
                         direction='forward',
                         padding_idx=pad_token_id,
                         pooling_type='max')
    elif network == 'lstm':
        model = LSTMModel(vocab_size,
                          num_classes,
                          direction='forward',
                          padding_idx=pad_token_id,
                          pooling_type='max')
    elif network == 'rnn':
        model = RNNModel(vocab_size,
                         num_classes,
                         direction='forward',
                         padding_idx=pad_token_id,
                         pooling_type='max')
    else:
        raise ValueError(
            "Unknown network: %s, it must be one of bow, lstm, bilstm, cnn, gru, bigru, rnn, birnn and bilstm_attn."
            % network)
    model = paddle.Model(model)

    # Reads data and generates mini-batches.
    tokenizer = JiebaTokenizer(vocab)
    trans_fn = partial(convert_example, tokenizer=tokenizer, is_test=False)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=vocab.token_to_idx.get('[PAD]', 0)),  # input_ids
        Stack(dtype="int64"),  # seq len
        Stack(dtype="int64")  # label
    ): [data for data in fn(samples)]
    train_loader = create_dataloader(train_ds,
                                     trans_fn=trans_fn,
                                     batch_size=args.batch_size,
                                     mode='train',
                                     batchify_fn=batchify_fn)
    dev_loader = create_dataloader(dev_ds,
                                   trans_fn=trans_fn,
                                   batch_size=args.batch_size,
                                   mode='validation',
                                   batchify_fn=batchify_fn)

    optimizer = paddle.optimizer.Adam(parameters=model.parameters(),
                                      learning_rate=args.lr)

    # Defines loss and metric.
    criterion = paddle.nn.CrossEntropyLoss()
    metric = paddle.metric.Accuracy()

    model.prepare(optimizer, criterion, metric)

    # Loads pre-trained parameters.
    if args.init_from_ckpt:
        model.load(args.init_from_ckpt)
        print("Loaded checkpoint from %s" % args.init_from_ckpt)

    # Starts training and evaluating.
    callback = paddle.callbacks.ProgBarLogger(log_freq=10, verbose=3)
    model.fit(train_loader,
              dev_loader,
              epochs=args.epochs,
              save_dir=args.save_dir,
              callbacks=callback)

if __name__ == "__main__":
    # Launch four training threads concurrently; this is what triggers the error.
    threads = []
    for _ in range(4):
        t = Thread(target=main)
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

How to run

./python3 train.py \
    --vocab_path='./vocab.json' \
    --device=cpu \
    --network=bow \
    --lr=0.004 \
    --batch_size=6 \
    --epochs=42 \
    --save_dir='./checkpoints'

Error message

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/home/output/python37/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/output/python37/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "train2.py", line 210, in main
    callbacks=callback)
  File "/home/output/python37/lib/python3.7/site-packages/paddle/hapi/model.py", line 1732, in fit
    logs = self._run_one_epoch(train_loader, cbks, 'train')
  File "/home/output/python37/lib/python3.7/site-packages/paddle/hapi/model.py", line 2062, in _run_one_epoch
    outs = getattr(self, mode + '_batch')(*_inputs)
  File "/home/output/python37/lib/python3.7/site-packages/paddle/hapi/model.py", line 1061, in train_batch
    loss = self._adapter.train_batch(inputs, labels, update)
  File "/home/output/python37/lib/python3.7/site-packages/paddle/hapi/model.py", line 727, in train_batch
    final_loss.backward()
  File "</home/output/python37/lib/python3.7/site-packages/decorator.py:decorator-gen-128>", line 2, in backward
  File "/home/output/python37/lib/python3.7/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/output/python37/lib/python3.7/site-packages/paddle/fluid/framework.py", line 229, in __impl__
    return func(*args, **kwargs)
  File "/home/output/python37/lib/python3.7/site-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 249, in backward
    framework._dygraph_tracer())
RuntimeError: (AlreadyExists) Accumulators are not empty before preparing it for backward network execution.
  [Hint: Expected accumulators_.empty() == true, but received accumulators_.empty():0 != true:1.] (at /paddle/paddle/fluid/imperative/basic_engine.cc:55)

Additional Supplementary Information

paddle-bot[bot] commented 2 years ago

Hi! We've received your issue; please be patient while a technician is assigned to answer it as soon as possible. Please double-check that you have provided a clear problem description, reproduction code, environment & version info, and the error message. You may also consult the official API documentation, the FAQ, past GitHub issues, and the AI community for answers. Have a nice day!

limin2021 commented 2 years ago

Hi, Paddle's dynamic graph (dygraph) mode does not support multithreading, only multiprocessing. You can try using multiple processes instead.
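
For reference, here is a minimal sketch of that suggestion, assuming each run is fully independent; it only swaps the threads in the __main__ block of the repro script for processes:

# Hypothetical multiprocessing variant of the __main__ block above: each run
# gets its own process, and therefore its own dygraph tracer and gradient
# accumulators, so they are no longer shared across concurrent runs.
from multiprocessing import Process

if __name__ == "__main__":
    processes = []
    for _ in range(4):
        p = Process(target=main)  # same main() as in the repro script
        processes.append(p)
        p.start()
    for p in processes:
        p.join()

Note that all four processes would still write to the same --save_dir and --vocab_path, so per-process paths are advisable, and on GPU the default fork start method can clash with an already-initialized CUDA context, in which case calling multiprocessing.set_start_method('spawn') first may be necessary.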

kk19990709 commented 2 years ago

I've made the change, thanks!