PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

Out of GPU memory with batch size 1 #2931

Closed zoeChen119 closed 1 year ago

zoeChen119 commented 2 years ago

```python
# Model training.
# Assumes model, criterion, metric, optimizer, lr_scheduler, tokenizer, evaluate,
# train_data_loader, and dev_data_loader are defined earlier (not shown in this issue).
import os
import time

import paddle
import paddle.nn.functional as F
import pandas as pd

save_dir = "checkpoint/bert-wwm"
os.makedirs(save_dir, exist_ok=True)

save_train_result = "./results/bert-wwm.tsv"
os.makedirs(os.path.dirname(save_train_result), exist_ok=True)
columns = ["global_step", "epoch", "step", "loss", "acc", "time"]
train_records = []  # DataFrame.append was removed in pandas 2.0, so rows are collected in a list

pre_accu = 0
accu = 0
global_step = 0
epochs = 10
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        start = time.time()
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        loss_value = loss.item()  # log a Python float rather than the tensor itself
        global_step += 1
        if global_step % 2 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f"
                  % (global_step, epoch, step, loss_value, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()

        # Record the running time of this step.
        end = time.time()
        train_records.append({"global_step": global_step, "epoch": epoch, "step": step,
                              "loss": loss_value, "acc": acc, "time": end - start})

    # Evaluate on the dev set at the end of each epoch.
    accu = evaluate(model, criterion, metric, dev_data_loader)
    print(accu)
    if accu > pre_accu:
        # Save the parameters only when they beat the previous best epoch.
        save_param_path = os.path.join(save_dir, "model_state.pdparams")
        paddle.save(model.state_dict(), save_param_path)
        pre_accu = accu

tokenizer.save_pretrained(save_dir)
pd.DataFrame(train_records, columns=columns).to_csv(
    save_train_result, sep="\t", index=False, header=True)
```

SystemError: (Fatal) Operator dropout raises an paddle::memory::allocation::BadAlloc exception. The exception content is :ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 12.000000MB memory on GPU 0, 39.397339GB memory has been allocated and available memory is only 11.562500MB.

Please check whether there is any other process using GPU 0.

  1. If yes, please stop them, or start PaddlePaddle on another GPU.
  2. If no, please decrease the batch size of your model.

If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is export FLAGS_use_cuda_managed_memory=false. (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:87) . (at /paddle/paddle/fluid/imperative/tracer.cc:307)

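
The numbers in this error are worth reading closely: the failing request is only 12 MB, while about 39.4 GB is already allocated on the card. A minimal sketch for pulling those figures out of the message follows. The function name `parse_oom` and the regular expression are illustrative, not part of Paddle, and the pattern assumes the message wording shown above. The message alone does not say whether the allocated memory belongs to this process or to another one.

```python
import re

# Convert the units used in the message to megabytes.
_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}

# Hypothetical pattern written against the exact wording quoted in this issue.
_PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_u>[KMG]B) memory on GPU (?P<gpu>\d+), "
    r"(?P<used>[\d.]+)(?P<used_u>[KMG]B) memory has been allocated and "
    r"available memory is only (?P<free>[\d.]+)(?P<free_u>[KMG]B)"
)


def parse_oom(message):
    """Return requested/allocated/available sizes in MB, or None if nothing matches."""
    m = _PATTERN.search(message)
    if m is None:
        return None
    return {
        "gpu": int(m["gpu"]),
        "requested_mb": float(m["req"]) * _TO_MB[m["req_u"]],
        "allocated_mb": float(m["used"]) * _TO_MB[m["used_u"]],
        "available_mb": float(m["free"]) * _TO_MB[m["free_u"]],
    }


message = (
    "Out of memory error on GPU 0. Cannot allocate 12.000000MB memory on GPU 0, "
    "39.397339GB memory has been allocated and available memory is only 11.562500MB."
)
info = parse_oom(message)
print(info)
```

Because the request barely exceeds the free memory, the dropout operator named in the traceback is probably just where the card finally ran out, not necessarily where most of the memory went.
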
zoeChen119 commented 2 years ago

The models I used are bert-wwm-chinese, albert-chinese-tiny, and skep_ernie_1.0_large_ch.
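
For a sense of scale, here is a back-of-the-envelope estimate of the memory that weights, gradients, and optimizer state take during training. This is only a sketch under stated assumptions: float32 values, an Adam-style optimizer with two extra slots per parameter (the issue does not say which optimizer is used), and a hypothetical round parameter count of roughly BERT-base size. Activations and allocator overhead are not counted. The helper name `training_state_gib` is made up for illustration.

```python
def training_state_gib(num_params, bytes_per_value=4, optimizer_slots=2):
    """Rough size of weights + gradients + optimizer slots, in GiB.

    Activations, workspace buffers, and allocator overhead are NOT included.
    """
    copies = 1 + 1 + optimizer_slots  # weights, gradients, optimizer state
    return num_params * bytes_per_value * copies / 1024**3


# Hypothetical round figure, roughly BERT-base scale; not measured from this issue.
assumed_params = 110_000_000
print(f"{training_state_gib(assumed_params):.2f} GiB")
```

Under these assumptions the persistent training state is a small fraction of the roughly 39 GB reported as allocated, which is one reason to look at what else is using the GPU before shrinking the model.
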

LiuChiachi commented 2 years ago

How much GPU memory do you have? Could you try a different card?
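
One way to answer this question is to ask `nvidia-smi` for per-GPU totals and usage. The sketch below assumes `nvidia-smi` is on the PATH and that the memory columns come back as plain integers, which is what the `nounits` format option produces. The helper names are illustrative, and the unused figure is an approximation computed as total minus used.

```python
import shutil
import subprocess

# nvidia-smi query flags; "nounits" makes the memory columns plain MiB integers.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(csv_text):
    """Parse 'index, total, used' CSV rows into dictionaries (sizes in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, total, used = (int(field) for field in line.split(","))
        rows.append({"index": index, "total_mib": total, "used_mib": used,
                     "free_mib": total - used})  # approximate: total minus used
    return rows


def query_gpus():
    """Run nvidia-smi and return one dictionary per visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found on PATH")
    else:
        try:
            for gpu in query_gpus():
                print("GPU %(index)d: %(used_mib)d / %(total_mib)d MiB used" % gpu)
        except (subprocess.CalledProcessError, ValueError) as exc:
            print(f"could not query GPUs: {exc}")
```

To see which processes hold the memory, `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv` lists them per process. A leftover notebook kernel or an earlier training run is a common culprit.
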

zoeChen119 commented 2 years ago

"How much GPU memory do you have? Could you try a different card?" [image]

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.