zoeChen119 commented 2 years ago

Version and environment information: 1) PaddleNLP 2.3, PaddlePaddle 2.3; 2) System environment: Linux, Python 3.7; 3) batch_size=1, max_seq_length=512; 600 training examples, 200 test examples, 200 dev examples.

```python
# Model training:
# model, criterion, metric, optimizer, lr_scheduler, tokenizer, evaluate and
# the data loaders are defined elsewhere (not shown in this issue).
import os
import time

import paddle
import paddle.nn.functional as F
import pandas as pd

save_dir = "checkpoint/bert-wwm"
if not os.path.exists(save_dir):
    os.makedirs(save_dir)

save_train_result = "./results/bert-wwm.tsv"
train_r_df = pd.DataFrame(data=None, columns=["global_step", "epoch", "step", "loss", "acc", "time"])

pre_accu = 0
accu = 0
global_step = 0
epochs = 10
for epoch in range(1, epochs + 1):
    for step, batch in enumerate(train_data_loader, start=1):
        start = time.time()
        input_ids, segment_ids, labels = batch
        logits = model(input_ids, segment_ids)
        loss = criterion(logits, labels)
        probs = F.softmax(logits, axis=1)
        correct = metric.compute(probs, labels)
        metric.update(correct)
        acc = metric.accumulate()
        global_step += 1
        if global_step % 2 == 0:
            print("global step %d, epoch: %d, batch: %d, loss: %.5f, acc: %.5f" % (global_step, epoch, step, loss, acc))
        loss.backward()
        optimizer.step()
        lr_scheduler.step()
        optimizer.clear_grad()
        # Record the running time of this step
        end = time.time()
        train_r_df = train_r_df.append({"global_step": global_step, "epoch": epoch, "step": step, "loss": loss, "acc": acc, "time": end - start}, ignore_index=True)
    # Evaluate on the dev set at the end of each epoch
    accu = evaluate(model, criterion, metric, dev_data_loader)
    print(accu)
    if accu > pre_accu:
        # Save the parameters only when they beat the previous best epoch
        save_param_path = os.path.join(save_dir, 'model_state.pdparams')  # path for the model parameters
        paddle.save(model.state_dict(), save_param_path)
        pre_accu = accu

tokenizer.save_pretrained(save_dir)
train_r_df.to_csv(save_train_result, sep="\t", index=False, header=True)
```
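One detail in the loop above that may be worth ruling out: each step stores `loss` in the log as it is, and `loss` is a tensor rather than a number. Whether that has anything to do with the out-of-memory error is not established in this thread. The sketch below shows one way to log plain Python numbers instead. `StepLog` is a hypothetical helper, not part of PaddleNLP, and it assumes that `float(loss)` works on a one-element tensor.

```python
import csv


class StepLog:
    """Collect per-step metrics as plain Python numbers (hypothetical helper)."""

    FIELDS = ["global_step", "epoch", "step", "loss", "acc", "time"]

    def __init__(self):
        self.rows = []

    def add(self, global_step, epoch, step, loss, acc, elapsed):
        # float() copies the value out of a one-element tensor, so the stored
        # row does not hold a reference to the tensor object itself.
        self.rows.append({
            "global_step": int(global_step),
            "epoch": int(epoch),
            "step": int(step),
            "loss": float(loss),
            "acc": float(acc),
            "time": float(elapsed),
        })

    def to_tsv(self, path):
        # Write all rows once at the end, tab-separated with a header line.
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=self.FIELDS, delimiter="\t")
            writer.writeheader()
            writer.writerows(self.rows)
```

In the training loop this would replace the `train_r_df.append(...)` call with `log.add(global_step, epoch, step, loss, acc, end - start)`, followed by `log.to_tsv(save_train_result)` after the last epoch.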

Error message:

SystemError: (Fatal) Operator dropout raises an paddle::memory::allocation::BadAlloc exception. The exception content is :ResourceExhaustedError:

Out of memory error on GPU 0. Cannot allocate 12.000000MB memory on GPU 0, 39.397339GB memory has been allocated and available memory is only 11.562500MB.

Please check whether there is any other process using GPU 0.

If yes, please stop them, or start PaddlePaddle on another GPU.

If no, please decrease the batch size of your model. If the above ways do not solve the out of memory problem, you can try to use CUDA managed memory. The command is export FLAGS_use_cuda_managed_memory=false. (at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:87) . (at /paddle/paddle/fluid/imperative/tracer.cc:307)
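To read the numbers in this message at a glance, here is a minimal sketch that pulls the requested, allocated and available amounts out of the text and converts them to MB. `parse_oom` and the regular expression are written for illustration against the exact wording quoted above; other PaddlePaddle versions may phrase the message differently.

```python
import re

# Pattern written against the message quoted in this issue (an assumption:
# the wording may differ between PaddlePaddle versions).
_OOM_PATTERN = re.compile(
    r"Cannot allocate (?P<requested>[\d.]+)(?P<req_unit>[KMG]B) memory on GPU (?P<gpu>\d+), "
    r"(?P<allocated>[\d.]+)(?P<alloc_unit>[KMG]B) memory has been allocated "
    r"and available memory is only (?P<available>[\d.]+)(?P<avail_unit>[KMG]B)"
)

# Conversion factors from each unit to MB.
_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def parse_oom(message):
    """Return the GPU index and the three amounts in MB, or None if no match."""
    m = _OOM_PATTERN.search(message)
    if m is None:
        return None
    return {
        "gpu": int(m["gpu"]),
        "requested_mb": float(m["requested"]) * _TO_MB[m["req_unit"]],
        "allocated_mb": float(m["allocated"]) * _TO_MB[m["alloc_unit"]],
        "available_mb": float(m["available"]) * _TO_MB[m["avail_unit"]],
    }
```

For the message above, the request is 12 MB while roughly 39.4 GB is already allocated and only about 11.6 MB is free, so the card is essentially full before this small allocation is attempted.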

zoeChen119 commented 2 years ago

The models used are bert-wwm-chinese, albert-chinese-tiny, and skep_ernie_1.0_large_ch.

LiuChiachi commented 2 years ago

How much GPU memory do you have? Could you try a different card?
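For anyone answering this question on their own machine, a minimal sketch that reads total and used memory per GPU from `nvidia-smi`. It assumes `nvidia-smi` is on the PATH and that every queried field comes back as an integer number of MiB; `parse_gpu_rows` and `query_gpus` are illustrative names, not part of any library.

```python
import subprocess

# nvidia-smi query: one CSV line per GPU, values in MiB without unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.total,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse lines such as '0, 40960, 40348' into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, total, used = [part.strip() for part in line.split(",")]
        rows.append({
            "index": int(index),
            "total_mib": int(total),
            "used_mib": int(used),
        })
    return rows


def query_gpus():
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, check=True, capture_output=True, text=True)
    return parse_gpu_rows(result.stdout)
```

Calling `query_gpus()` before starting training shows whether the card is already occupied, which is the first thing the error message asks to check.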

zoeChen119 commented 2 years ago

How much GPU memory do you have? Could you try a different card?

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

PaddlePaddle / PaddleNLP

Out of GPU memory with batch size 1 #2931
