Training a rec model always fails with `ZeroDivisionError: float division by zero`, interrupting training

lona-cn commented 1 month ago

🔎 Search before asking

[x] I have searched the PaddleOCR Docs and found no similar bug report.
[X] I have searched the PaddleOCR Issues and found no similar bug report.
[X] I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug

Description

When training the en_PP-OCRv4_rec model (config en_PP-OCRv4_rec.yml from models_list), training always fails with ZeroDivisionError: float division by zero at around epoch 300-1000. In en_PP-OCRv4_rec.yml I changed only the required parameters Global.pretrained_model, Global.epoch_num, Global.save_model_dir, Train.dataset.data_dir, Train.dataset.label_file_list, Eval.dataset.data_dir, and Eval.dataset.label_file_list; everything else keeps the defaults from the models_list download.

Key console output

Training command: python .\tools\train.py -c G:\work\ai\ocr\paddleocr\models\train\en_PP-OCRv4_rec.yml

[2024/10/12 14:08:23] ppocr INFO: save model in F:/train_output/rec_ppocr_v4\latest
[2024/10/12 14:08:27] ppocr INFO: epoch: [850/20000], global_step: 2000, lr: 0.000498, acc: 1.000000, norm_edit_dis: 1.000000, CTCLoss: 0.003308, NRTRLoss: 0.788387, loss: 0.790897, avg_reader_cost: 0.10495 s, avg_batch_cost: 0.83570 s, avg_samples: 69.76, ips: 83.47495 samples/s, eta: 1 day, 3:10:05, max_mem_reserved: 12103 MB, max_mem_allocated: 11955 MB
eval model::   0%|                                                                               | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "G:\work\pythonProject\PaddleOCR\tools\train.py", line 264, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "G:\work\pythonProject\PaddleOCR\tools\train.py", line 217, in main
    program.train(
  File "G:\work\pythonProject\PaddleOCR\tools\program.py", line 464, in train
    cur_metric = eval(
  File "G:\work\pythonProject\PaddleOCR\tools\program.py", line 712, in eval
    metric["fps"] = total_frame / total_time
ZeroDivisionError: float division by zero

A quick analysis: inside the for loop in this code, total_time may never be updated, which leads to the ZeroDivisionError.
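The following is a minimal sketch of how that can happen, not PaddleOCR code. It copies only the loop-control and timing logic from the `eval` excerpt quoted in this issue; `simulate_eval` and its parameters are hypothetical names used for illustration. The progress bar in the log shows `0/1`, which suggests the eval dataloader yields a single batch. Under that assumption, the Windows branch sets `max_iter = len(valid_dataloader) - 1 = 0`, so the loop breaks before any time is accumulated.

```python
import time


def simulate_eval(num_batches, on_windows):
    """Mimic only the loop control and timing of the quoted eval() (hypothetical helper)."""
    total_frame = 0.0
    total_time = 0.0
    # Same rule as the excerpt: one fewer iteration on Windows.
    max_iter = num_batches - 1 if on_windows else num_batches
    for idx in range(num_batches):
        if idx >= max_iter:
            break
        start = time.perf_counter()
        sum(range(1000))  # stand-in for the model forward pass
        total_time += time.perf_counter() - start
        total_frame += 1
    return total_frame, total_time


# One eval batch on Windows: the loop body never runs, so both totals stay 0.0
# and a later `total_frame / total_time` would raise ZeroDivisionError.
frames, elapsed = simulate_eval(num_batches=1, on_windows=True)
print(frames, elapsed)
```

If this is indeed the cause, an eval set that produces at least two batches would probably avoid the empty loop on Windows, although the last batch would still be skipped.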

🏃‍♂️ Environment

OS: WIN11 23H2
paddleocr: 0e3cfc0792b795e980beb03b196a91e3d0e286ca (latest on the main branch at the time)
paddlepaddle-gpu: 2.4.2.post117
cudatoolkit: 11.3.1
cudnn: 8.2.1
python: 3.9.19

🌰 Minimal Reproducible Example

Since I have not modified the code, I am quoting the relevant part of the code where the problem occurs.

def eval(
    model,
    valid_dataloader,
    post_process_class,
    eval_class,
    model_type=None,
    extra_input=False,
    scaler=None,
    amp_level="O2",
    amp_custom_black_list=[],
    amp_custom_white_list=[],
    amp_dtype="float16",
):
    model.eval()
    with paddle.no_grad():
        total_frame = 0.0
        total_time = 0.0
        pbar = tqdm(
            total=len(valid_dataloader), desc="eval model:", position=0, leave=True
        )
        max_iter = (
            len(valid_dataloader) - 1
            if platform.system() == "Windows"
            else len(valid_dataloader)
        )
        sum_images = 0
        for idx, batch in enumerate(valid_dataloader):
            if idx >= max_iter:
                break
            images = batch[0]
            start = time.time()

            # use amp
            if scaler:
                with paddle.amp.auto_cast(
                    level=amp_level,
                    custom_black_list=amp_custom_black_list,
                    dtype=amp_dtype,
                ):
                    if model_type == "table" or extra_input:
                        preds = model(images, data=batch[1:])
                    elif model_type in ["kie"]:
                        preds = model(batch)
                    elif model_type in ["can"]:
                        preds = model(batch[:3])
                    elif model_type in ["latexocr"]:
                        preds = model(batch)
                    elif model_type in ["sr"]:
                        preds = model(batch)
                        sr_img = preds["sr_img"]
                        lr_img = preds["lr_img"]
                    else:
                        preds = model(images)
                preds = to_float32(preds)
            else:
                if model_type == "table" or extra_input:
                    preds = model(images, data=batch[1:])
                elif model_type in ["kie"]:
                    preds = model(batch)
                elif model_type in ["can"]:
                    preds = model(batch[:3])
                elif model_type in ["latexocr"]:
                    preds = model(batch)
                elif model_type in ["sr"]:
                    preds = model(batch)
                    sr_img = preds["sr_img"]
                    lr_img = preds["lr_img"]
                else:
                    preds = model(images)

            batch_numpy = []
            for item in batch:
                if isinstance(item, paddle.Tensor):
                    batch_numpy.append(item.numpy())
                else:
                    batch_numpy.append(item)
            # Obtain usable results from post-processing methods
            total_time += time.time() - start
            # Evaluate the results of the current batch
            if model_type in ["table", "kie"]:
                if post_process_class is None:
                    eval_class(preds, batch_numpy)
                else:
                    post_result = post_process_class(preds, batch_numpy)
                    eval_class(post_result, batch_numpy)
            elif model_type in ["sr"]:
                eval_class(preds, batch_numpy)
            elif model_type in ["can"]:
                eval_class(preds[0], batch_numpy[2:], epoch_reset=(idx == 0))
            elif model_type in ["latexocr"]:
                post_result = post_process_class(preds, batch[1], "eval")
                eval_class(post_result[0], post_result[1], epoch_reset=(idx == 0))
            else:
                post_result = post_process_class(preds, batch_numpy[1])
                eval_class(post_result, batch_numpy)

            pbar.update(1)
            total_frame += len(images)
            sum_images += 1
        # Get final metric, e.g. acc or hmean
        metric = eval_class.get_metric()

    pbar.close()
    model.train()
    # total_time may be 0 here, which causes the ZeroDivisionError
    metric["fps"] = total_frame / total_time
    return metric

lona-cn commented 1 month ago

As a quick hack, I simply changed it to:

    pbar.close()
    model.train()
    if total_time == 0:
        total_time = 1
    metric["fps"] = total_frame / total_time

At least training can continue now.

Liyulingyue commented 1 month ago

Thanks for the feedback. If you have time, you are welcome to submit a PR that fixes this (adding a guard before the division is enough).
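One possible shape for such a guard is sketched below. It is an illustration only, not the fix that was merged upstream; `safe_fps` is a hypothetical helper name, and reporting 0.0 when nothing was timed is an assumption about the desired behavior.

```python
def safe_fps(total_frame, total_time):
    """Return frames per second, or 0.0 when no time was measured (hypothetical helper)."""
    if total_time <= 0:
        # No batch was timed, so there is no meaningful throughput to report.
        return 0.0
    return total_frame / total_time


# Usage at the end of an eval-style function: the metric dict is filled
# without raising even when the evaluation loop never ran.
metric = {"acc": 1.0}
metric["fps"] = safe_fps(total_frame=0.0, total_time=0.0)
print(metric)
```

A guard like this only prevents the crash. When no batch was evaluated, the other values returned by `eval_class.get_metric()` are computed from zero samples, so they should probably not be trusted for model selection.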
