YuHengsss / YOLOV

This repo is a PyTorch implementation of the YOLOV series.
Apache License 2.0

Error during training #47

Closed Hikaylee closed 1 year ago

Hikaylee commented 1 year ago

Hello, could you explain why training reports the following error and gets interrupted every epoch, and why it always happens at iteration 7020?

2023-03-04 06:39:27.342 | INFO | yolox.core.vid_trainer:after_iter:279 - epoch: 1/7, iter: 7020/9366, mem: 8055Mb, iter_time: 4.363s, data_time: 3.605s, total_loss: 1.1, iou_loss: 0.7, l1_loss: 0.0, conf_loss: 0.2, cls_loss: 0.1, lr: 2.247e-03, size: 480, ETA: 2 days, 21:48:48
2023-03-04 06:39:38.261 | INFO | yolox.core.vid_trainer:after_train:198 - Training of experiment is done and the best AP is 0.00
2023-03-04 06:39:38.262 | ERROR | yolox.core.launch:launch:98 - An error has been caught in function 'launch', process 'MainProcess' (267170), thread 'MainThread' (140410779206464):

Traceback (most recent call last):
  File "tools/vid_train.py", line 151
    args=(exp, args),
  File "./yolox/core/launch.py", line 98, in launch
    main_func(*args)
  File "tools/vid_train.py", line 128, in main
    trainer.train()
  File "./yolox/core/vid_trainer.py", line 85, in train
    self.train_in_epoch()
  File "./yolox/core/vid_trainer.py", line 94, in train_in_epoch
    self.train_in_iter()
  File "./yolox/core/vid_trainer.py", line 100, in train_in_iter
    self.train_one_iter()
  File "./yolox/core/vid_trainer.py", line 107, in train_one_iter
    inps = inps.to(self.data_type)
    (loguru value annotations: inps is None, self.data_type is torch.float16)

AttributeError: 'NoneType' object has no attribute 'to'

Hikaylee commented 1 year ago

@YuHengsss Could you please take a look? Thanks.

YuHengsss commented 1 year ago

Hello, an image in that iteration probably failed to be read; please check its input. I suspect the problem is that some images are missing.
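
A minimal sketch of such a check, assuming you can collect the list of frame paths from your annotation files (the helper name, image_paths, and data_dir here are illustrative, not part of the repo):

import os
import cv2

def find_bad_images(image_paths, data_dir):
    """Return frames that are missing on disk or that cv2 cannot decode."""
    bad = []
    for rel_path in image_paths:               # rel_path: frame path as listed in your annotations
        full_path = os.path.join(data_dir, rel_path)
        if not os.path.isfile(full_path):
            bad.append((rel_path, "missing"))
            continue
        if cv2.imread(full_path) is None:      # cv2.imread returns None for corrupt/unreadable files
            bad.append((rel_path, "unreadable"))
    return bad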

Hikaylee commented 1 year ago

Then may I ask: when I run python tools/val_to_imdb.py for evaluation, why does the machine's memory usage keep growing during the run? Every time, the process gets interrupted because memory fills up before all 555 loop iterations finish.

YuHengsss commented 1 year ago

Because the total number of predictions is too large; even 32 GB of RAM sometimes blows up in our test environment.

YuHengsss commented 1 year ago

Because the total number of predictions is too large; even 32 GB of RAM sometimes blows up in our test environment.

You can mitigate it by raising the confidence threshold or adding more RAM, though the former causes a slight drop in AP.
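
For reference, the first option just means dropping low-score detections before they are accumulated; a repo-agnostic sketch, assuming an (N, 6) [x1, y1, x2, y2, score, cls] layout (the actual output format of val_to_imdb.py may differ):

import torch

def filter_by_conf(pred, conf_thre=0.25):
    # pred assumed to be an (N, 6) tensor laid out as [x1, y1, x2, y2, score, cls];
    # adjust the score column index to whatever the predictor actually returns.
    if pred is None or pred.numel() == 0:
        return pred
    return pred[pred[:, 4] >= conf_thre]       # keep only detections above the threshold

Fewer kept detections means less accumulates in memory, at the cost of recall on low-confidence objects.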

Hikaylee commented 1 year ago

Because the total number of predictions is too large; even 32 GB of RAM sometimes blows up in our test environment.

You can mitigate it by raising the confidence threshold or adding more RAM, though the former causes a slight drop in AP.

I tried writing the result in every loop iteration with pickle.dump((res[0], res[1]), file_writter), instead of appending everything and writing once at the end, but the resulting pickle file ended up almost 10x larger than before. Logically I couldn't find the problem; is it because pickle.dump compresses the data?
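
On the compression question: pickle.dump applies no compression, and writing the same objects with many dump calls gives essentially the same total size as one big dump, as this self-contained check illustrates (the random float32 arrays are just stand-ins for predictions):

import io
import pickle
import numpy as np

# Stand-ins for per-video prediction results.
data = [np.random.rand(100, 6).astype(np.float32) for _ in range(50)]

buf_once = io.BytesIO()
pickle.dump(data, buf_once)                # one dump of the whole list

buf_per_item = io.BytesIO()
for item in data:
    pickle.dump(item, buf_per_item)        # one dump per element into the same stream

# The two sizes are nearly identical: pickle compresses nothing either way.
print(len(buf_once.getvalue()), len(buf_per_item.getvalue()))

So a file that is roughly 10x larger usually means the per-record objects themselves are heavier than before (for example, still-unconverted tensors or extra fields), which matches the format-conversion question below.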

YuHengsss commented 1 year ago

Did you forget to convert the format? https://github.com/YuHengsss/YOLOV/blob/0331141fbe1572d22f9f0a22f5b88d500e2cbe76/tools/val_to_imdb.py#L330

Hikaylee commented 1 year ago

Did you forget to convert the format?

https://github.com/YuHengsss/YOLOV/blob/0331141fbe1572d22f9f0a22f5b88d500e2cbe76/tools/val_to_imdb.py#L330

Nothing else changed; I only modified the write order at the end:

for ele in res:
    cur_iter += 1
    if cur_iter % 10 == 0:
        print(str(cur_iter) + '/' + str(len(res)))
    first_frame = ele[0][0]
    video_name = first_frame[first_frame.find('val'):first_frame.rfind('/')]

    preds_video = {}
    repp_res = []                                             # changed
    for frames in ele:
        # frames
        if frames == []: continue
        tmp_imgs = []
        for img in frames:
            img = cv2.imread(os.path.join(exp.data_dir, img))
            height, width = img.shape[:2]
            ratio = min(predictor.test_size[0] / img.shape[0], predictor.test_size[1] / img.shape[1])
            img, _ = predictor.preproc(img, None, predictor.test_size)
            img = torch.from_numpy(img)
            tmp_imgs.append(img)
        imgs = torch.stack(tmp_imgs)

        pred_res = predictor.inference(imgs)
        del imgs
        for pred, img_name in zip(pred_res, frames):
            point_idx = img_name.rfind('.')
            image_id = img_name[img_name.find('val'):point_idx]
            img_idx = img_name[img_name.rfind('/') + 1:point_idx]
            det_repp = predictor.to_repp_heavy(pred, ratio, [height, width], image_id)
            preds_video[img_idx] = det_repp
    repp_res = [video_name, preds_video]                      # changed
    pickle.dump((repp_res[0], repp_res[1]), file_writter)     # changed: dump per video instead of once at the end
file_writter.close()                                          # changed
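
One side effect of this per-video pickle.dump pattern is that the reading side has to change as well, since the file now contains many consecutive pickle records instead of a single list; a minimal sketch of a compatible reader (the helper name and file name are illustrative):

import pickle

def load_per_video_dumps(path):
    """Yield (video_name, preds_video) tuples from a file written with repeated pickle.dump calls."""
    with open(path, 'rb') as f:
        while True:
            try:
                yield pickle.load(f)       # reads exactly one dumped record per call
            except EOFError:               # raised once all records have been consumed
                break

# Example usage:
# for video_name, preds_video in load_per_video_dumps('predictions.pkl'):
#     ...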