JDAI-CV / fast-reid

SOTA Re-identification Methods and Toolbox
Apache License 2.0
3.42k stars 837 forks

Memory keeps growing during training and eventually uses up all of the server's memory #673

Closed rrjia closed 2 years ago

rrjia commented 2 years ago
sys.platform            linux
Python                  3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
numpy                   1.22.4
fastreid                1.3 @/ssd8/exec/jiaruoran/python/fast-reid-master/./fastreid
PyTorch                 1.7.1+cu101 @/ssd7/exec/jiaruoran/anaconda3/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1,2,3             Tesla K80
CUDA_HOME               /ssd1/shared/local/cuda-10.1
Pillow                  8.4.0
torchvision             0.8.2+cu101 @/ssd7/exec/jiaruoran/anaconda3/lib/python3.9/site-packages/torchvision
torchvision arch flags  sm_35, sm_50, sm_60, sm_70, sm_75
cv2                     4.5.5
----------------------  -----------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

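Memory grows especially fast when the data_loader num_worker is set to more than one, and it still keeps growing with num_worker=0, which rules out the PyTorch DataLoader as the cause.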

rrjia commented 2 years ago


rrjia commented 2 years ago

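Preliminary analysis: the cause is two of the data augmentation methods, AugMix and RandomPatch. Once I removed both, memory usage went back to normal. By the way, have the JD people responsible for this project all been laid off? Nobody is replying anymore.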
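
One way to check whether a single augmentation step retains memory, as suspected here for AugMix and RandomPatch, is to compare traced heap growth with and without it. The following is only a sketch: `traced_growth`, `leaky_step` and `clean_step` are hypothetical names, the two steps are toy stand-ins rather than fastreid code, and `tracemalloc` only sees allocations made through Python's allocator, so buffers owned by native libraries may not show up.

```python
import tracemalloc


def traced_growth(step, iterations=200):
    """Return the change in traced Python heap size (bytes) after calling step() repeatedly."""
    tracemalloc.start()
    try:
        before, _ = tracemalloc.get_traced_memory()
        for _ in range(iterations):
            step()
        after, _ = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Hypothetical stand-ins for augmentation steps (not fastreid code).
_pool = []


def leaky_step():
    # Keeps a reference to every buffer, so memory grows with each call.
    _pool.append(bytearray(10_000))


def clean_step():
    # The buffer is released as soon as the call returns.
    bytearray(10_000)


if __name__ == "__main__":
    print("leaky growth (bytes):", traced_growth(leaky_step))
    print("clean growth (bytes):", traced_growth(clean_step))
```

Wrapping one real transform call at a time in `step` would show which transform accounts for the growth, at least for memory that Python itself allocates.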

L1aoXingyu commented 2 years ago

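Thanks for reporting this. The person responsible for this project (that's me) has moved on to other work, so the project is currently a labor of love. You're welcome to open a PR to fix this. If I find time I'll try to reproduce the problem (I've been quite busy lately).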

rrjia commented 2 years ago


yougerli commented 2 years ago

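I've also run into steadily growing memory: 20 million samples, workers set to 4, single-GPU batch size 512, and memory is basically exhausted after about one epoch of training. However, I'm not using the AugMix or RandomPatch augmentations, and even with all data augmentation removed it still runs out of memory. The cause is still unclear.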

rrjia commented 2 years ago

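> I've also run into steadily growing memory: 20 million samples, workers set to 4, single-GPU batch size 512, and memory is basically exhausted after about one epoch of training. However, I'm not using the AugMix or RandomPatch augmentations, and even with all data augmentation removed it still runs out of memory. The cause is still unclear.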


  # if do_augmix:
  #     res.append(AugMix(prob=augmix_prob))
  # Some augmentations must be applied after normalization
  if do_rea:
      res.append(T.RandomErasing(p=rea_prob, value=rea_value))
  # if do_rpt:
  #     res.append(RandomPatch(prob_happen=rpt_prob))

yougerli commented 2 years ago



wang11wang commented 2 years ago



import threading


class BackgroundGenerator(threading.Thread):
    """
    the usage is below
    >> for batch in BackgroundGenerator(my_minibatch_iterator):
    >>    doit()
    More details are written in the BackgroundGenerator doc
    >> help(BackgroundGenerator)
    """

    def __init__(self, generator, local_rank, max_prefetch=10):
        """
        This function transforms generator into a background-thread generator.
        :param generator: generator or genexp or any
        It can be used with any minibatch generator.
        It is quite lightweight, but not entirely weightless.
        Using global variables inside generator is not recommended (may raise GIL and zero-out the
        benefit of having a background thread.)
        The ideal use case is when everything it requires is stored inside it and everything it
        outputs is passed through queue.
        """

Try changing max_prefetch to 1, or stop using DataLoaderX and use PyTorch's native DataLoader instead.
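
To illustrate why max_prefetch matters for memory, here is a minimal sketch of a background prefetcher built on a bounded queue. `BoundedPrefetcher` is a hypothetical, simplified class written for this example, not the fastreid or prefetch_generator implementation. The point it shows is that at most `max_prefetch` produced items wait in the queue at any time, so a smaller value keeps fewer batches alive in host memory.

```python
import queue
import threading


class BoundedPrefetcher(threading.Thread):
    """Hypothetical, simplified background prefetcher (illustration only)."""

    _DONE = object()  # sentinel marking the end of the source iterable

    def __init__(self, iterable, max_prefetch=1):
        super().__init__(daemon=True)
        # The queue holds at most max_prefetch ready items.
        self.queue = queue.Queue(maxsize=max_prefetch)
        self.iterable = iterable
        self.start()

    def run(self):
        for item in self.iterable:
            # put() blocks while max_prefetch items are already waiting.
            self.queue.put(item)
        self.queue.put(self._DONE)

    def __iter__(self):
        return self

    def __next__(self):
        item = self.queue.get()
        if item is self._DONE:
            raise StopIteration
        return item


if __name__ == "__main__":
    for batch in BoundedPrefetcher(range(5), max_prefetch=1):
        print(batch)
```

With large batches, each queued item can be sizeable, so lowering the bound from 10 to 1 trades some overlap between loading and compute for a smaller buffer.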

yougerli commented 2 years ago





wang11wang commented 2 years ago


rrjia commented 2 years ago

One more thing I changed: the official PyTorch DataLoader itself also has a memory leak; see this issue: https://github.com/pytorch/pytorch/issues/13246

One expert in that thread offered this solution: "My solution: I follow https://github.com/pytorch/pytorch/issues/1355#issuecomment-341291968 by setting set_start_method but using torch.multiprocessing instead of python multiprocessing. I didn't add cv2.setNumThreads(0)"

import torch.multiprocessing as mp

if __name__ == '__main__':
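
A self-contained sketch of the approach quoted above might look like the following. The start method `"spawn"` is an assumption, since the snippet above does not show which one was used, and `configure_start_method` is a hypothetical helper name. The fallback import exists only so the sketch also runs where PyTorch is not installed; `torch.multiprocessing` re-exports the same `set_start_method` API as the standard library module.

```python
try:
    import torch.multiprocessing as mp  # wrapper around the stdlib module
except ImportError:
    # Fallback so the sketch runs without PyTorch; same API for this call.
    import multiprocessing as mp


def configure_start_method(method="spawn"):
    """Choose how worker processes are started; call before creating any DataLoader."""
    # "spawn" is an assumed choice, not confirmed by the thread.
    # force=True lets the call succeed even if a method was already set.
    mp.set_start_method(method, force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    print("start method:", configure_start_method("spawn"))
    # ... build the DataLoader and run training from here ...
```

The call belongs under the `if __name__ == '__main__':` guard because spawned workers re-import the main module.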


jgysunday commented 2 years ago

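I train with DDP and have also run into a number of problems.
1. At the end of each epoch, training stalls for a while. I inspected the processes involved and found that the DataLoader worker processes are killed when an epoch ends and new ones are started afterwards, which takes a lot of time.
2. At the end of each epoch, memory leaks, so memory usage keeps growing. The old worker processes are not cleaned up completely after the epoch ends, which causes the leak.
Solution: setting the DataLoader parameter persistent_workers to True solves both problems. I'm on PyTorch 1.8.1 and don't know whether newer versions have fixed this. I'm now hitting a shared memory problem that appears at random, and I haven't solved it.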
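
As a rough sketch of the persistent_workers setting described above: `build_loader_kwargs` is a hypothetical helper, not part of fastreid, and it assumes a PyTorch version whose DataLoader accepts `persistent_workers` (the comment above used 1.8.1). The flag is added only when worker processes are requested, because DataLoader rejects `persistent_workers=True` together with `num_workers=0`.

```python
def build_loader_kwargs(num_workers, batch_size):
    """Collect DataLoader keyword arguments that keep workers alive across epochs."""
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # Reuse the same worker processes for every epoch instead of
        # killing and restarting them at each epoch boundary.
        kwargs["persistent_workers"] = True
    return kwargs


if __name__ == "__main__":
    print(build_loader_kwargs(num_workers=4, batch_size=512))
    print(build_loader_kwargs(num_workers=0, batch_size=512))
```

The result could then be passed on as `torch.utils.data.DataLoader(dataset, **build_loader_kwargs(4, 512))`, where `dataset` stands for whatever dataset object the training script already builds.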

yougerli commented 2 years ago


github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 14 days since being marked as stale.