Memory keeps growing during training and eventually uses up all of the server's memory

rrjia commented 2 years ago

sys.platform            linux
Python                  3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0]
numpy                   1.22.4
fastreid                1.3 @/ssd8/exec/jiaruoran/python/fast-reid-master/./fastreid
FASTREID_ENV_MODULE     <not set>
PyTorch                 1.7.1+cu101 @/ssd7/exec/jiaruoran/anaconda3/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1,2,3             Tesla K80
CUDA_HOME               /ssd1/shared/local/cuda-10.1
Pillow                  8.4.0
torchvision             0.8.2+cu101 @/ssd7/exec/jiaruoran/anaconda3/lib/python3.9/site-packages/torchvision
torchvision arch flags  sm_35, sm_50, sm_60, sm_70, sm_75
cv2                     4.5.5
----------------------  -----------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.6.0 (Git Hash 5ef631a030a6f73131c77892041042805a06064f)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75
  - CuDNN 7.6.3
  - Magma 2.5.2
  - Build settings: BLAS=MKL, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DUSE_VULKAN_WRAPPER -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

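When the data loader's num_worker is set to more than one, memory grows especially fast. It also keeps growing with num_worker=0, which rules out the PyTorch DataLoader as the cause.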

rrjia commented 2 years ago

Is none of the maintainers looking at this?

rrjia commented 2 years ago

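Preliminary analysis: the cause is two of the data augmentation methods, AugMix and RandomPatch. After I removed both, memory usage went back to normal. By the way, have the people at JD responsible for this project all been laid off? Nobody is replying anymore.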
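The bisection described above (removing one transform at a time and watching memory) can be made a little more systematic. The following is only a minimal sketch using the standard-library `tracemalloc` module; `measure_growth`, `leaky_transform` and `clean_transform` are hypothetical names made up for illustration and are not part of fast-reid. Note that `tracemalloc` only sees memory allocated through Python's allocator, so growth inside native libraries would need a process-level measurement instead.

```python
import gc
import tracemalloc

_cache = []  # stands in for state that a buggy transform might keep alive


def leaky_transform():
    # Hypothetical transform that accidentally retains every buffer it creates.
    _cache.append(bytearray(10_000))


def clean_transform():
    # Hypothetical transform whose buffer is released as soon as it returns.
    bytearray(10_000)


def measure_growth(fn, iterations):
    """Return how many bytes of traced Python memory remain after calling fn repeatedly."""
    already_tracing = tracemalloc.is_tracing()
    if not already_tracing:
        tracemalloc.start()
    gc.collect()
    before, _peak = tracemalloc.get_traced_memory()
    for _ in range(iterations):
        fn()
    gc.collect()
    after, _peak = tracemalloc.get_traced_memory()
    if not already_tracing:
        tracemalloc.stop()
    return after - before


if __name__ == "__main__":
    print("leaky:", measure_growth(leaky_transform, 50), "bytes")
    print("clean:", measure_growth(clean_transform, 50), "bytes")
```

Wrapping each augmentation call in such a measurement, one at a time, should show which of them keeps references alive, assuming the leak is on the Python side.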

L1aoXingyu commented 2 years ago

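Thanks for reporting this. The person responsible for this project (that is, me) has moved on to other work, so the project is currently kept alive purely as a labor of love. You are welcome to submit a PR to fix this. If I find the time I will try to reproduce the problem (I have been quite busy lately).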

rrjia commented 2 years ago

Haha, I thought the people behind this project had all been let go in JD's last round of layoffs. OK, got it.

yougerli commented 2 years ago

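I also ran into continuously growing memory: 20 million samples, 4 workers, a batch size of 512 on a single GPU, and memory blows up after roughly one epoch of training. However, I am not using the AugMix or RandomPatch augmentations, and it still blows up even after removing all data augmentation. I don't know the cause yet.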

rrjia commented 2 years ago

I assume you disabled these two augmentations in the config file? Try commenting them out directly in the code instead. The path is python/fast-reid-master/fastreid/data/transforms/build.py

  # if do_augmix:
  #     res.append(AugMix(prob=augmix_prob))
  res.append(ToTensor())
  ### Some augmentations have to be applied after normalization
  if do_rea:
      res.append(T.RandomErasing(p=rea_prob, value=rea_value))
  # if do_rpt:
  #     res.append(RandomPatch(prob_happen=rpt_prob))

yougerli commented 2 years ago

@rrjia

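No. Neither the base config file nor defaults.py sets them, and I also checked the config.yaml written out with the logs, so I am sure these two augmentations are not in use.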

wang11wang commented 2 years ago

I use AugMix and have not seen any memory growth.

My guess is that the problem comes from here:

class BackgroundGenerator(threading.Thread):
    """
    the usage is below
    >> for batch in BackgroundGenerator(my_minibatch_iterator):
    >>    doit()
    More details are written in the BackgroundGenerator doc
    >> help(BackgroundGenerator)
    """

    def __init__(self, generator, local_rank, max_prefetch=10):
        """
        This function transforms generator into a background-thead generator.
        :param generator: generator or genexp or any
        It can be used with any minibatch generator.
        It is quite lightweight, but not entirely weightless.
        Using global variables inside generator is not recommended (may raise GIL and zero-out the
        benefit of having a background thread.)
        The ideal use case is when everything it requires is store inside it and everything it
        outputs is passed through queue.

Change max_prefetch to 1, or stop using DataLoaderX and try PyTorch's native DataLoader instead.
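To illustrate why the prefetch depth matters for memory, here is a minimal sketch of a background-thread prefetcher built only on the standard library. `PrefetchIterator` is a hypothetical, simplified stand-in written for this example; it is not the actual BackgroundGenerator code. The point is that the queue size caps how many finished batches can sit in memory at once.

```python
import queue
import threading


class PrefetchIterator(threading.Thread):
    """Hypothetical simplified prefetcher: a thread fills a bounded queue."""

    _END = object()  # sentinel marking the end of the wrapped iterable

    def __init__(self, iterable, max_prefetch=1):
        super().__init__(daemon=True)
        # put() blocks once max_prefetch items are waiting, so at most that
        # many batches are buffered. Note that queue.Queue treats
        # maxsize <= 0 as "unbounded", not as "no buffering".
        self.queue = queue.Queue(maxsize=max_prefetch)
        self.iterable = iterable
        self.start()

    def run(self):
        for item in self.iterable:
            self.queue.put(item)
        self.queue.put(self._END)

    def __iter__(self):
        return self

    def __next__(self):
        item = self.queue.get()
        if item is self._END:
            # Re-queue the sentinel so later next() calls also stop cleanly.
            self.queue.put(self._END)
            raise StopIteration
        return item


if __name__ == "__main__":
    for batch in PrefetchIterator(range(3), max_prefetch=1):
        print(batch)
```

With a large `max_prefetch` and large batches, the buffered items alone can account for a sizeable amount of host memory, which is one reason lowering it is worth trying.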

yougerli commented 2 years ago

Setting max_prefetch to 1 basically means there is no data prefetching, right? Then training would probably get even slower.

wang11wang commented 2 years ago

With 1 there is still prefetching, just one batch ahead. Only 0 means no prefetching.

rrjia commented 2 years ago

One more thing I changed: the official PyTorch DataLoader itself also has a memory leak, see this issue: https://github.com/pytorch/pytorch/issues/13246

One user in that thread posted this solution: My solution: I follow https://github.com/pytorch/pytorch/issues/1355#issuecomment-341291968 by setting set_start_method but using torch.multiprocessing instead of python multiprocessing. I didn't add cv2.setNumThreads(0)

import torch.multiprocessing as mp

if __name__ == '__main__':
    mp.set_start_method('spawn')

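As far as I remember, these two changes are all I made, and memory went from growing rapidly to not growing at all. I hope this helps. The code above goes into the corresponding train_net.py.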
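One practical detail with the snippet above: `set_start_method` raises a RuntimeError if the start method has already been fixed earlier in the process. A guarded variant might look like the sketch below. It uses the standard-library `multiprocessing` module so that it runs without PyTorch; `torch.multiprocessing` is a wrapper around it and is assumed here to expose the same two functions. `ensure_spawn` is a hypothetical helper name.

```python
import multiprocessing as mp  # assumption: torch.multiprocessing can be swapped in


def ensure_spawn():
    """Switch the process-wide start method to 'spawn' if it is not already set."""
    if mp.get_start_method(allow_none=True) != "spawn":
        # force=True overrides a start method that was chosen earlier.
        mp.set_start_method("spawn", force=True)
    return mp.get_start_method()


if __name__ == "__main__":
    print(ensure_spawn())
```

With "spawn", worker processes start from a fresh interpreter instead of a forked copy of the parent, so everything passed to them must be picklable.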

jgysunday commented 2 years ago

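I train with DDP and also ran into a number of problems.
1. At the end of each pass over the data, training stalls for a while. Looking at the processes involved, I found that the DataLoader worker processes are killed when an epoch ends and new ones are started afterwards, which takes a lot of time.
2. At the end of each pass over the data, memory leaks and total usage keeps growing. After an epoch ends, the old worker processes are not cleaned up completely, which causes the leak.
Solution: setting the DataLoader parameter persistent_workers to True resolves both problems. I am on PyTorch 1.8.1 and do not know whether newer versions have fixed this. I am now hitting a shared memory problem that appears at random and that I have not solved.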
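A small sketch of how the persistent_workers setting could be wired in. `loader_kwargs` is a hypothetical helper written for this example, not a fast-reid function. It only enables the flag when there are worker processes, since to my knowledge the DataLoader rejects persistent_workers=True together with num_workers=0. The helper itself does not import PyTorch; the intended usage is shown in the comment.

```python
def loader_kwargs(num_workers, batch_size=64):
    """Build keyword arguments for torch.utils.data.DataLoader (hypothetical helper)."""
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if num_workers > 0:
        # Keep worker processes alive between epochs instead of
        # shutting them down and re-creating them each time.
        kwargs["persistent_workers"] = True
    return kwargs


# Intended usage (assumes a PyTorch version that supports persistent_workers
# and an existing `dataset` object):
#   from torch.utils.data import DataLoader
#   loader = DataLoader(dataset, **loader_kwargs(num_workers=4))

if __name__ == "__main__":
    print(loader_kwargs(num_workers=4))
```

The trade-off is that persistent workers keep their memory for the whole run, so this helps with per-epoch restart cost and cleanup problems but does not by itself reduce steady-state usage.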

yougerli commented 2 years ago

On my side the memory leak only shows up during the first 3 epochs; after that it stabilizes.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 2 years ago

This issue was closed because it has been inactive for 14 days since being marked as stale.

JDAI-CV / fast-reid

Memory keeps growing during training and eventually uses up all of the server's memory #673