Encountered freezing during start training at iteration 0

NatchapolShinno commented 6 months ago

I'm trying to run ViTGaze, but training freezes right after the "Starting training from iteration 0" log line. While it is stuck, the GPU is not being used at all. Several hours have passed and that line is still the last output. I'm training on the VideoAttentionTarget dataset. My logs are below.


[05/08 16:59:38 detectron2]: Model:
GazeAttentionMapper(
  (backbone): ViT(
    (patch_embed): PatchEmbed(
      (proj): Conv2d(3, 384, kernel_size=(14, 14), stride=(14, 14))
    )
    (extra_pos_embed): Identity()
    (blocks): ModuleList(
      (0-11): 12 x Block(
        (norm1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=384, out_features=1152, bias=True)
          (proj): Linear(in_features=384, out_features=384, bias=True)
        )
        (ls1): LayerScale()
        (drop_path): Identity()
        (norm2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=384, out_features=1536, bias=True)
          (act): GELU(approximate='none')
          (drop1): Dropout(p=0.0, inplace=False)
          (norm): Identity()
          (fc2): Linear(in_features=1536, out_features=384, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
        (ls2): LayerScale()
      )
    )
    (norm): Identity()
  )
  (pam): PatchPAM(
    (patch_embed): Sequential(
      (patch_embed): Conv2d(3, 8, kernel_size=(14, 14), stride=(14, 14))
      (act_layer): ReLU(inplace=True)
    )
    (embed): Conv2d(8, 1, kernel_size=(1, 1), stride=(1, 1))
    (aux_embed): Conv2d(8, 1, kernel_size=(1, 1), stride=(1, 1))
  )
  (regressor): UpSampleConv(
    (pre_norm): Identity()
    (conv): Identity()
    (decoder): Sequential(
      (upsample1): Upsample(scale_factor=2.0, mode='bilinear')
      (conv1): Conv2d(24, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn1): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu1): ReLU(inplace=True)
      (upsample2): Upsample(scale_factor=2.0, mode='bilinear')
      (conv2): Conv2d(16, 8, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn2): BatchNorm2d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu2): ReLU(inplace=True)
      (upsample3): Upsample(scale_factor=2.0, mode='bilinear')
      (conv3): Conv2d(8, 1, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (bn3): BatchNorm2d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu3): ReLU(inplace=True)
      (conv): Conv2d(1, 1, kernel_size=(1, 1), stride=(1, 1))
    )
  )
  (classifier): SimpleMlp(
    (classifier): Sequential(
      (dropout0): Dropout(p=0, inplace=False)
      (linear0): Linear(in_features=384, out_features=384, bias=True)
      (relu): ReLU()
      (dropout1): Dropout(p=0, inplace=False)
      (linear1): Linear(in_features=384, out_features=1, bias=True)
    )
  )
  (criterion): GazeMapperCriterion(
    (heatmap_loss): MSELoss()
    (aux_loss): BCEWithLogitsLoss()
  )
)
[05/08 16:59:40 d2.checkpoint.detection_checkpoint]: [DetectionCheckpointer] Loading from /home/slab/ViTGaze/output/gazefollow_518/model_final.pth ...
[05/08 16:59:40 fvcore.common.checkpoint]: [Checkpointer] Loading from /home/slab/ViTGaze/output/gazefollow_518/model_final.pth ...
[05/08 16:59:40 d2.engine.train_loop]: Starting training from iteration 0

Environment:

[05/08 16:59:38 detectron2]: Environment info:
-------------------------------  -----------------------------------------------------------------------
sys.platform                     linux
Python                           3.8.10 (default, Jun  4 2021, 15:09:15) [GCC 7.5.0]
numpy                            1.23.5
detectron2                       0.6 @/home/slab/.local/lib/python3.8/site-packages/detectron2
Compiler                         GCC 9.4
CUDA compiler                    CUDA 11.7
detectron2 arch flags            7.5
DETECTRON2_ENV_MODULE            <not set>
PyTorch                          2.0.1+cu117 @/home/slab/.local/lib/python3.8/site-packages/torch
PyTorch debug build              False
torch._C._GLIBCXX_USE_CXX11_ABI  False
GPU available                    Yes
GPU 0                            NVIDIA TITAN RTX (arch=7.5)
Driver version                   515.43.04
CUDA_HOME                        /usr/local/cuda-11.7
Pillow                           10.3.0
torchvision                      0.15.2+cu117 @/home/slab/.local/lib/python3.8/site-packages/torchvision
torchvision arch flags           3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                           0.1.5.post20221221
iopath                           0.1.9
cv2                              4.8.1
-------------------------------  -----------------------------------------------------------------------
PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 11.7
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86
  - CuDNN 8.5
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,

Could you please help me? Thank you in advance.

github-actions[bot] commented 6 months ago

You've chosen to report an unexpected problem or bug. Unless you already know the root cause of it, please include details about it by filling the issue template. The following information is missing: "Instructions To Reproduce the Issue and Full Logs";

Programmer-RD-AI commented 5 months ago

Hi, this may be caused by the size of the dataset as well as the size of the model. I would recommend training a basic model first and seeing how it performs. Thank you.
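One way to test the dataset-size guess is to check whether the very first batch ever arrives. The snippet below is only a minimal sketch, not part of detectron2 or ViTGaze: `probe_first_batch` is a hypothetical helper name, and `loader` is assumed to be any Python iterable, such as the training data loader your script builds.

```python
import threading
import time


def probe_first_batch(loader, timeout_s=60.0):
    """Try to fetch one item from `loader` in a background thread.

    Returns (finished, seconds_waited). `finished` is False if the fetch
    did not complete within `timeout_s` seconds or if it raised an error.
    Hypothetical helper for illustration only.
    """
    result = {}

    def fetch():
        # Building the iterator and pulling one item is the step that
        # would stall if data loading is the problem.
        result["batch"] = next(iter(loader))

    worker = threading.Thread(target=fetch, daemon=True)
    start = time.perf_counter()
    worker.start()
    worker.join(timeout_s)  # wait at most timeout_s seconds
    finished = "batch" in result
    return finished, time.perf_counter() - start


if __name__ == "__main__":
    # Toy stand-in for a real data loader.
    finished, waited = probe_first_batch([1, 2, 3], timeout_s=5.0)
    print(f"first batch fetched: {finished} (waited {waited:.3f}s)")
```

If the probe times out on the real loader while the GPU stays idle, the stall is more likely in data loading than in the model itself. If the first batch arrives quickly, the cause is probably elsewhere.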

facebookresearch / detectron2

Encountered freezing during start training at iteration 0 #5281
