error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device

using /data Preparation done. Between equal marks is user's output: /root/conda/bin/python running build running build_py running build_ext building 'MultiScaleDeformableAttention' extension Emitting ninja build file /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/build.ninja... Compiling objects... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. g++ -pthread -shared -B /root/conda/compiler_compat -L/root/conda/lib -Wl,-rpath=/root/conda/lib -Wl,--no-as-needed -Wl,--sysroot=/ /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/vision.o /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/cpu/ms_deform_attn_cpu.o /workspace/mask2former/modeling/pixel_decoder/ops/build/temp.linux-x86_64-3.7/workspace/mask2former/modeling/pixel_decoder/ops/src/cuda/ms_deform_attn_cuda.o -L/root/conda/lib/python3.7/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -ltorch_python -lcudart -lc10_cuda -ltorch_cuda_cu -ltorch_cuda_cpp -o build/lib.linux-x86_64-3.7/MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so running install running bdist_egg running egg_info writing MultiScaleDeformableAttention.egg-info/PKG-INFO writing dependency_links to MultiScaleDeformableAttention.egg-info/dependency_links.txt writing top-level names to MultiScaleDeformableAttention.egg-info/top_level.txt reading manifest file 'MultiScaleDeformableAttention.egg-info/SOURCES.txt' writing manifest file 'MultiScaleDeformableAttention.egg-info/SOURCES.txt' installing library code to build/bdist.linux-x86_64/egg running install_lib creating build/bdist.linux-x86_64/egg creating build/bdist.linux-x86_64/egg/functions copying build/lib.linux-x86_64-3.7/functions/init.py -> build/bdist.linux-x86_64/egg/functions copying build/lib.linux-x86_64-3.7/functions/ms_deform_attn_func.py -> build/bdist.linux-x86_64/egg/functions copying build/lib.linux-x86_64-3.7/MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so -> build/bdist.linux-x86_64/egg creating build/bdist.linux-x86_64/egg/modules copying build/lib.linux-x86_64-3.7/modules/ms_deform_attn.py -> build/bdist.linux-x86_64/egg/modules copying build/lib.linux-x86_64-3.7/modules/init.py -> build/bdist.linux-x86_64/egg/modules byte-compiling build/bdist.linux-x86_64/egg/functions/init.py to init.cpython-37.pyc byte-compiling build/bdist.linux-x86_64/egg/functions/ms_deform_attn_func.py to ms_deform_attn_func.cpython-37.pyc byte-compiling build/bdist.linux-x86_64/egg/modules/ms_deform_attn.py to ms_deform_attn.cpython-37.pyc byte-compiling build/bdist.linux-x86_64/egg/modules/init.py to init.cpython-37.pyc creating stub loader for MultiScaleDeformableAttention.cpython-37m-x86_64-linux-gnu.so byte-compiling build/bdist.linux-x86_64/egg/MultiScaleDeformableAttention.py to MultiScaleDeformableAttention.cpython-37.pyc creating build/bdist.linux-x86_64/egg/EGG-INFO copying MultiScaleDeformableAttention.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO copying MultiScaleDeformableAttention.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO copying MultiScaleDeformableAttention.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO copying MultiScaleDeformableAttention.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO writing build/bdist.linux-x86_64/egg/EGG-INFO/native_libs.txt zip_safe flag not set; analyzing archive contents... pycache.MultiScaleDeformableAttention.cpython-37: module references file creating 'dist/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg' and adding 'build/bdist.linux-x86_64/egg' to it removing 'build/bdist.linux-x86_64/egg' (and everything under it) Processing MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg removing '/root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg' (and everything under it) creating /root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg Extracting MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg to /root/conda/lib/python3.7/site-packages MultiScaleDeformableAttention 1.0 is already the active version in easy-install.pth

Installed /root/conda/lib/python3.7/site-packages/MultiScaleDeformableAttention-1.0-py3.7-linux-x86_64.egg Processing dependencies for MultiScaleDeformableAttention==1.0 Finished processing dependencies for MultiScaleDeformableAttention==1.0 run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET Command Line Args: Namespace(config_file='configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False) run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET [02/22 03:41:38 detectron2]: Rank of current process: 0. World size: 8 [02/22 03:41:40 detectron2]: Environment info:

sys.platform linux Python 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0] numpy 1.19.2 detectron2 0.6 @/root/conda/lib/python3.7/site-packages/detectron2 Compiler GCC 7.3 CUDA compiler CUDA 11.1 detectron2 arch flags 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6 DETECTRON2_ENV_MODULE PyTorch 1.9.0 @/root/conda/lib/python3.7/site-packages/torch PyTorch debug build False GPU available Yes GPU 0,1,2,3,4,5,6,7 GeForce RTX 3090 (arch=8.6) Driver version 460.73.01 CUDA_HOME /usr/local/cuda TORCH_CUDA_ARCH_LIST 6.0;6.1;6.2;7.0;7.5 Pillow 8.0.1 torchvision 0.10.0 @/root/conda/lib/python3.7/site-packages/torchvision torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6 fvcore 0.1.5.post20220212 iopath 0.1.9 cv2 4.1.2

PyTorch built with:

GCC 7.3
C++ Version: 201402
Intel(R) Math Kernel Library Version 2020.0.2 Product Build 20200624 for Intel(R) 64 architecture applications
Intel(R) MKL-DNN v2.1.2 (Git Hash 98be7e8afa711dc9b66c8ff3504129cb82013cdb)
OpenMP 201511 (a.k.a. OpenMP 4.5)
NNPACK is enabled
CPU capability usage: AVX2
CUDA Runtime 11.1
NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37
CuDNN 8.0.5
Magma 2.5.2
Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.1, CUDNN_VERSION=8.0.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.9.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON,

[02/22 03:41:40 detectron2]: Command line arguments: Namespace(config_file='configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml', dist_url='tcp://127.0.0.1:49152', eval_only=False, machine_rank=0, num_gpus=8, num_machines=1, opts=[], resume=False) [02/22 03:41:40 detectron2]: Contents of args.config_file=configs/youtubevis_2019/video_maskformer2_R50_bs16_8ep.yaml: BASE: Base-YouTubeVIS-VideoInstanceSegmentation.yaml MODEL: WEIGHTS: 186m"186mmodel_final_3c8ec9.pkl186m" META_ARCHITECTURE: 186m"186mVideoMaskFormer186m" SEM_SEG_HEAD: NAME: 186m"186mMaskFormerHead186m" IGNORE_VALUE: 255 NUM_CLASSES: 40 LOSS_WEIGHT: 1.0 CONVS_DIM: 256 MASK_DIM: 256 NORM: 186m"186mGN186m" 242m# pixel decoder PIXEL_DECODER_NAME: 186m"186mMSDeformAttnPixelDecoder186m" IN_FEATURES: [186m"186mres2186m", 186m"186mres3186m", 186m"186mres4186m", 186m"186mres5186m"] DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES: [186m"186mres3186m", 186m"186mres4186m", 186m"186mres5186m"] COMMON_STRIDE: 4 TRANSFORMER_ENC_LAYERS: 6 MASK_FORMER: TRANSFORMER_DECODER_NAME: 186m"186mVideoMultiScaleMaskedTransformerDecoder186m" TRANSFORMER_IN_FEATURE: 186m"186mmulti_scale_pixel_decoder186m" DEEP_SUPERVISION: True NO_OBJECT_WEIGHT: 0.1 CLASS_WEIGHT: 2.0 MASK_WEIGHT: 5.0 DICE_WEIGHT: 5.0 HIDDEN_DIM: 256 NUM_OBJECT_QUERIES: 100 NHEADS: 8 DROPOUT: 0.0 DIM_FEEDFORWARD: 2048 ENC_LAYERS: 0 PRE_NORM: False ENFORCE_INPUT_PROJ: False SIZE_DIVISIBILITY: 32 DEC_LAYERS: 10 242m# 9 decoder layers, add one for the loss on learnable query TRAIN_NUM_POINTS: 12544 OVERSAMPLE_RATIO: 3.0 IMPORTANCE_SAMPLE_RATIO: 0.75 TEST: SEMANTIC_ON: False INSTANCE_ON: True PANOPTIC_ON: False OVERLAP_THRESHOLD: 0.8 OBJECT_MASK_THRESHOLD: 0.8

[02/22 03:41:40 detectron2]: Running with full config: CUDNN_BENCHMARK: false DATALOADER: ASPECT_RATIO_GROUPING: true FILTER_EMPTY_ANNOTATIONS: false NUM_WORKERS: 4 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: [] PROPOSAL_FILES_TRAIN: [] TEST:

ytvis_2019_val TRAIN:
ytvis_2019_train GLOBAL: HACK: 1.0 INPUT: AUGMENTATIONS: [] COLOR_AUG_SSD: false CROP: ENABLED: false SINGLE_CATEGORY_MAX_AREA: 1.0 SIZE:
- 600
- 720 TYPE: absolute_range DATASET_MAPPER_NAME: mask_former_semantic FORMAT: RGB IMAGE_SIZE: 1024 MASK_FORMAT: polygon MAX_SCALE: 2.0 MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SCALE: 0.1 MIN_SIZE_TEST: 360 MIN_SIZE_TRAIN:
360
480 MIN_SIZE_TRAIN_SAMPLING: choice_by_clip RANDOM_FLIP: flip_by_clip SAMPLING_FRAME_NUM: 2 SAMPLING_FRAME_RANGE: 20 SAMPLING_FRAME_SHUFFLE: false SIZE_DIVISIBILITY: -1 MODEL: ANCHOR_GENERATOR: ANGLES:
- - -90
  - 0
  - 90 ASPECT_RATIOS:
- - 0.5
  - 1.0
  - 2.0 NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES:
- - 32
  - 64
  - 128
  - 256
  - 512 BACKBONE: FREEZE_AT: 0 NAME: build_resnet_backbone DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES: [] NORM: 186m'186m' OUT_CHANNELS: 256 KEYPOINT_ON: false LOAD_PROPOSALS: false MASK_FORMER: CLASS_WEIGHT: 2.0 DEC_LAYERS: 10 DEEP_SUPERVISION: true DICE_WEIGHT: 5.0 DIM_FEEDFORWARD: 2048 DROPOUT: 0.0 ENC_LAYERS: 0 ENFORCE_INPUT_PROJ: false HIDDEN_DIM: 256 IMPORTANCE_SAMPLE_RATIO: 0.75 MASK_WEIGHT: 5.0 NHEADS: 8 NO_OBJECT_WEIGHT: 0.1 NUM_OBJECT_QUERIES: 100 OVERSAMPLE_RATIO: 3.0 PRE_NORM: false SIZE_DIVISIBILITY: 32 TEST: INSTANCE_ON: true OBJECT_MASK_THRESHOLD: 0.8 OVERLAP_THRESHOLD: 0.8 PANOPTIC_ON: false SEMANTIC_ON: false SEM_SEG_POSTPROCESSING_BEFORE_INFERENCE: false TRAIN_NUM_POINTS: 12544 TRANSFORMER_DECODER_NAME: VideoMultiScaleMaskedTransformerDecoder TRANSFORMER_IN_FEATURE: multi_scale_pixel_decoder MASK_ON: true META_ARCHITECTURE: VideoMaskFormer PANOPTIC_FPN: COMBINE: ENABLED: true INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN:
123.675
116.28
103.53 PIXEL_STD:
58.395
57.12
57.375 PROPOSAL_GENERATOR: MIN_SIZE: 0 NAME: RPN RESNETS: DEFORM_MODULATED: false DEFORM_NUM_GROUPS: 1 DEFORM_ON_PER_STAGE:
- false
- false
- false
- false DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES:
- res2
- res3
- res4
- res5 RES2_OUT_CHANNELS: 256 RES4_DILATION: 1 RES5_DILATION: 1 RES5_MULTI_GRID:
- 1
- 1
- 1 STEM_OUT_CHANNELS: 64 STEM_TYPE: basic STRIDE_IN_1X1: false WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_WEIGHTS: &id001
- 1.0
- 1.0
- 1.0
- 1.0 FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES:
- p3
- p4
- p5
- p6
- p7 IOU_LABELS:
- 0
- -1
- 1 IOU_THRESHOLDS:
- 0.4
- 0.5 NMS_THRESH_TEST: 0.5 NORM: 186m'186m' NUM_CLASSES: 80 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS:
- - 10.0
  - 10.0
  - 5.0
  - 5.0
- - 20.0
  - 20.0
  - 10.0
  - 10.0
- - 30.0
  - 30.0
  - 15.0
  - 15.0 IOUS:
- 0.5
- 0.6
- 0.7 ROI_BOX_HEAD: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS:
- 10.0
- 10.0
- 5.0
- 5.0 CLS_AGNOSTIC_BBOX_REG: false CONV_DIM: 256 FC_DIM: 1024 NAME: 186m'186m' NORM: 186m'186m' NUM_CONV: 0 NUM_FC: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: false ROI_HEADS: BATCH_SIZE_PER_IMAGE: 512 IN_FEATURES:
- res4 IOU_LABELS:
- 0
- 1 IOU_THRESHOLDS:
- 0.5 NAME: Res5ROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 80 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: true SCORE_THRESH_TEST: 0.05 ROI_KEYPOINT_HEAD: CONV_DIMS:
- 512
- 512
- 512
- 512
- 512
- 512
- 512
- 512 LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: true NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: false CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: 186m'186m' NUM_CONV: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS: *id001 BOUNDARY_THRESH: -1 CONV_DIMS:
- -1 HEAD_NAME: StandardRPNHead IN_FEATURES:
- res4 IOU_LABELS:
- 0
- -1
- 1 IOU_THRESHOLDS:
- 0.3
- 0.7 LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 2000 PRE_NMS_TOPK_TEST: 6000 PRE_NMS_TOPK_TRAIN: 12000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: ASPP_CHANNELS: 256 ASPP_DILATIONS:
- 6
- 12
- 18 ASPP_DROPOUT: 0.1 COMMON_STRIDE: 4 CONVS_DIM: 256 DEFORMABLE_TRANSFORMER_ENCODER_IN_FEATURES:
- res3
- res4
- res5 DEFORMABLE_TRANSFORMER_ENCODER_N_HEADS: 8 DEFORMABLE_TRANSFORMER_ENCODER_N_POINTS: 4 IGNORE_VALUE: 255 IN_FEATURES:
- res2
- res3
- res4
- res5 LOSS_TYPE: hard_pixel_mining LOSS_WEIGHT: 1.0 MASK_DIM: 256 NAME: MaskFormerHead NORM: GN NUM_CLASSES: 40 PIXEL_DECODER_NAME: MSDeformAttnPixelDecoder PROJECT_CHANNELS:
- 48 PROJECT_FEATURES:
- res2 TRANSFORMER_ENC_LAYERS: 6 USE_DEPTHWISE_SEPARABLE_CONV: false SWIN: APE: false ATTN_DROP_RATE: 0.0 DEPTHS:
- 2
- 2
- 6
- 2 DROP_PATH_RATE: 0.3 DROP_RATE: 0.0 EMBED_DIM: 96 MLP_RATIO: 4.0 NUM_HEADS:
- 3
- 6
- 12
- 24 OUT_FEATURES:
- res2
- res3
- res4
- res5 PATCH_NORM: true PATCH_SIZE: 4 PRETRAIN_IMG_SIZE: 224 QKV_BIAS: true QK_SCALE: null USE_CHECKPOINT: false WINDOW_SIZE: 7 WEIGHTS: /data/bolu.ldz/PRETRAINED_WEIGHTS/mask2former/model_final_3c8ec9.pkl OUTPUT_DIR: /summary SEED: -1 SOLVER: AMP: ENABLED: true BACKBONE_MULTIPLIER: 0.1 BASE_LR: 0.0001 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 5000 CLIP_GRADIENTS: CLIP_TYPE: full_model CLIP_VALUE: 0.01 ENABLED: true NORM_TYPE: 2.0 GAMMA: 0.1 IMS_PER_BATCH: 16 LR_SCHEDULER_NAME: WarmupMultiStepLR MAX_ITER: 6000 MOMENTUM: 0.9 NESTEROV: false OPTIMIZER: ADAMW POLY_LR_CONSTANT_ENDING: 0.0 POLY_LR_POWER: 0.9 REFERENCE_WORLD_SIZE: 0 STEPS:
4000 WARMUP_FACTOR: 1.0 WARMUP_ITERS: 10 WARMUP_METHOD: linear WEIGHT_DECAY: 0.05 WEIGHT_DECAY_BIAS: null WEIGHT_DECAY_EMBED: 0.0 WEIGHT_DECAY_NORM: 0.0 TEST: AUG: ENABLED: false FLIP: true MAX_SIZE: 4000 MIN_SIZES:
- 400
- 500
- 600
- 700
- 800
- 900
- 1000
- 1100
- 1200 DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 0 EXPECTED_RESULTS: [] KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: false NUM_ITER: 200 VERSION: 2 VIS_PERIOD: 0

[02/22 03:41:40 detectron2]: Full config saved to /summary/config.yaml [02/22 03:41:40 d2.utils.env]: Using a generated random seed 40230477

VideoMaskFormer( (backbone): ResNet( (stem): BasicStem( (conv1): Conv2d( 3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) ) (res2): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv1): Conv2d( 64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) ) (res3): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv1): Conv2d( 256, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) ) (res4): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) (conv1): Conv2d( 512, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (4): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (5): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) ) (res5): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) (conv1): Conv2d( 1024, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) ) ) (sem_seg_head): MaskFormerHead( (pixel_decoder): MSDeformAttnPixelDecoder( (input_proj): ModuleList( (0): Sequential( (0): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1)) (1): GroupNorm(32, 256, eps=1e-05, affine=True) ) (1): Sequential( (0): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1)) (1): GroupNorm(32, 256, eps=1e-05, affine=True) ) (2): Sequential( (0): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1)) (1): GroupNorm(32, 256, eps=1e-05, affine=True) ) ) (transformer): MSDeformAttnTransformerEncoderOnly( (encoder): MSDeformAttnTransformerEncoder( (layers): ModuleList( (0): MSDeformAttnTransformerEncoderLayer( (self_attn): MSDeformAttn( (sampling_offsets): Linear(in_features=256, out_features=192, bias=True) (attention_weights): Linear(in_features=256, out_features=96, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) (dropout1): Dropout(p=0.0, inplace=False) (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (linear1): Linear(in_features=256, out_features=1024, bias=True) (dropout2): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=1024, out_features=256, bias=True) (dropout3): Dropout(p=0.0, inplace=False) (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (1): MSDeformAttnTransformerEncoderLayer( (self_attn): MSDeformAttn( (sampling_offsets): Linear(in_features=256, out_features=192, bias=True) (attention_weights): Linear(in_features=256, out_features=96, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) (dropout1): Dropout(p=0.0, inplace=False) (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (linear1): Linear(in_features=256, out_features=1024, bias=True) (dropout2): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=1024, out_features=256, bias=True) (dropout3): Dropout(p=0.0, inplace=False) (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (2): MSDeformAttnTransformerEncoderLayer( (self_attn): MSDeformAttn( (sampling_offsets): Linear(in_features=256, out_features=192, bias=True) (attention_weights): Linear(in_features=256, out_features=96, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) (dropout1): Dropout(p=0.0, inplace=False) (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (linear1): Linear(in_features=256, out_features=1024, bias=True) (dropout2): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=1024, out_features=256, bias=True) (dropout3): Dropout(p=0.0, inplace=False) (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (3): MSDeformAttnTransformerEncoderLayer( (self_attn): MSDeformAttn( (sampling_offsets): Linear(in_features=256, out_features=192, bias=True) (attention_weights): Linear(in_features=256, out_features=96, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) (dropout1): Dropout(p=0.0, inplace=False) (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (linear1): Linear(in_features=256, out_features=1024, bias=True) (dropout2): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=1024, out_features=256, bias=True) (dropout3): Dropout(p=0.0, inplace=False) (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (4): MSDeformAttnTransformerEncoderLayer( (self_attn): MSDeformAttn( (sampling_offsets): Linear(in_features=256, out_features=192, bias=True) (attention_weights): Linear(in_features=256, out_features=96, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) (dropout1): Dropout(p=0.0, inplace=False) (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (linear1): Linear(in_features=256, out_features=1024, bias=True) (dropout2): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=1024, out_features=256, bias=True) (dropout3): Dropout(p=0.0, inplace=False) (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (5): MSDeformAttnTransformerEncoderLayer( (self_attn): MSDeformAttn( (sampling_offsets): Linear(in_features=256, out_features=192, bias=True) (attention_weights): Linear(in_features=256, out_features=96, bias=True) (value_proj): Linear(in_features=256, out_features=256, bias=True) (output_proj): Linear(in_features=256, out_features=256, bias=True) ) (dropout1): Dropout(p=0.0, inplace=False) (norm1): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (linear1): Linear(in_features=256, out_features=1024, bias=True) (dropout2): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=1024, out_features=256, bias=True) (dropout3): Dropout(p=0.0, inplace=False) (norm2): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) ) ) (pe_layer): Positional encoding PositionEmbeddingSine num_pos_feats: 128 temperature: 10000 normalize: True scale: 6.283185307179586 (mask_features): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1)) (adapter_1): Conv2d( 256, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): GroupNorm(32, 256, eps=1e-05, affine=True) ) (layer_1): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): GroupNorm(32, 256, eps=1e-05, affine=True) ) ) (predictor): VideoMultiScaleMaskedTransformerDecoder( (pe_layer): PositionEmbeddingSine3D() (transformer_self_attention_layers): ModuleList( (0): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (1): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (2): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (3): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (4): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (5): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (6): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (7): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (8): SelfAttentionLayer( (self_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) ) (transformer_cross_attention_layers): ModuleList( (0): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (1): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (2): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (3): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (4): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (5): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (6): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (7): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) (8): CrossAttentionLayer( (multihead_attn): MultiheadAttention( (out_proj): NonDynamicallyQuantizableLinear(in_features=256, out_features=256, bias=True) ) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (dropout): Dropout(p=0.0, inplace=False) ) ) (transformer_ffn_layers): ModuleList( (0): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (1): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (2): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (3): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (4): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (5): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (6): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (7): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) (8): FFNLayer( (linear1): Linear(in_features=256, out_features=2048, bias=True) (dropout): Dropout(p=0.0, inplace=False) (linear2): Linear(in_features=2048, out_features=256, bias=True) (norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) ) ) (decoder_norm): LayerNorm((256,), eps=1e-05, elementwise_affine=True) (query_feat): Embedding(100, 256) (query_embed): Embedding(100, 256) (level_embed): Embedding(3, 256) (input_proj): ModuleList( (0): Sequential() (1): Sequential() (2): Sequential() ) (class_embed): Linear(in_features=256, out_features=41, bias=True) (mask_embed): MLP( (layers): ModuleList( (0): Linear(in_features=256, out_features=256, bias=True) (1): Linear(in_features=256, out_features=256, bias=True) (2): Linear(in_features=256, out_features=256, bias=True) ) ) ) ) (criterion): Criterion VideoSetCriterion matcher: Matcher VideoHungarianMatcher cost_class: 2.0 cost_mask: 5.0 cost_dice: 5.0 losses: ['labels', 'masks'] weight_dict: {'loss_ce': 2.0, 'loss_mask': 5.0, 'loss_dice': 5.0, 'loss_ce_0': 2.0, 'loss_mask_0': 5.0, 'loss_dice_0': 5.0, 'loss_ce_1': 2.0, 'loss_mask_1': 5.0, 'loss_dice_1': 5.0, 'loss_ce_2': 2.0, 'loss_mask_2': 5.0, 'loss_dice_2': 5.0, 'loss_ce_3': 2.0, 'loss_mask_3': 5.0, 'loss_dice_3': 5.0, 'loss_ce_4': 2.0, 'loss_mask_4': 5.0, 'loss_dice_4': 5.0, 'loss_ce_5': 2.0, 'loss_mask_5': 5.0, 'loss_dice_5': 5.0, 'loss_ce_6': 2.0, 'loss_mask_6': 5.0, 'loss_dice_6': 5.0, 'loss_ce_7': 2.0, 'loss_mask_7': 5.0, 'loss_dice_7': 5.0, 'loss_ce_8': 2.0, 'loss_mask_8': 5.0, 'loss_dice_8': 5.0} num_classes: 40 eos_coef: 0.1 num_points: 12544 oversample_ratio: 3.0 importance_sample_ratio: 0.75 ) [02/22 03:41:45 mask2former_video.data_video.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(360, 480), max_size=1333, sample_style='choice_by_clip', clip_frame_cnt=2), RandomFlip(clip_frame_cnt=2)] [02/22 03:41:57 mask2former_video.data_video.datasets.ytvis]: Loading /data/bolu.ldz/DATASET/YoutubeVOS2019/train.json takes 12.59 seconds. [02/22 03:41:57 mask2former_video.data_video.datasets.ytvis]: Loaded 2238 videos in YTVIS format from /data/bolu.ldz/DATASET/YoutubeVOS2019/train.json [02/22 03:42:05 mask2former_video.data_video.build]: Using training sampler TrainingSampler [02/22 03:42:19 d2.data.common]: Serializing 2238 elements to byte tensors and concatenating them all ... [02/22 03:42:19 d2.data.common]: Serialized dataset takes 151.32 MiB [02/22 03:42:20 fvcore.common.checkpoint]: [Checkpointer] Loading from /data/bolu.ldz/PRETRAINED_WEIGHTS/mask2former/model_final_3c8ec9.pkl ... [02/22 03:42:22 fvcore.common.checkpoint]: Reading a file from 'MaskFormer Model Zoo' WARNING [02/22 03:42:22 mask2former_video.modeling.transformer_decoder.video_mask2former_transformer_decoder]: Weight format of VideoMultiScaleMaskedTransformerDecoder have changed! Please upgrade your models. Applying automatic conversion now ... WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'sem_seg_head.predictor.class_embed.weight' to the model due to incompatible shapes: (81, 256) in the checkpoint but (41, 256) in the model! You might want to double check if this is expected. WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'sem_seg_head.predictor.class_embed.bias' to the model due to incompatible shapes: (81,) in the checkpoint but (41,) in the model! You might want to double check if this is expected. WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Skip loading parameter 'criterion.empty_weight' to the model due to incompatible shapes: (81,) in the checkpoint but (41,) in the model! You might want to double check if this is expected. WARNING [02/22 03:42:22 fvcore.common.checkpoint]: Some model parameters or buffers are not found in the checkpoint: criterion.empty_weight sem_seg_head.predictor.class_embed.{bias, weight} [02/22 03:42:22 d2.engine.train_loop]: Starting training from iteration 0 run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET run on: autodrive DETECTRON2_DATASETS: /data/bolu.ldz/DATASET error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device

facebookresearch / Mask2Former

error in ms_deformable_im2col_cuda: no kernel image is available for execution on the device #53

54 seems to find the problem