facebookresearch / Mask2Former

Code release for "Masked-attention Mask Transformer for Universal Image Segmentation"
MIT License

Demo of video instance segmentation #78

Open kl2005ad opened 2 years ago

kl2005ad commented 2 years ago

I was using demo_video/demo.py to run VIS inference with the YouTubeVIS 2019/2021 models on some videos of resolution 960x540, and I got these errors:

  1. For bigger models that use Swin, I get an out-of-memory error, even though I'm using a V100 with 32G of memory: "RuntimeError: CUDA out of memory..."
  2. For smaller models such as R50 or R101, I get this error: "RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input."

Just FYI, I can run the largest panoptic segmentation models (Swin-L) without any GPU memory issue, so it is probably not an environment problem? Any suggestion is appreciated.

niemiaszek commented 2 years ago

I'm having a similar OOM issue, even on R50, and I can also run Swin-L with panoptic segmentation. My understanding (after a quick glance at the tech report on VIS with this architecture) is that using 3D features etc. for video can require quite a lot of memory, and it doesn't scale well with the length of the video.

Since I only have an RTX 3080 (10G) in my personal workstation, with ~8.5G of free memory I could run up to 45 frames of 448x256 video on R50 (if I remember correctly, higher resolutions get clipped down to some size). I could also run up to 15 frames on Swin-L. I'm wondering whether it is possible to run inference with this architecture on longer videos, other than by getting more VRAM? @kl2005ad Have you found any way to resolve your issue? Could we get a reference for the memory needed to run VIS on videos, please?

The only "fast workaround" I have in mind is splitting each recording into batches that fit into memory and then applying some post-processing to connect the masks, maybe with some overlap between batches for better object reference (see the sketch below). This obviously ruins the whole-video idea proposed in the paper.
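A minimal sketch of that batching idea, assuming a hypothetical run_vis_on_clip helper for per-clip inference; the chunk size, overlap, and the track-stitching step are placeholders, not anything taken from the Mask2Former demo:

```python
def chunk_frames(frames, chunk_size=40, overlap=5):
    """Yield (start_index, clip) pairs of overlapping slices of `frames`,
    so that each clip is short enough to fit in GPU memory."""
    step = chunk_size - overlap
    for start in range(0, len(frames), step):
        clip = frames[start:start + chunk_size]
        if clip:
            yield start, clip
        if start + chunk_size >= len(frames):
            break

# for start, clip in chunk_frames(all_frames):
#     predictions = run_vis_on_clip(clip)  # hypothetical per-clip VIS inference
#     # ...match instance IDs across the overlapping frames to link tracks...
```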

kl2005ad commented 2 years ago

@niemiaszek In further experiments I also found that inference works on really short clips. The Swin-L model can handle somewhere between 100 and 150 frames at a time on a V100 GPU with 32G of memory. So GPU memory consumption is indeed at least part of the problem.

bowenc0221 commented 2 years ago

Are you able to run the demo on CPU?

niemiaszek commented 2 years ago

@bowenc0221 Thank you for your reply and for the overall maintenance! It fits quite well into 32G of RAM. Is it possible to run multi-GPU inference? I guess it might not be possible within one recording. The biggest single GPU I could get is an RTX 8000 with 48G, but I could run a node of them. @kl2005ad Thanks for the additional reference on memory usage!

kl2005ad commented 2 years ago

@bowenc0221 Why would we want to run without a GPU? I gave it a try and it gives me this error:

  File "/user/dev/Mask2Former/demo_video/../mask2former/modeling/pixel_decoder/ops/functions/ms_deform_attn_func.py", line 36, in forward
    output = MSDA.ms_deform_attn_forward(
RuntimeError: Not implemented on the CPU

niemiaszek commented 2 years ago

@kl2005ad I guess the reason is that if you have more CPU RAM than GPU VRAM (just like in my case, 32G vs 10G), you can load longer videos, even though it's way slower. It worked on CPU for me; did you follow this setup? I followed the "Example conda environment setup" from the install instructions, with CUDA 11.4 drivers and cuDNN 8.2 on my host, and I can run the R50/R101 backbones and everything on CPU.
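For reference, a minimal sketch (not a verified recipe) of one way to point the model at the CPU, assuming the demo builds a standard detectron2 config as demo_video/demo.py does; the config path is a placeholder and the Mask2Former-specific add_*_config hooks are elided:

```python
from detectron2.config import get_cfg

cfg = get_cfg()
# ...apply the repo's add_*_config hooks here, as demo_video/demo.py does...
cfg.merge_from_file("path/to/youtubevis_config.yaml")  # placeholder config path
cfg.MODEL.DEVICE = "cpu"  # same effect as passing `--opts MODEL.DEVICE cpu` to the demo
cfg.freeze()
```

Note that kl2005ad's traceback above shows the MSDeformAttn op raising "Not implemented on the CPU", so whether this actually runs may depend on the environment.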

kl2005ad commented 2 years ago

@niemiaszek I followed the official install process. CPU is not an option for me anyway. Thanks for your info, though. Now we know that the inference can run on GPU, just constrained to very short video clips.

zzhmx commented 1 year ago

I think this is an issue with the official code itself.