MCG-NKU / E2FGVI

Official code for "Towards An End-to-End Framework for Flow-Guided Video Inpainting" (CVPR2022)

Hi about the memory error #29

Open tchen0623 opened 2 years ago

tchen0623 commented 2 years ago

When I was trying to run my own video, I ran into a memory problem.

RuntimeError: CUDA out of memory. Tried to allocate 1.62 GiB (GPU 0; 8.00 GiB total capacity; 5.05 GiB already allocated; 0 bytes free; 7.01 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The frame count and video size are smaller than the tennis and schoolgirl demos. I am able to run those two demos, but not my own.
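
As the error message itself suggests, one thing to try before any code changes is the allocator setting. A minimal sketch, assuming it is set before PyTorch allocates any CUDA memory (the 128 value is only a starting point to tune):

    import os

    # Ask the caching allocator to cap split block size, as the OOM message
    # suggests; this must be set before any CUDA memory is allocated.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # imported after the variable is set so the allocator sees it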

Paper99 commented 2 years ago

Could you tell me the spatial resolution of your video?

tchen0623 commented 2 years ago

Hi, it's 720p. I am able to run my own video by changing the --step setting in test.py, but the result is terrible.

Paper99 commented 2 years ago

We use a GPU with 48 GB of memory to process 720p video. The --step setting does affect performance. To check the influence of this parameter, you could keep the original settings and use E2FGVI to process a downscaled version of your video.
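
For reference, a minimal sketch of downscaling an extracted frame directory before running test.py, assuming OpenCV is available; src_dir, dst_dir, and the 0.5 factor are placeholders:

    import os
    import glob
    import cv2

    src_dir = "frames_720p"   # placeholder: directory of extracted frames
    dst_dir = "frames_half"   # placeholder: output directory
    os.makedirs(dst_dir, exist_ok=True)

    for path in glob.glob(os.path.join(src_dir, "*.png")):
        img = cv2.imread(path)
        # Halve both dimensions; adjust the factor to fit your GPU memory.
        small = cv2.resize(img, None, fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)
        cv2.imwrite(os.path.join(dst_dir, os.path.basename(path)), small)

The masks need the same resize (nearest-neighbor interpolation works well for binary masks) so they stay aligned with the downscaled frames.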

Teravus commented 1 year ago

At least for me, the problem was the number of frames, since they were all loaded into VRAM at once. I solved it with some trivial code changes that batch the frames sent to inference, which works as long as the batch size is evenly divisible by neighbor_stride.

Done this way, the results do not suffer and VRAM usage is reduced. Is this something you would like as a pull request, @Paper99? (Note to self, if you say yes: this edit is on my D drive under the E2FGVI folder.)

firebeasty commented 1 year ago

Pretty much zero experience in this rodeo @Teravus, but any chance you could share specifically how you set up the batching? I'm definitely hitting that same limit on my own samples. Thanks!

Teravus commented 1 year ago

Here's my modified test.py in the zip: test.zip

The main thing is: create a stride, then loop over the strides, putting only the frames for the current stride into VRAM at a time. On the next stride, replace those images in VRAM with the new ones.

That way they're never all in VRAM at once.

For the frames and masks:

    x_frames = [rframes[i:i + framestride] for i in range(0, len(rframes), framestride)]
    x_masks = [rmasks[i:i + framestride] for i in range(0, len(rmasks), framestride)]

Look for the line that says framestride = 200.

You can raise or lower it depending on your VRAM: lowering it makes the process use less VRAM, raising it makes it use more.

A couple of things to note:

  • framestride must be evenly divisible by neighbor_stride. neighbor_stride is 5 by default, but you can change it on the command line; with the default, pick any multiple of 5.
  • Each stride's worth of images is processed as its own pass, so you'll see the progress bar go from start to end multiple times depending on your framestride. For example, with 750 frames and a framestride of 200, you'll see it go from 0 to full four times, and the last stride will be shorter.
  • Across all of the strides, the result for each frame is collected in comp_frames, which then gets turned into your video.
  • I didn't update the little video player UI that comes up after the video is complete. It will display only part of your completed video, but the output video file will be the complete video.
  • I've only tested this change with a directory of individual frames, not with a video input file directly. It might work, but I haven't tested it. I use ffmpeg to extract the frames.

Hopefully this is helpful.
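
For anyone who just wants the shape of the change, here's a minimal sketch of that outer loop. rframes and rmasks stand for the already-loaded frame and mask lists, and run_stride is a placeholder for the existing per-stride inference code in test.py, not a function from the repo:

    framestride = 200  # frames held in VRAM at once; must be a multiple of neighbor_stride

    rframes = list(range(750))  # stand-in for the loaded frames
    rmasks = list(range(750))   # stand-in for the loaded masks

    def run_stride(frames_chunk, masks_chunk):
        # Placeholder: move only this chunk to the GPU, run the model on it,
        # and return the completed (inpainted) frames for the chunk.
        return frames_chunk

    # Slice the full frame/mask lists into strides of `framestride` frames.
    x_frames = [rframes[i:i + framestride] for i in range(0, len(rframes), framestride)]
    x_masks = [rmasks[i:i + framestride] for i in range(0, len(rmasks), framestride)]

    comp_frames = []
    for frames_chunk, masks_chunk in zip(x_frames, x_masks):
        # Only the current chunk's tensors live on the GPU; when the loop
        # moves on, the previous chunk goes out of scope and is freed.
        comp_frames.extend(run_stride(frames_chunk, masks_chunk))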

firebeasty commented 1 year ago

This was really helpful! I got it working great! I had to change your cuda:1 line to cuda since I only have one GPU, but otherwise it worked great with some minor parameter tweaking.

I am curious whether there's a way to have it link up temporally between batches. I do notice some minor pops where the strides are broken up. Is there some way to carry the data from the last frame over to the next stride so there's some temporal consistency between them? I understand completely that that's a huge wish item, but I have to ask for selfish reasons! 😅 Again, thank you so much for getting me this far! Your fix in the other thread on setting up the Windows environment from 3 weeks ago was insanely helpful after I failed to get this going a month ago!

Teravus commented 1 year ago

Glad it was helpful. Sorry about the cuda:1 thing. I have two GPUs and sent everything to the GPU that I wasn't already using for something.

To answer your question about coherency between batches: probably, yes. However, we'd need to write a special neighbor routine for the [neighbor_stride] frames at the end and beginning of each batch.

In the test video I'm using, I didn't see any flashes, but it's a very limited use case, with a mask that stays roughly the same for each frame. I bet if I cloned one of the sample videos, reversed it, and appended it to the original, I'd see flashes too.
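
If anyone wants to experiment, here's a purely illustrative sketch (not code from test.py) of one way to do it: overlap each batch with the previous one by neighbor_stride frames, so the first windows of a new batch can see frames the previous batch already completed:

    # Illustrative only: build per-batch index ranges that overlap the previous
    # batch by `overlap` frames, giving adjacent batches shared temporal context.
    def overlapping_batches(num_frames, framestride, overlap):
        batches = []
        start = 0
        while start < num_frames:
            end = min(start + framestride, num_frames)
            batches.append(range(max(0, start - overlap), end))
            start = end
        return batches

    # Example: 750 frames, 200-frame strides, overlapping by neighbor_stride = 5.
    for idx_range in overlapping_batches(750, 200, 5):
        print(idx_range.start, idx_range.stop)

    # The overlapping frames at the head of each batch (after the first) would be
    # fed in as neighbor/reference frames only, keeping the previous batch's
    # output for those indices instead of recomputing them.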

firebeasty commented 1 year ago

I'm probably pushing the use case of this really far. Without saying too much, I'm trying to remove tracking marks on a face that's pretty smooth with little motion. The pops between strides become pretty noticeable in areas of sharp contrast; they're almost not noticeable, but still very much there! I only have a 12 GB card, so pushing in 960x480 frames I'm lucky to get 24-frame-long strides. A pop every 5-10 seconds wouldn't be that bad, but every second is something else! 😅 Ideally I'd be able to push a little more resolution in so the skin texture can be preserved, but I'm still quite impressed for a first go at it. I'm trying to think of how to limit stride size while maintaining temporal consistency. Again, basically zero experience in this, so thanks so much for entertaining my questions!

Teravus commented 1 year ago

@firebeasty I've been playing with the major parameters and came to the conclusion that a higher neighbor_stride works better with a smaller number of frames.

If you're going to use 24-frame-long strides, try --neighbor_stride 10 instead of the default of 5. I'd also set the stride to 20 instead of 24 so it's a multiple of neighbor_stride: pass --neighbor_stride 10 on the command line and set framestride = 20 in the code.

This seems to reduce the skipping/flaring for me.
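
As a tiny convenience, here's an illustrative helper (not part of test.py) that rounds a chosen framestride down to the nearest multiple of neighbor_stride so the batching constraint always holds:

    def fit_framestride(framestride, neighbor_stride):
        # Round down to the nearest multiple of neighbor_stride, but never below it.
        return max(neighbor_stride, (framestride // neighbor_stride) * neighbor_stride)

    print(fit_framestride(24, 10))  # -> 20
    print(fit_framestride(24, 5))   # -> 20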