Netflix / vmaf

Perceptual video quality assessment based on multi-method fusion.

[#1381] Change cuMemFreeAsync to cuMemFree in vmaf_cuda_picture_free() #1382

Open ha7sh17 opened 4 months ago

ha7sh17 commented 4 months ago

This PR fixes #1381.

Description

ulimit -v 16777216;ffmpeg -i a.mp4 -i b.mp4 -filter_complex "[0:v]hwupload_cuda[main];[1:v]hwupload_cuda[ref];[main][ref]libvmaf_cuda=log_fmt=json:log_path=vmaf_log.json" -f null -
code: 2; description: CUDA_ERROR_OUT_OF_MEMORY
ffmpeg: ../src/cuda/picture_cuda.c:226: vmaf_cuda_picture_free: Assertion `0' failed.
Aborted (core dumped)

According to the API documentation, cuMemFreeAsync() does not return CUDA_ERROR_OUT_OF_MEMORY. However, in this abnormal situation, it is returning CUDA_ERROR_OUT_OF_MEMORY.

We have confirmed that using the synchronous memory free API, cuMemFree(), resolves the issue. We propose modifying the code to use cuMemFree() for freeing memory until the underlying cause of the issue with the CUDA asynchronous API is resolved.

nilfm99 commented 4 months ago

Tagging @gedoensmax

gedoensmax commented 4 months ago

When does this crash occur? During processing, or only after processing for a while? When using CUDA, FFmpeg should use the internally preallocated pictures, so this free should only be called when the context is closed, after all pictures have been processed. EDIT: Could you check whether this reproduces with the environment variable CUDA_LAUNCH_BLOCKING=1 set?

ha7sh17 commented 4 months ago

Dear @gedoensmax

When does this crash occur? During processing, or only after processing for a while? When using CUDA, FFmpeg should use the internally preallocated pictures, so this free should only be called when the context is closed, after all pictures have been processed.

The VMAF score is output correctly, but an assertion fails when ffmpeg exits.

The issue is very easy to reproduce. I can reproduce it 100% of the time by setting a virtual memory limit with ulimit -v and running the libvmaf_cuda filter in ffmpeg on a CentOS 7 server with an NVIDIA T4.

The key point for reproducing the issue is setting the virtual memory limit using ulimit -v.

ulimit -v 16777216;ffmpeg -i a.mp4 -i b.mp4 -filter_complex "[0:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[main];[1:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[ref];[main][ref]libvmaf_cuda=log_fmt=json:log_path=vmaf_log.json" -f null -
[Parsed_libvmaf_cuda_4 @ 0x5273d80] VMAF score: 99.723612
code: 2; description: CUDA_ERROR_OUT_OF_MEMORY
ffmpeg: ../src/cuda/picture_cuda.c:226: vmaf_cuda_picture_free: Assertion `0' failed.
Aborted

EDIT: Could you check whether this reproduces with the environment variable CUDA_LAUNCH_BLOCKING=1 set?

I set the environment variable as you suggested, as shown below, but the same issue still occurs.

export CUDA_LAUNCH_BLOCKING=1;ulimit -v 16777216;ffmpeg -i a.mp4 -i b.mp4 -filter_complex "[0:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[main];[1:v]hwupload_cuda,scale_npp=1920:1080:format=yuv420p[ref];[main][ref]libvmaf_cuda=log_fmt=json:log_path=vmaf_log.json" -f null -
[Parsed_libvmaf_cuda_4 @ 0x5273d80] VMAF score: 99.723612
code: 2; description: CUDA_ERROR_OUT_OF_MEMORY
ffmpeg: ../src/cuda/picture_cuda.c:226: vmaf_cuda_picture_free: Assertion `0' failed.
Aborted

gedoensmax commented 4 months ago

OK, so when the internal memory pool is used correctly, moving to the synchronous version of this API should not have any performance impact. If that is not possible, though, and pictures are allocated and freed dynamically, it will introduce a CUDA synchronization. Is there any reason the virtual memory limit is set that low? As far as I understand, the driver grows a memory pool for these async allocations, which requires some virtual addresses.
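To illustrate the virtual-address point in isolation, here is a minimal sketch in plain POSIX C with no CUDA involved. It assumes Linux, where `ulimit -v` in bash sets RLIMIT_AS in KiB units, so `ulimit -v 16777216` corresponds to a 16 GiB cap. The helper name `reserve_fails_under_limit` is hypothetical. It caps the address space of a forked child and then tries to reserve memory with mmap(), which is roughly what any library, including a GPU driver, has to do to obtain new virtual addresses.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS under strict -std modes */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: in a forked child, cap the address space at
 * limit_bytes (what `ulimit -v` does, but in bytes) and try to reserve
 * reserve_bytes of anonymous memory.
 * Returns 1 if the reservation failed, 0 if it succeeded, -1 on setup error. */
int reserve_fails_under_limit(size_t limit_bytes, size_t reserve_bytes)
{
    assert(limit_bytes > 0 && reserve_bytes > 0);

    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: the limit only affects this process, not the parent. */
        struct rlimit rl;
        rl.rlim_cur = (rlim_t)limit_bytes;
        rl.rlim_max = (rlim_t)limit_bytes;
        if (setrlimit(RLIMIT_AS, &rl) != 0)
            _exit(2);

        void *p = mmap(NULL, reserve_bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            _exit(1); /* typically ENOMEM once the cap is exceeded */

        munmap(p, reserve_bytes);
        _exit(0);
    }

    /* Parent: translate the child's exit code into the return value. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    int code = WEXITSTATUS(status);
    return code == 2 ? -1 : code;
}
```

For example, calling `reserve_fails_under_limit((size_t)256 << 20, (size_t)1 << 30)` asks for 1 GiB under a 256 MiB cap, which would be expected to fail on a typical Linux system, while a 1 MiB request under the same cap would be expected to succeed. Whether the CUDA driver fails for exactly this reason in the reported setup is not established in this thread.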

ha7sh17 commented 4 months ago

Is there any reason the virtual memory limit is set that low?

We were limiting the virtual memory before running the ffmpeg process due to an unresolved memory leak issue in ffmpeg. By doing this, only the specific ffmpeg process would terminate abnormally without affecting the entire system. If we do not limit the virtual memory, the ffmpeg process would continue to consume virtual memory, eventually causing the system to hang. (Of course, this is not a common situation but a very special case, and you can think of it as a preventive measure for such special cases.)

OK, so when the internal memory pool is used correctly, moving to the synchronous version of this API should not have any performance impact. If that is not possible, though, and pictures are allocated and freed dynamically, it will introduce a CUDA synchronization.

We understand what you mean. If you, like us, had to limit the virtual memory, what limit (in GB) would you set to ensure that ffmpeg + libvmaf_cuda operates correctly? I understand that this is not an easy question to answer. :)

gedoensmax commented 4 months ago

As you say, this is not an easy question, and I think that if your change fixes the problem, you should use it. Do you need the change to be in the main branch, though? Or maybe you could expose the switch to synchronous free through an option?
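One way the option idea above could look is sketched below. This is not libvmaf code: `DevPtr`, `FreeOps`, `FreeConfig`, `sync_free`, and `picture_buffer_free` are all hypothetical names, and the two hooks are stand-ins for thin wrappers around `cuMemFreeAsync()` and `cuMemFree()` so that the sketch builds without the CUDA toolkit. The demo hooks simply mimic the behaviour reported in this thread (async free returning error code 2, synchronous free succeeding).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for CUdeviceptr so the sketch builds without the CUDA toolkit. */
typedef unsigned long long DevPtr;

/* Hypothetical hooks; a real build would wrap cuMemFreeAsync() and
 * cuMemFree(). Both return 0 on success, nonzero error code otherwise. */
typedef struct {
    int (*free_async)(DevPtr ptr, void *stream);
    int (*free_sync)(DevPtr ptr);
} FreeOps;

/* Hypothetical user-facing option selecting the free path. */
typedef struct {
    bool sync_free; /* true: synchronous free; false: stream-ordered free */
} FreeConfig;

/* Free a device buffer through the path selected by cfg and hand the
 * backend's error code back to the caller instead of asserting. */
int picture_buffer_free(const FreeOps *ops, const FreeConfig *cfg,
                        DevPtr ptr, void *stream)
{
    assert(ops != NULL && cfg != NULL);
    assert(ops->free_async != NULL && ops->free_sync != NULL);

    return cfg->sync_free ? ops->free_sync(ptr)
                          : ops->free_async(ptr, stream);
}

/* Demo stand-ins only: mimic the failure described in this thread. */
int demo_free_async(DevPtr ptr, void *stream)
{
    (void)ptr;
    (void)stream;
    return 2; /* same numeric code as in the log above */
}

int demo_free_sync(DevPtr ptr)
{
    (void)ptr;
    return 0;
}
```

Keeping the default on the asynchronous path would preserve current behaviour for users who allocate and free pictures dynamically, while users running under a tight virtual memory limit could opt in to the synchronous path.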

ha7sh17 commented 4 months ago

As you mentioned, we can apply this fix only in our branch and it does not need to be applied to the main branch. We just wanted to inform you about this issue as dedicated users of ffmpeg + libvmaf_cuda. Therefore, it is okay to close this PR without merging. Thank you for your kind response.