vkQueueSubmit failed on most images larger than 512px

chaiNNer-org / chaiNNer

A node-based image processing GUI aimed at making chaining image processing tasks easy and customizable. Born as an AI upscaling application, chaiNNer has grown into an extremely flexible and powerful programmatic image processing application.

https://chaiNNer.app

GNU General Public License v3.0

vkQueueSubmit failed on most images larger than 512px #913

Open Jason-Bloomer opened 2 years ago

Jason-Bloomer commented 2 years ago

Information:

Chainner version: 12.2 Alpha
OS: Windows 10

Description:

Errors occur when processing images with iteration, or when processing a single image by itself. This usually only happens when the image is larger than 512 pixels in at least one dimension. If I downscale the images first so that they are never larger than 512px in any direction, I can process all of them without problems. Sorry if this duplicates an issue reported elsewhere. I looked and saw several people reporting similar problems, but in slightly different contexts, and mostly on macOS. This is occurring for me on Windows 10.

An error occurred in a Image File Iterator node:

Errors occurred during iteration: 
• An unexpected error occurred during NCNN processing.
• vkQueueSubmit failed

Logs: renderer.log, main.log

joeyballentine commented 2 years ago

This seems to happen on some AMD GPUs when you get an out of memory error. The way auto tiling works is by detecting these out of memory errors, tiling the image, and trying again. However, on these GPUs, going out of memory crashes the graphics driver and therefore any subsequent attempt by chaiNNer to upscale results in an error.
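As a rough illustration of the retry idea described above (not chaiNNer's actual code), the following sketch tries the whole image first and splits it into a finer grid each time the backend reports an out-of-memory error. The `OutOfMemory` exception, the `upscale_tile` callback, and the 512px limit in `fake_backend` are all hypothetical stand-ins.

```python
import math


class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for the backend's out-of-memory error."""


def upscale_with_auto_tiling(width, height, upscale_tile, max_splits=6):
    """Try the whole image first; on out-of-memory, split finer and retry.

    `upscale_tile(w, h)` is a hypothetical callback that processes one tile
    and raises OutOfMemory when the tile does not fit in VRAM.
    Returns the (tiles_x, tiles_y) grid that finally succeeded.
    """
    for split in range(max_splits + 1):
        n = 2 ** split                      # 1x1, 2x2, 4x4, ... grid
        tile_w = math.ceil(width / n)
        tile_h = math.ceil(height / n)
        try:
            for _ in range(n * n):
                upscale_tile(tile_w, tile_h)
            return n, n
        except OutOfMemory:
            continue                        # tiles still too large: split further
    raise RuntimeError("could not fit even the smallest tiles in memory")


def fake_backend(w, h):
    # Simulated backend that cannot handle tiles larger than 512px per side.
    if w > 512 or h > 512:
        raise OutOfMemory(f"{w}x{h} tile does not fit")


print(upscale_with_auto_tiling(2000, 1200, fake_backend))  # (4, 4)
```

This only works when running out of memory surfaces as an error the program can catch. If the failure takes the graphics driver down with it, every retry after the first failure fails as well.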

I'm not sure how to resolve this for auto tiling, but you can for sure work around it by selecting a larger number of tiles (so that each tile is no more than 512px on a side).
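To estimate how many tiles that takes, a small calculation helps. This is only a sketch: the `tiles_needed` helper is hypothetical, and it assumes the image is split into an even grid, which may not match how chaiNNer's tile setting divides the image.

```python
import math


def tiles_needed(width, height, max_tile=512):
    """Return (columns, rows, total) so that no tile exceeds max_tile per side."""
    cols = math.ceil(width / max_tile)
    rows = math.ceil(height / max_tile)
    return cols, rows, cols * rows


print(tiles_needed(1920, 1080))  # (4, 3, 12)
```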

Jason-Bloomer commented 2 years ago

I do not have an AMD GPU; I am running an Nvidia RTX 3060. If I select any option other than Auto, the program errors out on those same images with the same error. I've tried every available value and none of them seem to make a difference: none of those images get processed.

EDIT: As a side note, I have a pre-silicon-nerf 3060 with 12GB VRAM, not the 8GB the newer Ti's ship with.

My GPU memory (according to task manager) doesn't seem to budge when running this program. GPU usage peaks at around 45% and never goes beyond that. My machine has plenty of power to spare, so I don't understand why this would be an out-of-memory issue, unless it's limited to 2GB? When I use NCNN with the exact same model through a .bat file, I'm able to set the tile size in pixels. I usually have it set to 32 (the minimum), and it works fine with all the same images in question. It doesn't necessarily eat up any additional VRAM, but the GPU usage is a bit higher.

Here it is during the operation (this was consistent): [image: Capture1]

And after the operation has concluded: [image: Capture2]

joeyballentine commented 2 years ago

A 32 px tile size would mean it's processing the image in chunks of just 32x32, which is really small, hence the low VRAM usage. I'm planning on implementing a more traditional tile size algorithm; I've just been procrastinating on it. The current system re-uses my auto-tiling code.
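For a sense of scale, here is a minimal sketch of a fixed-tile-size split. The `tile_boxes` helper is hypothetical, and real tilers usually add overlap or padding between tiles to hide seams, which this leaves out.

```python
def tile_boxes(width, height, tile_size):
    """Yield (x0, y0, x1, y1) boxes covering the image; edge tiles are clipped."""
    for y0 in range(0, height, tile_size):
        for x0 in range(0, width, tile_size):
            yield x0, y0, min(x0 + tile_size, width), min(y0 + tile_size, height)


for size in (32, 512):
    count = sum(1 for _ in tile_boxes(1000, 1000, size))
    print(f"tile size {size}: {count} tiles")
# tile size 32: 1024 tiles
# tile size 512: 4 tiles
```

A 1000x1000 image needs 1024 separate passes at 32 px versus 4 at 512 px, which is why the small setting keeps VRAM usage so low.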

Anyway, this is the first time I've seen this error with an Nvidia card. Is there any reason you aren't using PyTorch btw?

> My GPU memory (according to task manager) doesn't seem to budge when running this program.

With 12GB of VRAM you should be able to easily upscale at the very least a 1000x image without even needing tiling. What model are you even using?

joeyballentine commented 2 years ago

Hold up: Do you have an integrated graphics card? If you do, I think NCNN is trying to use that instead of your Nvidia card, which would explain both the error and the fact that it can't go past 512x.

Jason-Bloomer commented 2 years ago

> Is there any reason you aren't using PyTorch btw?

I'm attempting to use some NCNN models from other projects, whose results I like. I've tried quite a few, but they all seem to have the same problem. RealESRGAN-x4plus is the model I'm currently using. There are no options to convert NCNN to anything else, so there it stays.

I may have found part of the issue. I still had CUDA Toolkit 10.2 installed and had never updated it, since it still worked for the projects I've been using. I updated to CUDA Toolkit 11.6 and tried to reinstall PyTorch to get it to recompile, both in Anaconda on my system and in chaiNNer itself, since it looks like it uses its own integrated version. However it's still reporting cu113, when it should be built for cu116.

The GPU usage fluctuated a lot more but the same errors still occur on the same images.

I'm sure my system probably has IGFX, I'll try and rerun a queue and monitor its performance to see if it's doing anything.

joeyballentine commented 2 years ago

> RealESRGAN-x4plus

Here are PyTorch versions of all the RealESRGAN models

> However it's still reporting cu113

That's just the version of CUDA that comes with PyTorch. It doesn't use the CUDA toolkit you have installed on your system. Why? Ask the PyTorch team, but it's the reason PyTorch is 3GB.

> reinstall PyTorch to get it to recompile

Installing PyTorch from pip does not compile it. You are downloading a pre-built wheel file.
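One way to confirm which CUDA runtime the installed wheel was built with is to ask PyTorch itself. The snippet below is a small sketch; the `bundled_cuda_info` helper name is made up for illustration, and it falls back gracefully when PyTorch is not installed.

```python
def bundled_cuda_info():
    """Report the CUDA runtime bundled with the installed PyTorch wheel, if any."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None, "cuda": None, "gpu_available": False}
    return {
        "torch": torch.__version__,                  # version string of the wheel
        "cuda": torch.version.cuda,                  # None for CPU-only builds
        "gpu_available": torch.cuda.is_available(),  # True if a CUDA device is usable
    }


print(bundled_cuda_info())
```

The reported CUDA version comes from the wheel, so installing a different CUDA Toolkit on the system does not change it.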

Also, CUDA & PyTorch have nothing to do with NCNN. NCNN is entirely separate and uses Vulkan for processing.

If your system does have integrated graphics, that would explain the problem. I'm currently working on adding a GPU selector to the settings, which will hopefully let you pick your Nvidia GPU for NCNN and get rid of this issue. But since you're just using RealESRGAN, you can also just use the PyTorch version, and PyTorch will definitely use your GPU.

joeyballentine commented 1 year ago

@Jason-Bloomer would you mind testing this build and seeing if the error still happens? I changed a small thing about how NCNN allocates stuff. It might fix the issue but I kinda doubt it. https://cdn.discordapp.com/attachments/930865463318179952/1017897945271631922/chaiNNer-win32-x64-0.12.3.zip

Jason-Bloomer commented 1 year ago

The error still occurs with both versions: the 0.12.3 build you posted above and the current 0.12.4, which has GPU selection. My Nvidia GPU shows as the only selectable option (GPU 0), but the images still do not process, and I still get the same long laundry list of vkQueueSubmit failures.

It's really not a huge deal for me, as I can still use the models themselves through the batch command normally as I always have.

And to be honest, I have no idea how most of this stuff works under the hood, though I'm trying to learn. I was more or less hoping it was user error and something stupid I had done or misconfigured on my end that would result in an easy fix.

nihui commented 1 year ago

https://github.com/HomeOfVapourSynthEvolution/VapourSynth-RIFE-ncnn-Vulkan/issues/2
https://github.com/xinntao/Real-ESRGAN/issues/106

See the links above for a workaround for TDR (Timeout Detection and Recovery) on Windows.

@Jason-Bloomer @joeyballentine
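The linked issues are not reproduced here, so the following is only a sketch under an assumption: that the workaround involves the Windows TDR timeout. Windows stores that timeout in the `TdrDelay` registry value under `HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers`, and uses a 2-second default when the value is absent. The hypothetical `read_tdr_delay` helper below only reads the current setting and changes nothing.

```python
def read_tdr_delay():
    """Return the TdrDelay registry value in seconds, or None if unset or not on Windows."""
    try:
        import winreg
    except ImportError:
        return None  # not running on Windows
    key_path = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, key_path) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return int(value)
    except OSError:
        return None  # value not set: Windows falls back to its default


print(read_tdr_delay())
```

A result of `None` on Windows means the default timeout applies, so a single long GPU submission can be interrupted by the driver reset. Changing the value requires administrator rights and a reboot.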

joeyballentine commented 1 year ago

@nihui thanks for the suggestion!

@Jason-Bloomer please let me know if that fix works

joeyballentine commented 1 year ago

@Jason-Bloomer Is this still an issue? I think with the estimation it should be erroring far less, if at all now.

Jason-Bloomer commented 1 year ago

Sorry for the delay, but I haven't really had time to mess with this recently. I just updated to the most recent version (0.15.3) and, while I am no longer getting the "vkQueueSubmit" errors, I am still getting an error:


An error occurred in a Image File Iterator node:

Errors occurred during iteration: 
• A critical error has occurred. You may need to restart chaiNNer in order for NCNN upscaling to start working again.

It now seems to produce appropriately-sized but all-black images for the images it was previously erroring on.

joeyballentine commented 1 year ago

Sorry for the late reply.

I don't think there's really anything else I can do about this. I tried my best to work around NCNN's issues, but it seems to just hate some people's systems for some reason.

joeyballentine commented 1 year ago

@SpaceMageWhatever

> is this not fixed yet? i used to be able to upscale everything, then everything just, randomly broke for no reason, sometimes i can get things to upscale but the images have random black squares, most of the time it just, randomly fails, its super frustrating as it used to work fine

I can't do anything to fix this. It's an inherent problem with ncnn, Vulkan, and your GPU. If using the smallest possible tile size doesn't do it, then you're just out of luck.