Describe the bug
Random corruption of faces when processing via the queue while using an NVIDIA 4090, driver 555.85, no ECC; tested with the CUDA Sysmem Fallback Policy set to both 'Prefer Sysmem Fallback' and 'Prefer No Sysmem Fallback'.
The problem lessens with fewer threads and worsens with more (tested 1/3/11/26 threads) on a 28-thread system.
No problem seen when ALL options set to use CPU:
'Select video processing method' = Extract Frames to media
'Provider' = cpu
'Force CPU for Face Analyzer' = TRUE
NOTE: Enabling ONLY 'Force CPU for Face Analyser' is not sufficient to avoid the issue; the corruption still reproduces.
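A quick way to check whether any of the models are still being created with the CUDA provider, independent of the UI setting, is to ask onnxruntime directly. This is a minimal sketch using the standard onnxruntime Python API; the model path is taken from the log output further down purely as an illustration:

```python
# Minimal sketch: verify which execution providers onnxruntime exposes and
# which ones a given model session actually ends up using.
import onnxruntime as ort

print("Available providers:", ort.get_available_providers())
# e.g. ['CUDAExecutionProvider', 'CPUExecutionProvider'] on a CUDA build

# Illustrative model path (one of the roop .onnx models from the logs below)
model_path = r"E:\roop-unleashed\roop-unleashed\models\buffalo_l\det_10g.onnx"

session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
print("Session providers:", session.get_providers())
# If anything other than ['CPUExecutionProvider'] shows up here, that model
# is not actually pinned to the CPU despite the UI setting.
```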
To Reproduce
Steps to reproduce the behavior:
Load a faceset. Multiple facesets with 9 to 288 images each were used; no change in error production.
Load the 'Target File' queue with multiple images (in my testing always >30, most ~60, largest set ~300). No change in error production.
Verify the preview image is OK.
Open the output folder and start the queue.
The queue processes and completes, but multiple images fail, showing either total noise in the shape of the replacement mask, or what looks like a face swap rendered in the wrong perspective with a visible discontinuity at the mask boundary of the swapped area.
The general placement of the swap mask seems to be correct.
Go to 'Settings' --> 'Max. number of threads' = 11
'Image Output Format' change to jpg --> 'Apply Settings'
Problem reproduces, but some previously affected images are now fine and others are newly affected.
Return to the preview and select any image that failed (any and all behave the same).
Refresh the image: it renders correctly; download it, and the downloaded file is perfectly fine.
Go to 'Settings' --> 'Image Output Format' change to png --> 'Apply Settings'
Problem reproduces, but some previously affected images are now fine and others are newly affected.
Return to the preview and select any image that failed (any and all behave the same).
Refresh the image: it renders correctly; download it, and the downloaded file is perfectly fine.
Go to 'Settings' --> 'Image Output Format' change (back to default) to webp --> 'Apply Settings'
Problem reproduces, but some previously affected images are now fine and others are newly affected.
Return to the preview and select any image that failed (any and all behave the same).
Refresh the image: it renders correctly; download it, and the downloaded file is perfectly fine.
Go to 'Settings' --> enable 'Force CPU for Face Analyser' --> 'Apply Settings'
Go to 'Settings' --> 'Image Output Format' change to jpg --> 'Apply Settings'
Problem reproduces, but some previously affected images are now fine and others are newly affected.
Return to the preview and select any image that failed (any and all behave the same).
Refresh the image: it renders correctly; download it, and the downloaded file is perfectly fine.
Go to 'Settings' --> 'Image Output Format' change to png --> 'Apply Settings'
Problem reproduces, but some previously affected images are now fine and others are newly affected.
Return to the preview and select any image that failed (any and all behave the same).
Refresh the image: it renders correctly; download it, and the downloaded file is perfectly fine.
Go to 'Settings' --> 'Image Output Format' change (back to default) to webp --> 'Apply Settings'
Problem reproduces, but some previously affected images are now fine and others are newly affected.
Return to the preview and select any image that failed (any and all behave the same).
Refresh the image: it renders correctly; download it, and the downloaded file is perfectly fine (a quick way to quantify this difference is sketched below).
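Since the same frame is corrupted when written by the queue but fine when re-rendered in the preview and downloaded, a small sketch like the following can quantify how far apart the two outputs are and roughly where the difference sits. The file names are hypothetical placeholders and the two files are assumed to have the same resolution:

```python
# Sketch: compare a corrupted queue output against the re-rendered/downloaded
# copy of the same frame. Placeholders, not actual roop output names.
import numpy as np
from PIL import Image

queue_out = np.asarray(Image.open("queue_output_frame_17.png").convert("RGB"), dtype=np.int16)
preview_dl = np.asarray(Image.open("preview_download_frame_17.png").convert("RGB"), dtype=np.int16)

diff = np.abs(queue_out - preview_dl)
print("max per-pixel difference:", diff.max())
print("fraction of pixels differing by >10:", (diff.max(axis=-1) > 10).mean())
# On the failed frames the differing pixels should cluster inside the swap
# mask region, matching the visual "noise in the shape of the mask".
```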
Go to 'Settings' --> UNSELECT 'Force CPU for Face Analyser'
UNSELECT 'Use default Det-Size' --> 'Apply Settings'
Perform some of the tests above - same problem. This is taking a lot of time, so not all of them were repeated!
Realized that the default thread count was modified at the start ('Settings' --> 'Max. number of threads' = 11).
Test whether the default thread count also shows the issue.
Run with following config:
Intel(R) Core(TM) i9-10940X CPU @ 3.30GHz = 14 Core / 28 Thread
256 GB RAM
NVIDIA 4090 / DCH Driver = 555.85
Batch Size = 89 (179 MB)
Provider = cuda
Use Default Det-Size = TRUE
Force CPU for Face Analyser = FALSE
Max. Number of threads = X
Max. Memory to use (Gb), 0 meaning no limit = 0
Image Output Format = jpg
Run with X = 1 = One (1) Error
Run with X = 3 = Two (2) Errors
Run with X = 11 = Seventeen (17) Errors (GPU-Z 2.59.0 reports 24420MB GPU Memory Used)
Run with X = 26 = Fifty-four (54) Errors (GPU memory appears to exceed the card's maximum, so Sysmem fallback occurs as per the driver policy; see the memory-monitoring sketch after the log below)
Logging Window from run with Sysmem fallback:
Sorting videos/images
Processing image(s)
Processing: 100%|████████████████████████| 89/89 [01:27<00:00, 1.01frame/s, memory_usage=01.87GB, execution_threads=1]
Finished
Sorting videos/images
Processing image(s)
Processing: 100%|████████████████████████| 89/89 [00:48<00:00, 1.82frame/s, memory_usage=01.97GB, execution_threads=3]
Finished
Sorting videos/images
Processing image(s)
Processing: 100%|███████████████████████| 89/89 [00:25<00:00, 3.51frame/s, memory_usage=02.48GB, execution_threads=11]
Finished
Sorting videos/images
Processing image(s)
Processing: 100%|███████████████████████| 89/89 [06:44<00:00, 4.54s/frame, memory_usage=26.10GB, execution_threads=26]
Finished
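To correlate thread count with VRAM pressure independently of GPU-Z, something like the following pynvml sketch (my own tooling, not part of roop) could be left running while the queue processes; the sampling interval is arbitrary:

```python
# Sketch: sample total GPU memory use while the queue runs, to correlate
# thread count with VRAM pressure and sysmem fallback.
import time
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU (the 4090 here)
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU memory used: {info.used / 1024**2:.0f} MiB of {info.total / 1024**2:.0f} MiB")
        time.sleep(5)  # arbitrary sampling interval
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()
```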
Modify Driver Configuration to 'CUDA - Sysmem Fallback Policy' = 'Prefer No Sysmem Fallback'
Run with X = 1 = One (1) Error --> Note log says 3 threads ran anyway
Run with X = 3 = Three (3) Errors
Run with X = 11 = Thirty-six (36) Errors (GPU-Z 2.59.0 reports 24421MB GPU Memory Used); note the 25-minute run time in the log below
Run with X = 26 =
2024-06-08 20:46:02.1795191 [E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/blocks.3/conv/Conv' Status Message: D:\a_work\1\s\onnxruntime\core\framework\bfc_arena.cc:376 onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 16777216
(This makes perfect sense: with no sysmem fallback, the allocation simply fails once VRAM is exhausted.)
Logging Window from run with no Sysmem fallback:
Processing image(s)
Processing: 100%|████████████████████████| 89/89 [00:40<00:00, 2.22frame/s, memory_usage=02.90GB, execution_threads=3]
Finished
Sorting videos/images
Processing image(s)
Processing: 100%|████████████████████████| 89/89 [00:39<00:00, 2.24frame/s, memory_usage=02.88GB, execution_threads=3]
Finished
Sorting videos/images
Processing image(s)
Processing: 100%|███████████████████████| 89/89 [24:34<00:00, 16.57s/frame, memory_usage=07.53GB, execution_threads=11]
Finished
2024-06-08 20:48:00.6939531 [E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/blocks.3/conv/Conv' Status Message: D:\a_work\1\s\onnxruntime\core\framework\bfc_ERROR:asyncio:Exception in callback _ProactorBasePipeTransport._call_connection_lost(None)
handle: <Handle _ProactorBasePipeTransport._call_connection_lost(None)>
Traceback (most recent call last):
File "E:\roop-unleashed\installer_files\env\lib\asyncio\events.py", line 80, in _run
self._context.run(self._callback, *self._args)
File "E:\roop-unleashed\installer_files\env\lib\asyncio\proactor_events.py", line 165, in _call_connection_lost
self._sock.shutdown(socket.SHUT_RDWR)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
arena.cc:376 onnxruntime::BFCArena::AllocateRawInternal Failed to allocate memory for requested buffer of size 16777216
(and other such errors, etc)
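For reference, onnxruntime's CUDA provider can be given an explicit per-session memory cap, which is where the BFCArena allocation in the error above lives. This is only an illustrative sketch of that API (the 4 GB limit and model path are arbitrary), not something roop currently exposes in its settings as far as I can tell:

```python
# Sketch: cap the CUDA provider's arena so allocations fail early (or stay
# inside VRAM) instead of growing until cudaMalloc fails.
import onnxruntime as ort

cuda_options = {
    "device_id": 0,
    "gpu_mem_limit": 4 * 1024 * 1024 * 1024,   # bytes; arbitrary illustrative cap
    "arena_extend_strategy": "kSameAsRequested",
}
session = ort.InferenceSession(
    r"E:\roop-unleashed\roop-unleashed\models\buffalo_l\det_10g.onnx",  # illustrative model
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
print(session.get_providers())
```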
OK, now a bit confused.
Does this also happen with the CPU provider?
Let's run again with multiple threads and pick CPU:
'Settings' --> Provider = 'cpu'
'Use default Det-Size' = TRUE
'Force CPU for Face Analyzer' = TRUE
Max. Number of Threads = 26
Max. Memory to use (Gb) = 0
Image Output Format = jpg --> 'Apply Settings'
Run with 26 Threads = FAIL
Log window:
Sorting videos/images
Processing image(s)
Processing: 0%| | 0/89 [00:00<?, ?frame/s]Forcing CPU for Face Analysis
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\det_10g.onnx detection [1, 3, '?', '?'] 127.5 128.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\w600k_r50.onnx recognition ['None', 3, 112, 112] 127.5 127.5
set det-size: (640, 640)
2024-06-08 20:56:29.7353060 [E:onnxruntime:, sequential_executor.cc:514 onnxruntime::ExecuteKernel] Non-zero status code returned while running Conv node. Name:'/fuse_convs_dict.64/encode_enc/conv1/Conv' Status Message: D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:121 onnxruntime::CudaCall D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_call.cc:114 onnxruntime::CudaCall CUDA failure 2: out of memory ; GPU=0 ; hostname=AILINE ; file=D:\a_work\1\s\onnxruntime\core\providers\cuda\cuda_allocator.cc ; line=47 ; expr=cudaMalloc((void**)&p, size);
(Unclear why CUDA is being called at all when the provider is set to cpu.)
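One way to confirm whether the run is genuinely CPU-only would be to list which processes hold GPU memory while it executes; the sketch below uses pynvml for that (my own tooling, not part of roop). My guess is that some model other than the face analyser, possibly a post-processing one, is still created with the CUDA provider, but that is only a guess:

```python
# Sketch: list processes currently holding GPU memory. If the roop process
# appears here during a provider=cpu run, something is still using CUDA.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
    # On Windows/WDDM, per-process memory may be reported as None.
    used_mb = (proc.usedGpuMemory or 0) / 1024**2
    print(f"pid {proc.pid}: {used_mb:.0f} MiB on the GPU")
pynvml.nvmlShutdown()
```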
Changed 'Select video processing method' = 'Extract Frames to media'
Run with 26 Threads = 2 errors, but not catastrophic (airbrush to fix?)
Log window:
Sorting videos/images
Processing image(s)
Processing: 0%| | 0/89 [00:00<?, ?frame/s]Forcing CPU for Face Analysis
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\1k3d68.onnx landmark_3d_68 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\2d106det.onnx landmark_2d_106 ['None', 3, 192, 192] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\det_10g.onnx detection [1, 3, '?', '?'] 127.5 128.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\genderage.onnx genderage ['None', 3, 96, 96] 0.0 1.0
Applied providers: ['CPUExecutionProvider'], with options: {'CPUExecutionProvider': {}}
find model: E:\roop-unleashed\roop-unleashed\models\buffalo_l\w600k_r50.onnx recognition ['None', 3, 112, 112] 127.5 127.5
set det-size: (640, 640)
Processing: 100%|███████████████████████| 89/89 [01:43<00:00, 1.16s/frame, memory_usage=24.40GB, execution_threads=26]
Finished
It looks like nailing absolutely everything to the CPU makes it work quite well.
But processing the queue via the GPU is a challenge.
Details
What OS are you using? Windows
Are you using a GPU? Yes, NVIDIA RTX 4090 (driver 555.85)
Which version of roop unleashed are you using? 4.0.0
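For completeness, the environment details above could be gathered with a small snippet like this (a sketch using the standard platform and onnxruntime modules):

```python
# Sketch: collect the environment details requested by the bug template.
import platform
import onnxruntime as ort

print("OS:", platform.platform())
print("onnxruntime:", ort.__version__)
print("providers:", ort.get_available_providers())
```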
Screenshots