ArtVentureX / sd-webui-agent-scheduler


RuntimeError: CUDA error #126

Closed GitwithDX closed 10 months ago

GitwithDX commented 1 year ago

The process always gets interrupted with this error: "RuntimeError: CUDA error: misaligned address. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions."

My device is an NVIDIA GeForce RTX 3080 Ti, and while it's running I don't really see anything wrong here:

ram: free:12.89 used:3.01 total:15.9
gpu: free:7.5 used:4.5 total:12.0
gpu-active: current:2.14 peak:3.36
gpu-allocated: current:2.14 peak:3.36
gpu-reserved: current:2.17 peak:5.08
gpu-inactive: current:0.03 peak:1.0
events: retries:0 oom:0
utilization: 0

I'm not sure what the issue is. I never have any other apps running in the background. Please let me know what I can do to stop this from happening. Thank you!
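
In the meantime, I'll try enabling synchronous kernel launches as the error message suggests, so the failing call gets reported at the right place. A minimal sketch, assuming the variable is set before torch initializes CUDA (for the webui it would normally be set in the shell or webui-user.bat before launch):

```python
# Sketch: force synchronous CUDA launches so the kernel that actually faults is
# reported at its own call site instead of a later API call.
# Must be set before torch initializes CUDA.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
print(torch.cuda.get_device_name(0))  # sanity check that CUDA still initializes
```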

pvfreis commented 12 months ago

I also have this issue; I figured it might be an issue with my install or something, but it happened on a clean install on a new PC too. The same generation settings work fine without agent scheduler.

paulm commented 11 months ago

I run into this often and randomly. My task list won't run beyond a couple of hours without stopping due to this error.

artventuredev commented 11 months ago

It's unlikely the error is from the extension. Maybe related to this: https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/9954.

Dreaming-Wide-Awake commented 11 months ago

I get this same error. RTX 3080 Ti laptop, 16GB VRAM. When the error happens, VRAM is NOT maxed out. I have an Alienware x17 R2. I do NOT use torch, as it didn't work with the laptop at the time I tried it and I had to clean-install A1111 to fix it. I use --opt-sdp-attention instead. I don't use MSI Afterburner, so that fix in #9954 didn't help. Any suggested fixes?

slonce70 commented 11 months ago

I have this problem too, but only with the scheduler.

HQJ00076 commented 11 months ago

Same issue occurs here. I have never had this problem when not using this extension. Also, I don't use MSI Afterburner.

artventuredev commented 11 months ago

Could you please specify the types of tasks you were executing when the issue occurred? I'll attempt to recreate a similar queue to see if I can reproduce the problem.

pvfreis commented 11 months ago

It usually happens to me in the middle of a high-batch-count txt2img task (1x50–100). The issue generally appears after a few generations. I also used to queue more than one task like that at once.

It seems to be some sort of issue with clearing memory in between batched generations? idk

I'm using an RTX 3060 with 12GB VRAM

Here's an example of a task I usually do, and I'll try to reproduce the issue again to get the full error (and more reproducible info):

Batch count:50, Batch size: 1 Steps: 28, Sampler: DPM++ 2M Karras, CFG scale: 7, Seed: 2087503294, Size: 512x768, Model hash: 1bab7a0895, Model: kizukiV3, VAE hash: df3c506e51, VAE: kizukiV3.vae.pt, Denoising strength: 0.45, Clip skip: 2, Hires upscale: 2.2, Hires steps: 26, Hires upscaler: 4x-UltraSharp, Lora hashes: "yamatowanpi3_64dim-5e-5: 4a5a68014e8b, shuicolor_v1: 2031bfec9abb", Version: v1.6.0-RC-12-g72ee347e

pvfreis commented 11 months ago

Got one, it happened around the 30th image on a 1x100 batch:


Exception in thread MemMon:███████████▋                                     | 1730/5600 [29:48<1:38:43,  1.53s/it] 
Traceback (most recent call last):
  File "C:\Users\paulo\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner  
    self.run()
  File "E:\Programming\stable-diffusion-webui\modules\memmon.py", line 53, in run
    free, total = self.cuda_mem_get_info()
  File "E:\Programming\stable-diffusion-webui\modules\memmon.py", line 34, in cuda_mem_get_info
    return torch.cuda.mem_get_info(index)
  File "E:\Programming\stable-diffusion-webui\venv\lib\site-packages\torch\cuda\memory.py", line 618, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

 75%|█████████████████████████████████████████████████████████                   | 21/28 [00:34<00:11,  1.65s/it]

*** Error completing request
*** Arguments: ('task(7mve3l4s0k39mka)', ' <lora:Shidare_Hotaru:0.7> HtrShdr-KJ,  black skirt, hair ornament, bow, shirt, suspender skirt, hair ribbon, high-waist skirt, hair flower, hairband, light smile, __pose__ ,  detailed __bg1__ background, __time__ ,   (best quality, absurdres, highly detailed, intricate detail, masterpiece:1.2), <lora:shuicolor_v1:0.2>  (realistic:0.5) ', '(loli, chibi, young, futa, trans:1.4) 
(multiple views, monochrome:1.4),  (jpeg artifacts:1.4), (worst quality, low quality:1.4),  (sketch, patreon logo, watermark, comic:1.2), bad-hands-5, simple background, nude, nsfw, cg, realistic, 3d', [], 28, 'DPM++ 2M Karras', 100, 1, 7, 768, 512, True, 0.5, 2.2, '4x-UltraSharp', 0, 0, 0, 'Use same checkpoint', 'Use same sampler', '', '', ['VAE: kizukiV3.vae.pt', 'Clip skip: 2', 'Model hash: kizukiV3.safetensors [1bab7a0895]'], <agent_scheduler.task_runner.FakeRequest object at 0x000002C5E60E5660>, 0, False, '', 0.8, -1, False, -1, 0, 0, 0, 0, 4, 512, 512, True, 'None', 'None', 0, False, {'ad_model': 'face_yolov8n.pt', 'ad_prompt': '', 'ad_negative_prompt': '', 
'ad_confidence': 0.3, 'ad_mask_k_largest': 0, 'ad_mask_min_ratio': 0, 'ad_mask_max_ratio': 1, 'ad_x_offset': 0, 'ad_y_offset': 0, 'ad_dilate_erode': 4, 'ad_mask_merge_invert': 'None', 'ad_mask_blur': 4, 'ad_denoising_strength': 
0.4, 'ad_inpaint_only_masked': True, 'ad_inpaint_only_masked_padding': 32, 'ad_use_inpaint_width_height': False, 'ad_inpaint_width': 512, 'ad_inpaint_height': 512, 'ad_use_steps': False, 'ad_steps': 28, 'ad_use_cfg_scale': False, 'ad_cfg_scale': 7, 'ad_use_checkpoint': False, 'ad_checkpoint': 'Use same checkpoint', 'ad_use_vae': False, 'ad_vae': 'Use same VAE', 'ad_use_sampler': False, 'ad_sampler': 'Euler a', 'ad_use_noise_multiplier': False, 'ad_noise_multiplier': 1, 'ad_use_clip_skip': False, 'ad_clip_skip': 1, 'ad_restore_face': False, 'ad_controlnet_model': 'None', 'ad_controlnet_module': 'inpaint_global_harmonious', 'ad_controlnet_weight': 1, 'ad_controlnet_guidance_start': 0, 'ad_controlnet_guidance_end': 1, 'is_api': ()}, {'ad_model': 'None', 'ad_prompt': '', 'ad_negative_prompt': '', 'ad_confidence': 0.3, 'ad_mask_k_largest': 0, 'ad_mask_min_ratio': 0, 'ad_mask_max_ratio': 1, 'ad_x_offset': 0, 'ad_y_offset': 0, 'ad_dilate_erode': 4, 'ad_mask_merge_invert': 'None', 'ad_mask_blur': 4, 'ad_denoising_strength': 0.4, 'ad_inpaint_only_masked': True, 'ad_inpaint_only_masked_padding': 32, 'ad_use_inpaint_width_height': False, 'ad_inpaint_width': 512, 'ad_inpaint_height': 512, 'ad_use_steps': False, 'ad_steps': 28, 'ad_use_cfg_scale': False, 'ad_cfg_scale': 7, 'ad_use_checkpoint': False, 'ad_checkpoint': 'Use same checkpoint', 'ad_use_vae': False, 'ad_vae': 'Use same VAE', 'ad_use_sampler': False, 'ad_sampler': 'Euler a', 'ad_use_noise_multiplier': False, 'ad_noise_multiplier': 1, 'ad_use_clip_skip': False, 'ad_clip_skip': 1, 'ad_restore_face': False, 'ad_controlnet_model': 'None', 'ad_controlnet_module': 'inpaint_global_harmonious', 'ad_controlnet_weight': 1, 'ad_controlnet_guidance_start': 0, 'ad_controlnet_guidance_end': 1, 'is_api': ()}, True, False, 1, False, False, False, 1.1, 1.5, 100, 
0.7, False, False, True, False, False, 0, 'Gustavosta/MagicPrompt-Stable-Diffusion', '', True, 'keyword prompt', 'keyword1, keyword2', 'None', 'textual inversion first', 'None', '0.7', 'None', <scripts.animatediff_ui.AnimateDiffProcess object at 0x000002C5E6153C10>, {'is_cnet': True, 'enabled': False, 'module': 'none', 'model': 'None', 'weight': 1, 'image': None, 'resize_mode': 'Crop and Resize', 'low_vram': False, 'processor_res': 512, 'threshold_a': 
64, 'threshold_b': 64, 'guidance_start': 0, 'guidance_end': 1, 'pixel_perfect': False, 'control_mode': 'Balanced', 'is_ui': True, 'input_mode': 'simple', 'batch_images': '', 'output_dir': '', 'loopback': False}, {'is_cnet': True, 'enabled': False, 'module': 'none', 'model': 'None', 'weight': 1, 'image': None, 'resize_mode': 'Crop and Resize', 'low_vram': False, 'processor_res': 512, 'threshold_a': 64, 'threshold_b': 64, 'guidance_start': 0, 'guidance_end': 1, 'pixel_perfect': False, 'control_mode': 'Balanced', 'is_ui': True, 'input_mode': 'simple', 'batch_images': '', 'output_dir': '', 'loopback': False}, {'is_cnet': True, 'enabled': False, 'module': 'none', 'model': 'None', 
'weight': 1, 'image': None, 'resize_mode': 'Crop and Resize', 'low_vram': False, 'processor_res': 512, 'threshold_a': 64, 'threshold_b': 64, 'guidance_start': 0, 'guidance_end': 1, 'pixel_perfect': False, 'control_mode': 'Balanced', 'is_ui': True, 'input_mode': 'simple', 'batch_images': '', 'output_dir': '', 'loopback': False}, False, False, 'Matrix', 'Columns', 'Mask', 'Prompt', '1,1', '0.2', False, False, False, 'Attention', False, '0', '0', '0.4', None, '0', '0', False, False, False, 0, None, [], 0, False, [], [], False, 0, 1, False, False, 0, None, [], -2, False, [], False, 0, None, None, False, False, 'positive', 'comma', 0, False, False, '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0, False, None, None, False, None, None, False, None, None, False, 50, [], 30, '', 4, [], 1, '', '', '', '') {}
    Traceback (most recent call last):
      File "E:\Programming\stable-diffusion-webui\modules\call_queue.py", line 57, in f
        res = list(func(*args, **kwargs))
      File "E:\Programming\stable-diffusion-webui\modules\txt2img.py", line 55, in txt2img
        processed = processing.process_images(p)
      File "E:\Programming\stable-diffusion-webui\modules\processing.py", line 732, in process_images
        res = process_images_inner(p)
      File "E:\Programming\stable-diffusion-webui\extensions\sd-webui-controlnet\scripts\batch_hijack.py", line 42, in processing_process_images_hijack
        return getattr(processing, '__controlnet_original_process_images_inner')(p, *args, **kwargs)
      File "E:\Programming\stable-diffusion-webui\modules\processing.py", line 867, in process_images_inner       
        samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
      File "E:\Programming\stable-diffusion-webui\modules\processing.py", line 1156, in sample
        return self.sample_hr_pass(samples, decoded_samples, seeds, subseeds, subseed_strength, prompts)
      File "E:\Programming\stable-diffusion-webui\modules\processing.py", line 1242, in sample_hr_pass
        samples = self.sampler.sample_img2img(self, samples, noise, self.hr_c, self.hr_uc, steps=self.hr_second_pass_steps or self.steps, image_conditioning=image_conditioning)
      File "E:\Programming\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 188, in sample_img2img 
        samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "E:\Programming\stable-diffusion-webui\modules\sd_samplers_common.py", line 261, in launch_sampling    
        return func()
      File "E:\Programming\stable-diffusion-webui\modules\sd_samplers_kdiffusion.py", line 188, in <lambda>       
        samples = self.launch_sampling(t_enc + 1, lambda: self.func(self.model_wrap_cfg, xi, extra_args=self.sampler_extra_args, disable=False, callback=self.callback_state, **extra_params_kwargs))
      File "E:\Programming\stable-diffusion-webui\venv\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
        return func(*args, **kwargs)
      File "E:\Programming\stable-diffusion-webui\repositories\k-diffusion\k_diffusion\sampling.py", line 605, in 
sample_dpmpp_2m
        x = (sigma_fn(t_next) / sigma_fn(t)) * x - (-h).expm1() * denoised_d
    RuntimeError: CUDA error: misaligned address
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

---
Exception in thread Thread-33 (execute_task):
Traceback (most recent call last):
  File "C:\Users\paulo\AppData\Local\Programs\Python\Python310\lib\threading.py", line 1016, in _bootstrap_inner  
    self.run()
  File "C:\Users\paulo\AppData\Local\Programs\Python\Python310\lib\threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "E:\Programming\stable-diffusion-webui\extensions\sd-webui-agent-scheduler\agent_scheduler\task_runner.py", line 344, in execute_task
    res = self.__execute_task(task_id, is_img2img, task_args)
  File "E:\Programming\stable-diffusion-webui\extensions\sd-webui-agent-scheduler\agent_scheduler\task_runner.py", line 434, in __execute_task
    return self.__execute_ui_task(task_id, is_img2img, *ui_args)
  File "E:\Programming\stable-diffusion-webui\extensions\sd-webui-agent-scheduler\agent_scheduler\task_runner.py", line 468, in __execute_ui_task
    shared.state.end()
  File "E:\Programming\stable-diffusion-webui\modules\shared_state.py", line 128, in end
    devices.torch_gc()
  File "E:\Programming\stable-diffusion-webui\modules\devices.py", line 51, in torch_gc
    torch.cuda.empty_cache()
  File "E:\Programming\stable-diffusion-webui\venv\lib\site-packages\torch\cuda\memory.py", line 133, in empty_cache
    torch._C._cuda_emptyCache()
RuntimeError: CUDA error: misaligned address
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
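
For what it's worth, the cleanup that fails at the end of that second trace (shared.state.end() calling devices.torch_gc()) boils down to roughly the following. This is a rough sketch from memory rather than the exact webui source, but it shows why the error surfaces between queued tasks: once the CUDA context is corrupted by the misaligned access, even the cache cleanup call fails.

```python
import gc
import torch

def torch_gc():
    """Rough sketch of the between-task cleanup in modules/devices.py."""
    gc.collect()                      # drop dangling Python references first
    if torch.cuda.is_available():
        torch.cuda.empty_cache()      # return cached allocator blocks to the driver
        torch.cuda.ipc_collect()      # release CUDA IPC memory handles
```
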
pvfreis commented 11 months ago

I'm checking right now to see if running many identical tasks with a batch count of 25 each might help work around the issue. If that's the case, it would be nice to have an option to easily enqueue multiple copies of the same task at once.

artventuredev commented 11 months ago

Last night, I queued a task with batch_count 100, 768 x 768, and hires fix 1.2 (my 2060 struggles with larger sizes), and it completed without any issues. I'll attempt a few more tasks to see if I can replicate the problem. Did reducing the batch_count to 25 provide any relief for you?

pvfreis commented 11 months ago

Unfortunately, it did not. It ran into the same misaligned address error after a couple of tasks. 😔

I saw you mentioned in another issue that there might be some problem with xformers (although I'm not sure if they're related). Should I try sdp? (But I admit I'll really hate it if that's the solution, because it increases the inference time for me by about 25% per image.)

Will also try later with some extensions disabled (even if I'm not actively using them) to see if that might interfere somehow. Will let you know both results.

HQJ00076 commented 11 months ago

Generation stopped several times a day due to this problem, but since I updated CUDA from 11.8 to 12.2 using the instructions below, it has not occurred once in 3 days. I'll wait and see. Cuda 12.2 New Libs

pvfreis commented 11 months ago

> Generation stopped several times a day due to this problem, but since I updated CUDA from 11.8 to 12.2 using the instructions below, it has not occurred once in 3 days. I'll wait and see. Cuda 12.2 New Libs

I updated to 12.2 following the instructions above and ran 4 batches of 100 with no issue (I never got even close to that before). Still testing, but I think that fixes the issue.
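
For anyone else trying this, here's a quick way to confirm which CUDA runtime and cuDNN build the venv's torch is actually using (a small sketch; run it with the webui venv's Python):

```python
# Print the CUDA / cuDNN versions bundled with the installed torch build.
# Run with the webui venv's interpreter, e.g. venv\Scripts\python.exe on Windows.
import torch

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```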

artventuredev commented 10 months ago

Since it's resolved, I'll close the issue then.

dm18 commented 9 months ago

I use the agent scheduler API to submit large batches of prompts. I will queue up several thousand prompts and let automatic1111 run for 8 or 17+ hours straight.

My experience on CUDA 11.3, with agent scheduler:

My experience on CUDA 12.2, with agent scheduler:

After experiencing more issues with CUDA 12.2, I've downgraded back to CUDA 11.3, and I'm again able to generate for 8+ hours without issue. This is with the Web UI closed.
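
For reference, the way I queue those prompts through the API is roughly the following. This is only a sketch: I'm assuming the webui is running locally with --api, and the queue route shown here may differ from the extension's actual one, so check the extension's API docs for the exact path and payload.

```python
# Sketch: queue many txt2img prompts via the agent scheduler HTTP API.
# Assumptions (verify against the extension's API docs): the endpoint path below,
# and a payload shaped like the standard /sdapi/v1/txt2img request body.
import requests

WEBUI = "http://127.0.0.1:7860"
QUEUE_ENDPOINT = f"{WEBUI}/agent-scheduler/v1/queue/txt2img"  # assumed route

prompts = ["a lighthouse at dusk", "a snowy mountain village"]  # placeholder prompts

for prompt in prompts:
    payload = {
        "prompt": prompt,
        "negative_prompt": "lowres, blurry",
        "steps": 28,
        "sampler_name": "DPM++ 2M Karras",
        "width": 512,
        "height": 768,
        "batch_size": 1,
        "n_iter": 25,  # batch count per queued task
    }
    resp = requests.post(QUEUE_ENDPOINT, json=payload)
    resp.raise_for_status()
    print("queued:", resp.json())
```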