AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI
GNU Affero General Public License v3.0
139.21k stars 26.42k forks

[Bug]: Race condition when hitting "Generate" #6898

Closed lbeltrame closed 1 year ago

lbeltrame commented 1 year ago

Is there an existing issue for this?

  • [x] I have searched the existing issues and checked the recent builds/commits

What happened?

This issue became evident when the newer Gradio versions were added as a dependency, but it also exists with older Gradio versions.

Basically, hitting "Generate" causes a series of requests to /predict, and at some point one request, for reasons yet unknown, returns a 504 error: this in turn produces a malformed JSON response that ultimately breaks the UI (pictures do not show, the "Generate" button is no longer responsive).

This doesn't occur with the older Gradio versions because they go through an SSH tunnel, but the new sharing solution from Gradio 3.13.0 onwards is much faster and triggers the behavior. It also occurs if you use some other tunneling solution like rathole (IOW, tunneling the web UI port somewhere else).

However, I'm unable to test whether this occurs with a local installation (I run this in Paperspace).

I've verified the problem with:

  • Gradio 3.15.0 (Colab, Paperspace)
  • Gradio 3.12.0 (Cloud + rathole tunnel)

Steps to reproduce the problem

  1. Use the latest WebUI version where Gradio 3.15.0 or higher is required
  2. Make sure you're on a new gradio.live share, and not the old gradio.app one
  3. Create a batch of at least 3 images
  4. Hit "Generate"

What should have happened?

The web UI should complete the request and show the images. Furthermore, newer generations should be possible.

Commit where the problem happens

d8f8bcb8

What platforms do you use to access UI ?

Other/Cloud

What browsers do you use to access the UI ?

Mozilla Firefox, Google Chrome

Command Line Arguments

--no-half-vae --api --no-progressbar-hiding --api-auth user:pass --disable-console-progressbars  --opt-sub-quad-attention --share --gradio-auth user:pass

Additional information, context and logs

No response

ataa commented 1 year ago

This issue became evident when the newer Gradio versions were added as a dependency, but it does exist

I just tested it on Colab and installed Gradio 3.16.2 successfully.

(screenshot: Google Colaboratory, 2023-01-19)

lbeltrame commented 1 year ago

It's a different problem. The issue occurs when running the webui. Many people have also complained in other issues that the newer Gradio breaks the Generate button. However, it's not tied to a specific Gradio version; I saw it with 3.16.2 as well. It's not evident unless you open the Firefox or Chrome inspector and debug the requests the webui makes while generating the image.

lbeltrame commented 1 year ago

This is what happens when the race condition is triggered:

(screenshot of the failing request)

However, I have no way to understand what's going on, certainly not by going through minified JS. Note that this is with Gradio 3.12.0, but tunneled through a different type of tunnel, not the SSH/paramiko-based solution Gradio used before 3.13.0.

ataa commented 1 year ago

There are a few Gradio fixes after the commit you're using; do a git pull, it might help.

lbeltrame commented 1 year ago

As I wrote before, this is independent of the Gradio version. It never occurs when the network link is slow enough to add some latency. Most of the fixes landing in git touch the UI, and not that specific part. The only commit touching the JS part is 3b61007a66d9f7c05fcce1a461d5907c1ce633dd, and it looks completely unrelated.

EDIT: if I were to look, https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/bb0978ecfd3177d0bfd7cacd1ac8796d7eec2d79/javascript/progressbar.js#L76 is probably one of the places where the actual issue occurs. But again, I don't know enough JS to be sure.

AI-Casanova commented 1 year ago

Thank you! I know I'm not alone.

I've been having this issue a lot, and I just noticed the command line error today in Colab.

Hopefully they can fix this, it slows me down so much to run a batch, sift through my gdrive, download, upload to PNG Info, repetitio ad nauseam 🤕

lbeltrame commented 1 year ago

I spent some time debugging this and it goes over my head (I'm a Pythonista, so I'm not too familiar with JS). From what I see, a call times out (after roughly 5000 ms) and Gradio returns a 504. The webui (or Gradio itself; I wasn't able to locate which component does this) receives "garbage" (actually an empty response), JSON parsing of the response fails, and the whole system is left in an inconsistent state.

I thought it was the webui but it might be a Gradio bug as well, as the JS in the webui parses JSON only if there's a 200 response.

I see that there are a few options related to debugging. I'll look into them later today to see what's going on.
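For illustration, one client-side way to make such a timeout fail cleanly (rather than dangling until the proxy serves a 504 page) is to abort the request after a deadline. This is only a sketch of the idea, assuming a fetch-based client; fetchWithTimeout and its fetchImpl parameter are hypothetical names, not webui or Gradio API:

```javascript
// Sketch only: abort a hung request after timeoutMs so the caller sees a
// clean, catchable failure instead of an HTML 504 page from the gateway.
async function fetchWithTimeout(url, options = {}, timeoutMs = 5000, fetchImpl = fetch) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // Rejects with an abort error once timeoutMs elapses.
    return await fetchImpl(url, { ...options, signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```

A caller can then catch the abort and show an error state instead of leaving the UI wedged.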

lbeltrame commented 1 year ago

Thanks @AUTOMATIC1111. I'll be testing these changes in a few hours and will report back if they improve the situation.

mykeehu commented 1 year ago

Unfortunately this fix still does not solve the colab problem :( https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6840 https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6914

AI-Casanova commented 1 year ago

@mykeehu My Colab instance works now with both batch grids and X/Y plot.

Still got a "Not interface running" issue which superficially looks the same until you try reloading the page.

mykeehu commented 1 year ago

I was just hoping that this fix would solve the status indicator and the Generate button bug, but no.

AI-Casanova commented 1 year ago

@mykeehu are you using txt2img or img2img?

My testing of the new commit shows expected behavior on txt2img, but I'm also missing the progress bar and previews on img2img

lbeltrame commented 1 year ago

I just tested img2img. I see both the preview and the Generate button works. txt2img works as well. I'm going to leave this open just in case, but for me c12d7dd fixes the issue.

mykeehu commented 1 year ago

On a local machine, there is nothing wrong with the button, but on colab, I press the txt2img tab, the status indicator and the cancel button flash, and then Generate appears again, as you can see in this video for others.

lbeltrame commented 1 year ago

Spoke too soon. It seems it still breaks, but only with long running tasks, such as large batches with hires fix. I'm testing the latest commit as we speak because I saw the 404 error for /internal/progress.

lbeltrame commented 1 year ago

Weird indeed. I did some runs earlier and it did not break at all. Now it's the same as before.

EDIT: I do wonder if any features of the host are impacting this. I had just started the VM and the behavior didn't manifest. Perhaps after a while under load.

If anyone can test on Colab (I can't at this time): does this happen immediately, or only after it has been running and generating things for a while?

lbeltrame commented 1 year ago

Some extra debugging I did with Chromium inspection tools:

It fails here, in the Gradio JS sources:


const mc = "This application is too busy. Keep trying!"
  , bl = "Connection errored out.";
async function wl(t, e) {
    try {
        var n = await fetch(t, {
            method: "POST",
            body: JSON.stringify(e),
            headers: {
                "Content-Type": "application/json"
            }
        })
    } catch {
        return [{
            error: bl
        }, 500]
    }
    return [await n.json(), n.status]
}

The response body is not JSON (it's actually an HTML page, a 504 Gateway Time-Out), so await n.json() throws and everything breaks. Why this 504 is triggered, I have no idea.
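A defensive variant of the wl() helper above would check the status and content type before parsing. This is only a sketch of the guard, not Gradio's actual fix; safePost and its fetchImpl parameter are illustrative names:

```javascript
// Sketch: never call res.json() blindly. A 504 from the tunnel returns an
// HTML error page, and JSON-parsing it throws and wedges the UI.
const CONNECTION_ERROR = "Connection errored out.";

async function safePost(url, payload, fetchImpl = fetch) {
  let res;
  try {
    res = await fetchImpl(url, {
      method: "POST",
      body: JSON.stringify(payload),
      headers: { "Content-Type": "application/json" },
    });
  } catch {
    return [{ error: CONNECTION_ERROR }, 500];
  }
  const type = res.headers.get("content-type") || "";
  if (!res.ok || !type.includes("application/json")) {
    // e.g. a 504 Gateway Time-Out HTML page: report it instead of throwing.
    return [{ error: `Unexpected response (HTTP ${res.status})` }, res.status];
  }
  return [await res.json(), res.status];
}
```

With a guard like this, the 504 would surface as an in-UI error rather than an uncaught exception.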

lbeltrame commented 1 year ago

I opened https://github.com/gradio-app/gradio/discussions/3029 to see whether someone more familiar with Gradio can help in debugging this.

jjtolton commented 1 year ago

@lbeltrame

[quotes the original issue report in full, including the tested configurations: Gradio 3.15.0 on Colab and Paperspace, and Gradio 3.12.0 on Cloud + rathole tunnel]

Can you describe exactly what you are seeing, in order to replicate it? You click the "generate" button, and after a while the UI breaks? In txt2img, img2img, everything... or?

lbeltrame commented 1 year ago

It needs some significant processing to break; light loads won't trigger the behavior. So far I've seen this only on Colab or Paperspace; I can't test locally. An example is generating 15 pictures at 512x512 (5 batches, batch size 3) + 2x hires fix. On a Colab instance this takes 7-8 minutes. If you watch the JS console, the error pops up halfway through the process. It breaks almost everything: I've seen it mostly in txt2img, but Extras is also affected. However, it does not break if the connection is slow enough.

What happens is that the application as a whole is in an inconsistent state. Some parts still work (e.g. settings), but generating new pictures does not (IOW, the Generate button does not do anything, which is the most evident issue), and the generated pictures do not show up.

jjtolton commented 1 year ago

I have experienced similar issues locally where the UI gets into an inconsistent state with the backend. In my case, the backend is still processing, but the "interrupt" and "skip" buttons have disappeared and the "generate" button has reappeared. Usually I just get impatient, kill the process, and restart the webui. Does that sound similar to what is happening to you?

jjtolton commented 1 year ago

Hitting the "generate" button in this case does nothing because the backend is still processing.

jjtolton commented 1 year ago

[quotes @lbeltrame's comment above about significant processing being needed to trigger the break]

@lbeltrame :point_up:

jjtolton commented 1 year ago

It sounds like the big-picture issue is that the UI relies on open-loop control to infer the state of the backend. Like any open-loop control system, if something goes off the rails, it's very difficult to recover. @lbeltrame, next time this happens, could you check the folders on Google Colab to see if your work is still processing? It's a fair question whether it's reasonable for a free tool to support fixing this issue (which, if I'm correct, is more of a UX issue than a functionality issue). If you are able to get your pictures and wait for the process to finish (perhaps by checking the console output?), the "Generate" button should work again (if not, restarting the application would get it working again).

So I see three possibilities:

  1. This is not an accurate description of the problem and the entire app dies, you are unable to get your work, images do not continue processing, and/or there is a great deal of difficulty in restarting the app
  2. You are able to obtain your images and wait for the work to continue processing and either restart the app without much difficulty and/or the generate button works again, and this is an acceptable conclusion
  3. Same as item 2, except you feel that this is not an acceptable UX tradeoff and you feel more guard rails should be put in place to reinforce the open loop control between the UI and the backend

lbeltrame commented 1 year ago

To answer your questions:

  1. To me it looks like a pure frontend problem. The backend keeps going and processes the images, but communication with the frontend is partially disrupted: the most evident symptom is that at the end of a generation, images are not displayed, which means that if for whatever reason you haven't checked "always save generated images" in the settings, they are lost. This is problematic for places like Colab or Paperspace because of disk space constraints (Paperspace in particular).
  2. As I said, on a "regular" machine it probably would not be visible, because it only happens when there is significant load and the backend (I think) does not respond to the frontend in time (approx 6000 ms later). The issue may even be minor, in the sense that there's an uncaught exception: an acceptable workaround (if the problem is there) would be to actually guard against it.
  3. I think it's not an acceptable UX tradeoff, because you have "data loss" (quotes intended) if you don't save images while generating. It's also pretty bad UX, because you have to reload after every generation. I also think (but I need to check again) that other parts of the UI start behaving erratically because events are no longer processed properly.

EDIT: I know what you mean about the backend actually working while the buttons don't display, or stay at "Generate". However, this is significantly worse, because in the former case the state is at least sometimes restored. In this case, it is unrecoverable.

jjtolton commented 1 year ago

Makes sense. Will investigate possible remediations. It seems like a debounce on the callbacks will solve the timing issues. As long as it doesn't cause any breaking changes, it shouldn't be too hard to implement. There are some other options, but @AUTOMATIC1111 identified a solid possible implementation for the debounce, so as long as it is scalable (doesn't break all the existing tools that use the "generate" button and fits the design philosophy), I'll be happy to put it in.
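For reference, a debounce in this sense coalesces rapid-fire invocations (e.g. repeated Generate clicks) into a single trailing call. A minimal sketch, not the webui's actual implementation:

```javascript
// Minimal trailing-edge debounce: only the last call within a burst fires,
// waitMs after the burst goes quiet.
function debounce(fn, waitMs) {
  let timer = null;
  return function (...args) {
    if (timer !== null) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = null;
      fn.apply(this, args); // invoke with the most recent arguments
    }, waitMs);
  };
}
```

Wrapping a click handler in debounce(handler, 100) would drop the duplicate events that pile up while a request is in flight.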

AUTOMATIC1111 commented 1 year ago

try the new --gradio-queue option. it uses a different way to connect to the server that may be protected from a timeout
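For context on why queueing helps here: instead of one long-lived POST that a gateway can kill with a 504, a queue splits the work into a short submit request plus short status checks, so no single request outlives the proxy's timeout. The sketch below shows the general submit-then-poll pattern only; Gradio's queue actually uses a websocket, and the /queue/submit and /queue/status endpoints here are made up for illustration:

```javascript
// Generic submit-then-poll sketch (hypothetical endpoints, not Gradio's routes).
async function runViaQueue(baseUrl, payload, fetchImpl = fetch, sleepMs = 500) {
  // Enqueue the job with one short request.
  const submit = await fetchImpl(`${baseUrl}/queue/submit`, {
    method: "POST",
    body: JSON.stringify(payload),
    headers: { "Content-Type": "application/json" },
  });
  const { jobId } = await submit.json();
  // Poll with short requests until the job completes; each request is
  // well under any gateway timeout, however long the job itself runs.
  for (;;) {
    const res = await fetchImpl(`${baseUrl}/queue/status?job=${jobId}`);
    const status = await res.json();
    if (status.done) return status.result;
    await new Promise((r) => setTimeout(r, sleepMs));
  }
}
```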

lbeltrame commented 1 year ago

On it, will try. I'll report back.

lbeltrame commented 1 year ago

Tests (ongoing; I will update this post as they finish).

Systems tested: Paperspace and Colab, through the gradio.live link.

The issue looks fixed to me when queueing is enabled. To those reading: if you have a guide on using this webui on a shared platform, you might want to suggest using this option (everyone else can move along, nothing to see here).

Final result: It handled everything I threw at it, including long running generations with lots of computation time. I can call the issue fixed for me at least.

And thanks to @AUTOMATIC1111 for adding this option. I was about to test this myself, but you got there first (and reading the Gradio docs, I see why it would work better for this specific setup).

AUTOMATIC1111 commented 1 year ago

Do you also get progress bar and live previews?

AUTOMATIC1111 commented 1 year ago

Also if relevant I plan to make gradio queue on by default after this is fixed:

https://github.com/gradio-app/gradio/issues/2980 https://github.com/gradio-app/gradio/pull/3022

lbeltrame commented 1 year ago

@AUTOMATIC1111 I had live previews and progress bar working OK. (Linux, Firefox 108.0). Both on Paperspace and Colab (going through the gradio.live link).

vt-idiot commented 1 year ago

Still bugged out without using the new "queue" option. Going to try relaunching with it now to see if it's fixed

python: 3.8.10  •  torch: 1.13.1+cu116  •  xformers: 0.0.15.dev+4c06c79.d20221205  •  gradio: 3.16.2  •  commit: f53527f7  •  checkpoint: 7845f59493

vt-idiot commented 1 year ago

Do you also get progress bar and live previews?

No change whatsoever here, even with --gradio-queue. Clicking generate shows the progress bar and the two grey buttons for a split second, then the preview window disappears, Generate turns red, and the UI is completely unresponsive until the final image eventually shows up. The image can still be seen generating (and hi-res fixing) in the console output.

See 2nd comment. It's an improvement for sure, but I wouldn't call it fixed.

vt-idiot commented 1 year ago

Clicking Generate without hi-res fix will still show the output image at the end, but no progress bar, no previews.

There's a 404 error shown for xyz.gradio.live/internal/progress

image

There's a biiiiiiiiiiig error when starting a generation (even 1 batch of 1 image) with hi-res fix on

Uncaught (in promise) TypeError: L[zt[Gt]] is undefined
    st index.98155427.js:76
    st index.98155427.js:76
    promise callback*At/</</< index.98155427.js:76
    le index.98155427.js:4
    le index.98155427.js:4
    b Checkbox.svelte:14
    Vn index.98155427.js:4
    Vn index.98155427.js:4
    f Checkbox.svelte:17
    u Checkbox.svelte:27
    K index.98155427.js:1
    m Checkbox.svelte:34
    pt index.98155427.js:4
    m Checkbox.svelte:24
    m index.98155427.js:76
    m index.98155427.js:76
    pt index.98155427.js:4
    m Checkbox.svelte:20
    pt index.98155427.js:4
    m index.98155427.js:34
    pt index.98155427.js:4
    m index.98155427.js:34
    m index.98155427.js:34
    m index.98155427.js:34
    m Row.svelte:10
    pt index.98155427.js:4
    m index.98155427.js:34
    pt index.98155427.js:4
    m index.98155427.js:34
    m index.98155427.js:34
    m index.98155427.js:34
    m Column.svelte:13
    pt index.98155427.js:4
    m index.98155427.js:34
    pt index.98155427.js:4
    m index.98155427.js:34
    m index.98155427.js:34
    m index.98155427.js:34
    m Row.svelte:10
    pt index.98155427.js:4
index.98155427.js:76:2903 (https://gradio.s3-us-west-2.amazonaws.com/3.16.2/assets/index.98155427.js)

[the same stack trace repeats, with the forEach self-hosted:203 and async-callback frames expanded]

The UI is at least functional now after the generation though - the hi-res fix image shows up and I can generate a new image or set of images.

lbeltrame commented 1 year ago

Those errors, FTR, were there even before the recent commits. Even so, the specific issue from this bug report, namely POST timeouts breaking the UI completely, looks absent from your report.

vt-idiot commented 1 year ago

Those errors, FTR, were there even before the recent commits.

I never bothered investigating what it looked like before. I'm wondering why I can't get a progress bar or preview but you can, though. --gradio-queue did fix the POST timeouts breaking the UI completely, yes; it is functional again once the output image finally pops up, no more hard refresh required.

mykeehu commented 1 year ago

[quotes vt-idiot's comment above about --gradio-queue not helping]

Same here with --gradio-queue on commit f53527f on Colab.

lbeltrame commented 1 year ago

IMO these are different issues and deserve a separate issue. The actual issue described here is fixed.

lbeltrame commented 1 year ago

Closing, since the issue referenced is fixed.

(But I'd argue against calling it "mislabeled": it is a sort of race, because if the connection to the webui is slow enough, POST requests won't come as fast and won't trigger the behavior.)