ClashSAN opened 1 year ago
How is it possible that the results were accelerated before the commit? Could you explain in more detail? I'd like to understand.
Yeah, sorry. Most of the tests (the previous part I wrote) were done with https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/0b8911d883118daa54f7735c5b753b5575d9f943. I went back to test this flag: --opt-sub-quad-attention
but I also tested --upcast-sampling on the brkirch branch, as a replacement for --no-half:
--precision full --no-half --lowvram
--opt-sub-quad-attention
--opt-sub-quad-attention --no-half-vae
--opt-sub-quad-attention --upcast-sampling
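For anyone wanting to reproduce these runs: these combinations are normally passed via the COMMANDLINE_ARGS variable in webui-user.sh (or webui-user.bat on Windows). A minimal sketch, assuming the standard launcher scripts; only one combination should be active per run:

```shell
# webui-user.sh — uncomment exactly one line per test run.
# These are the flag combinations listed above.
export COMMANDLINE_ARGS="--precision full --no-half --lowvram"
# export COMMANDLINE_ARGS="--opt-sub-quad-attention"
# export COMMANDLINE_ARGS="--opt-sub-quad-attention --no-half-vae"
# export COMMANDLINE_ARGS="--opt-sub-quad-attention --upcast-sampling"
```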
"before the commit the results were accelerated?"
No. On both commits where I'm testing various flags, inference is hardware accelerated, but the results are mostly black, and larger sizes cause the device to stall and crash.
I still have hope for this system. My 4 GB laptop GPU can generate at 4.6 it/s at batch size 17 (512x512, in parallel) with xformers.
I got it working on Windows at 3.6 s/it (512x512). https://github.com/lshqqytiger/stable-diffusion-webui-directml/discussions/14
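As an aside (not part of the original thread): the two figures quoted in this exchange use inverse units, which are easy to confuse. A minimal sketch normalizing both to iterations per second:

```python
def to_it_per_s(value: float, unit: str) -> float:
    """Normalize a speed reading to iterations per second.

    "it/s" is iterations per second; "s/it" is seconds per
    iteration, i.e. its reciprocal.
    """
    if unit == "it/s":
        return value
    if unit == "s/it":
        return 1.0 / value
    raise ValueError(f"unknown unit: {unit}")

# The two numbers reported in this thread:
xformers_speed = to_it_per_s(4.6, "it/s")   # Linux laptop GPU, xformers, batch 17
directml_speed = to_it_per_s(3.6, "s/it")   # Windows DirectML fork

print(f"{xformers_speed:.2f} it/s vs {directml_speed:.2f} it/s")
```

Note that the xformers run used batch size 17, so per-image throughput is not directly comparable between the two.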
That's great; I'd never seen that DirectML fork before. I guess the official webui could implement that as well, then.
Have you been able to make it work on Linux?
Nope, and it peeves me. It could be much faster than Windows, and I could use all 10 GB of VRAM for training. (DirectML takes up all 4 GB of dedicated VRAM plus the expanded 6 GB of shared GPU memory.)
I recently went back to redo and replicate my tests, trying additional parameters. Could you add this part below the section I made?
Update:
Old webui commit: 0b8911d; brkirch branch: 2cc07719.
I can verify that on the old webui commit (1 day before the automatic Linux install) the Anything-V3-pruned fp32 model gives accelerated speeds, and the 4 GB of allocated GPU memory is being used. The output is almost always black, with an occasional blank badge picture. There is an initial 40-second hangup when first running inference for your instance, and again when you switch sizes; I alternated between 256x256 and 192x256. When running in CPU mode instead, it is slower, but of course yields actual results. Larger sizes crash the machine. This round I tested combinations of --opt-sub-quad-attention, --upcast-sampling, --no-half-vae, and --opt-split-attention-v1 (lower memory) on both the new and old commits. I'd like to try https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3556#issuecomment-1419399947 next.