daniandtheweb / ai-info-for-amd-gpus

A compilation of information on how to run AI workloads on AMD GPUs.

More Valve Steam Deck detail for previous test #3

Open · ClashSAN opened 1 year ago

ClashSAN commented 1 year ago

I recently went back to redo/replicate my tests, trying additional parameters. Could you add this part below the section I made?

Update (old webui commit 0b8911d; brkirch branch 2cc07719): I can verify that on the old webui commit (one day before the automatic Linux install) the Anything-V3-pruned fp32 model gives accelerated speeds, and the 4 GB allocated to the GPU is actually being used. The output is almost always black, with an occasional blank badge picture. There is an initial 40-second hang the first time inference runs for an instance, and again whenever you switch sizes; I alternated between 256x256 and 192x256. Running in CPU mode instead is slower, but of course yields actual results. Larger sizes crash the machine. This round I tested --opt-sub-quad-attention, --upcast-sampling, --no-half-vae, and --opt-split-attention-v1 (lower memory) in combinations on both the new and old commits. Would like to try https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/3556#issuecomment-1419399947 next
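A minimal sketch of reproducing the old-commit test above, assuming a stock webui checkout; the commit hash comes from this comment, while the flag combination shown is just one of those tested and the model placement is an assumption:

```bash
# Check out the old webui commit (one day before the automatic Linux install)
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui
cd stable-diffusion-webui
git checkout 0b8911d

# Put the fp32 checkpoint in models/Stable-diffusion/, then launch with
# one of the tested flag combinations (this one is the low-VRAM baseline)
python launch.py --precision full --no-half --lowvram
```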
daniandtheweb commented 1 year ago

How is it possible that the results were accelerated before the commit? Could you explain in more detail? I'd like to understand better.

ClashSAN commented 1 year ago

Yeah, sorry. Most of the tests (the previous part I wrote) were done with https://github.com/AUTOMATIC1111/stable-diffusion-webui/commit/0b8911d883118daa54f7735c5b753b5575d9f943. I went back to test this flag: --opt-sub-quad-attention

I also tested --upcast-sampling on the brkirch branch as a replacement for --no-half. The combinations I tried (a launch sketch follows the list):

--precision full --no-half --lowvram 
--opt-sub-quad-attention
--opt-sub-quad-attention --no-half-vae
--opt-sub-quad-attention --upcast-sampling
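For anyone replicating these runs, a minimal sketch: the usual way to pass these is through COMMANDLINE_ARGS in webui-user.sh rather than editing launch commands by hand. The specific combination shown is just the last one from the list above.

```bash
# webui-user.sh (excerpt): pick one flag combination per run
export COMMANDLINE_ARGS="--opt-sub-quad-attention --upcast-sampling"

# then start the webui as usual:
# ./webui.sh
```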

> How is it possible that the results were accelerated before the commit?

No: on both commits where I'm testing the various flags, the results are hardware accelerated, the output is mostly black, and larger sizes cause the device to stall and crash.

I still have hope for this system. My 4 GB laptop GPU can generate at 4.6 it/s at batch size 17, 512x512, in parallel, with xformers.

ClashSAN commented 1 year ago

I got it working on Windows, at 3.6 s/it at 512x512. https://github.com/lshqqytiger/stable-diffusion-webui-directml/discussions/14
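For a rough sense of scale against the laptop number above, a back-of-the-envelope throughput comparison; the 20-step sampler and the batch size of 1 on DirectML are assumptions:

```bash
# Laptop GPU with xformers: 4.6 it/s at batch size 17.
# One 20-step batch takes 20 / 4.6 ≈ 4.3 s, i.e. roughly 3.9 images/s.
echo "scale=3; 17 / (20 / 4.6)" | bc    # prints 3.910

# Steam Deck via DirectML: 3.6 s/it, assumed batch size 1.
# One 20-step image takes 20 * 3.6 = 72 s, i.e. roughly 0.014 images/s.
echo "scale=4; 1 / (20 * 3.6)" | bc     # prints .0138
```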

daniandtheweb commented 1 year ago

That's great, I've never seen that fork with DirectML before. I guess the official webui could implement that as well, then.

daniandtheweb commented 1 year ago

Have you been able to make it work on Linux?

ClashSAN commented 1 year ago

Nope, and it peeves me. On Linux it could be much faster than on Windows, and I could use all 10 GB of VRAM for training (DirectML takes up all 4 GB of dedicated VRAM plus the expanded 6 GB of shared GPU memory).
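In case it helps with the Linux attempt, the commonly reported workaround for RDNA2 APUs like the Deck's (gfx1033, not officially supported by ROCm) is to spoof a supported GPU target before launching; a sketch, untested on this exact setup:

```bash
# Make ROCm treat the Deck's gfx1033 APU as the supported gfx1030 target
export HSA_OVERRIDE_GFX_VERSION=10.3.0

# Optional: confirm what ROCm sees and how much VRAM is in use
rocminfo | grep gfx
rocm-smi --showmeminfo vram

# Then launch the webui as usual
./webui.sh
```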