AUTOMATIC1111 / stable-diffusion-webui-tensorrt

MIT License
311 stars · 20 forks

can't generate images above 768x768 resolution #10

Open pellaaa93 opened 1 year ago

pellaaa93 commented 1 year ago

Testing the extension, it works very well (great performance boost on 512x512 generations), but I can't create TensorRT models that can generate images above 768x768... If I set Maximum width/height to a value above 768 I get this error and can't convert the model... any idea?

[05/28/2023-16:40:25] [E] Error[4]: kOPT values for profile 0 violate shape constraints: /input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Einsum_output_0: tensor volume exceeds (2^31)-1, dimensions are [16,E0,E0] where E0=( height width) Volume exceeds 2^31-1.
[05/28/2023-16:40:25] [E] Error[4]: [shapeCompiler.cpp::nvinfer1::builder::DynamicSlotBuilder::evaluateShapeChecks::1276] Error Code 4: Internal Error (kOPT values for profile 0 violate shape constraints: /input_blocks.1/input_blocks.1.1/transformer_blocks.0/attn1/Einsum_output_0: tensor volume exceeds (2^31)-1, dimensions are [16,E0,E0] where E0=( height width) Volume exceeds 2^31-1.)
[05/28/2023-16:40:25] [E] Engine could not be created from network
[05/28/2023-16:40:25] [E] Building engine failed
[05/28/2023-16:40:25] [E] Failed to create engine from model or file.
[05/28/2023-16:40:25] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\extensions\stable-diffusion-webui-tensorrt-master\TensorRT-8.6.1.6\bin\trtexec.exe --onnx=E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\models\Unet-onnx\dreamshaper_6BakedVae.onnx --saveEngine=E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\models\Unet-trt\dreamshaper_6BakedVae.trt --minShapes=x:2x4x64x64,context:2x77x768,timesteps:2 --maxShapes=x:2x4x112x112,context:2x77x768,timesteps:2 --fp16

Error completing request
Arguments: ('', 'E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\models\Unet-onnx\dreamshaper_6BakedVae.onnx', 1, 1, 75, 75, 512, 896, 512, 896, True, '') {}
Traceback (most recent call last):
  File "E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\modules\call_queue.py", line 57, in f
    res = list(func(*args, *kwargs))
  File "E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\modules\call_queue.py", line 37, in f
    res = func(args, **kwargs)
  File "E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\extensions\stable-diffusion-webui-tensorrt-master\ui_trt.py", line 69, in convert_onnx_to_trt
    launch.run(command, live=True)
  File "E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\modules\launch_utils.py", line 101, in run
    raise RuntimeError("\n".join(error_bits))
RuntimeError: Error running command.
Command: "E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\extensions\stable-diffusion-webui-tensorrt-master\TensorRT-8.6.1.6\bin\trtexec.exe" --onnx="E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\models\Unet-onnx\dreamshaper_6BakedVae.onnx" --saveEngine="E:\Programmi\Vision of Chaos\MachineLearning\Text To Image\stable-diffusion-webui-dev\models\Unet-trt\dreamshaper_6BakedVae.trt" --minShapes=x:2x4x64x64,context:2x77x768,timesteps:2 --maxShapes=x:2x4x112x112,context:2x77x768,timesteps:2 --fp16
Error code: 1
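The failing constraint can be reproduced with plain arithmetic. In the error, the offending tensor is [16, E0, E0] with E0 = height x width in latent units; the leading 16 is presumably 2 (cond/uncond) x batch x 8 attention heads for an SD 1.x UNet (that decomposition is my assumption, not stated in the log). TensorRT caps any single tensor at 2^31 - 1 elements, which is exactly what rules out 1024x1024:

```python
# Back-of-the-envelope check (not part of the extension):
# TensorRT refuses any tensor whose element count exceeds 2**31 - 1.
# Assumed attention tensor shape: [2 * batch * heads, E0, E0],
# with heads = 8 for SD 1.x and E0 = (H/8) * (W/8) latent tokens.

INT32_MAX = 2**31 - 1  # 2147483647

def attn_volume(width, height, batch=1, heads=8):
    """Element count of the largest self-attention tensor (assumed shape)."""
    e0 = (height // 8) * (width // 8)  # latent tokens at full resolution
    return 2 * batch * heads * e0 * e0

def fits_tensorrt(width, height, batch=1):
    return attn_volume(width, height, batch) <= INT32_MAX

# 768x768 builds; 896x896 and 1024x1024 blow past the cap:
print(fits_tensorrt(768, 768))    # True  (1,358,954,496 elements)
print(fits_tensorrt(896, 896))    # False (2,517,630,976 elements)
print(fits_tensorrt(1024, 1024))  # False (4,294,967,296 elements)
```

Under these assumptions the largest square resolution that fits at batch size 1 is 832x832, which matches what people report further down the thread.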

MoreColors123 commented 1 year ago

I'm not 100% sure, but I may have read somewhere that it's not possible to go above 768. What were your setup arguments? Here it didn't even start the conversion with a batch size higher than 1, on a 3060 12 GB.

Conversion is running right now and I'm quite excited about the results. We should start uploading converted models somewhere, what do you guys think?

pellaaa93 commented 1 year ago

Everything default except maximum width/height set to 768 (the max possible right now) and of course the model path... If it can't go above 768 that's very sad, because the performance boost is huge but that resolution is too low for my workflow.

gabriel-peracio commented 1 year ago

Check out this forum reply from NVIDIA: https://forums.developer.nvidia.com/t/tensor-volume-exceeds-2-31-1/203701/8

I think uploading converted models is sort of a moot point; there are way too many variations, it's a combinatorial explosion.

Also, does it produce different outputs for different hardware, even if you use the same settings?

pellaaa93 commented 1 year ago

Damn, looks like a limitation that can't be bypassed easily.

neurogen-dev commented 1 year ago

As such, there should be no hard limit. I checked other, separate TensorRT-based implementations of Stable Diffusion, and resolutions greater than 768 worked there. So maybe we just need to find a solution for this implementation from AUTOMATIC1111.

AugmentedRealityCat commented 1 year ago

If we can get this to work with Tiled Diffusion and Tiled VAE, then I suppose it could work at any resolution, as long as the tile size remains equal to the one used to generate the TRT file (512x512, for example).

This is just something I suppose, it might be completely wrong.
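The tile-size supposition can be sanity-checked with a little arithmetic. Assuming a MultiDiffusion-style sliding window (this stride formula is an assumption on my part, not the extension's actual scheduling code), and that Tiled Diffusion's tile sizes are in latent units, a 96-unit tile covers 96 * 8 = 768 px, which stays within a 768-max TRT engine, while the overlapping tiles cover the full 4096x2048 canvas:

```python
import math

# Sketch of sliding-window tile counting (assumed MultiDiffusion-style
# stride; the real extension may schedule tiles differently).
# Tile and overlap sizes are in latent units: 1 latent unit = 8 px.

def tiles_along(latent_size, tile=96, overlap=48):
    """Number of overlapping tiles needed to cover one latent axis."""
    stride = tile - overlap
    return math.ceil(max(latent_size - tile, 0) / stride) + 1

nx = tiles_along(4096 // 8)  # 512 latent units wide
ny = tiles_along(2048 // 8)  # 256 latent units tall
print(nx, ny, nx * ny)       # tile grid covering the 4096x2048 image
```

Each of those tiles is denoised by the fixed-shape TRT engine independently, which would also explain the per-tile coherence problem described below: every window sees the whole prompt but only its own patch of latent.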

EDIT: IT (almost) WORKS! I just created a 4096x2048 image using Tiled Diffusion + Tiled VAE, but as you can see there are image coherence problems, with each tile showing its own take on the prompt. This whole 4K image was generated in 42 seconds!

[attached image: 00062-1852145334]

a train locomotive <lora:detailmaker(1):0.5> <lora:epiNoiseoffset_v2:1>
Steps: 20, Sampler: DDIM, CFG scale: 7, Seed: 1852145334, Size: 4096x2048, Model hash: 02aecf0c7d, Model: revAnimated_v12, Tiled Diffusion: {"Method": "MultiDiffusion", "Tile tile width": 96, "Tile tile height": 96, "Tile Overlap": 48, "Tile batch size": 1}, Lora hashes: "detailmaker(1): e1b1a08b43b5, epiNoiseoffset_v2: d1131f7207d6", Version: v1.3.0-71-gf9809e6e

Time taken: 41.82s

Torch active/reserved: 12864/17494 MiB, Sys VRAM: 23922/24564 MiB (97.39%)

To compare, here is the same prompt, same seed, same parameters, but without TensorRT optimization. It's much slower (render time almost doubles, to 1m14s), and the image remains a big mess because this is from txt2img without highres fix:

[attached image: 00063-1852145334]

a train locomotive <lora:detailmaker(1):0.5> <lora:epiNoiseoffset_v2:1>
Steps: 20, Sampler: DDIM, CFG scale: 7, Seed: 1852145334, Size: 4096x2048, Model hash: 02aecf0c7d, Model: revAnimated_v12, Tiled Diffusion: {"Method": "MultiDiffusion", "Tile tile width": 96, "Tile tile height": 96, "Tile Overlap": 48, "Tile batch size": 1}, Lora hashes: "detailmaker(1): e1b1a08b43b5, epiNoiseoffset_v2: d1131f7207d6", Version: v1.3.0-71-gf9809e6e

Time taken: 1m 13.67s

Torch active/reserved: 15081/18816 MiB, Sys VRAM: 22614/24564 MiB (92.06%)

AugmentedRealityCat commented 1 year ago

I'm currently running a test at 768x768 max resolution (after failing with 1024x1024 like everyone else), and something we can observe is the following message in the log:

[libprotobuf WARNING **************************************************************************\externals\protobuf\3.0.0\src\google\protobuf\io\coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING **************************************************************************\externals\protobuf\3.0.0\src\google\protobuf\io\coded_stream.cc:81] The total number of bytes read was 1721624086

The important part here is the number 2147483647, which is exactly equal to (2^31)-1, which is described as the limit in the error message when you try 1024x1024 or more.

So, my question is: how can we change that limit in CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h? Does anyone have experience with this and protobuf?
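Worth noting: these appear to be two different 2^31-1 caps. The libprotobuf warning is about the serialized size of the ONNX file being parsed (here about 1.7 GB of UNet weights), while the trtexec failure earlier in the thread is about the element count of a single tensor inside the network, so raising the protobuf limit alone likely wouldn't fix the >768 builds. A quick sanity check of the numbers in the log (plain arithmetic, no protobuf involved):

```python
# The two constants from the libprotobuf warning above.
PROTOBUF_CAP = 2147483647  # default CodedInputStream total-bytes limit
BYTES_READ = 1721624086    # "total number of bytes read" for this ONNX model

# The cap is exactly the int32 maximum, same number as TensorRT's
# tensor-volume limit, but applied to message bytes rather than elements.
print(PROTOBUF_CAP == 2**31 - 1)

# How close this ONNX file already is to the protobuf parsing cap:
print(round(100 * BYTES_READ / PROTOBUF_CAP, 1), "% of the cap")
```

So the model file is already at roughly 80% of protobuf's default parsing limit; a larger checkpoint exported the same way could hit that wall too, independently of the tensor-volume error.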

2Raven2 commented 1 year ago

I tested how far I could push the resolution, and was able to go up to 832x832 (minimum 512x512) with a batch size of 1 and 150 tokens max.

I also noticed that you're able to lower the shape size by matching the minimum and maximum resolution (or bringing them closer together), which may let you get a higher batch size/token count. Obviously this comes at a cost to flexibility, because your TRT model will then only work at that set resolution.
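The "match min and max" trick can be read straight off the trtexec command in the original error, where shape ranges are passed via --minShapes/--maxShapes. A small sketch (hypothetical helper, not the extension's code) of pinning both to one resolution:

```python
# Hypothetical helper mirroring the trtexec flags seen in the log above:
# setting --minShapes equal to --maxShapes yields a fixed-resolution engine,
# which keeps the optimization profile as small as possible.

def fixed_shape_args(width, height, batch=1, tokens=77):
    b = 2 * batch                      # cond + uncond halves, as in the log
    lw, lh = width // 8, height // 8   # latent dims are pixel dims / 8
    shape = f"x:{b}x4x{lh}x{lw},context:{b}x{tokens}x768,timesteps:{b}"
    return [f"--minShapes={shape}", f"--maxShapes={shape}"]

print(fixed_shape_args(512, 512))
```

For 512x512 at batch 1 this reproduces the exact minShapes string from the logged command (x:2x4x64x64,context:2x77x768,timesteps:2); the original error used a larger maxShapes of 112x112 latent (896 px), which is what tripped the volume check.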

JilekJosef commented 1 year ago

> how can we change that limit in CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h

Well, you can probably recompile it, or find out if someone has done it already.

[attached image]

MoreColors123 commented 1 year ago

I just got it to compile an ONNX model with 512 max width and height, 150 max tokens and batch size 6, on a 3060 12 GB. I don't understand why that works all of a sudden, because last time I tried anything above batch size 2 it wouldn't even start compiling. But it works now, and I get 2.45 it/s when generating 6 images at 512x512, which means 14.7 it/s total.

eyeweaver commented 1 year ago

I think the problem is with the max width and height, not anything else. It wouldn't convert the model for any height or width above 768, and even a model converted with 768px max height and width doesn't work: it gives you error messages whenever you try to generate any images with it.

jebarpg commented 1 year ago

I made a fix for all these issues. You can check out my fork with the changes here: https://github.com/jebarpg/stable-diffusion-webui-tensorrt

I did all the manual testing and discovered the limits of all the shapes you can create with max width, height and batch size. The best combination I have found is batch size 7 with max width 512 and max height 512. You can max out the max tokens; it has no effect on the shape size limit. Only max batch size, max width and max height have any effect. I also discovered a base number for every batch size from 1 to 11, which tells you how far you can slide the max width and height. You will get a red label signaling that you are over the limit, and green otherwise. I've created a pull request, so hopefully it gets integrated.

Also, you can now do batch processing instead of just one model at a time, for both the ONNX files and the TRT files. NOTE that your settings for max width, height, batch size, tokens etc. will apply to the entire batch. Let me know what you all think.
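The "base number per batch size" can plausibly be reconstructed from the 2^31 - 1 tensor-volume cap in the original error, assuming the attention tensor is [2 x batch x 8, E0, E0] (my derivation, not code taken from the fork). It reproduces both the 832x832 ceiling at batch 1 reported above and the batch 7 at 512x512 sweet spot:

```python
import math

# Hypothetical reconstruction of the per-batch-size resolution limit,
# derived from TensorRT's 2**31 - 1 element cap on a single tensor.
# Assumed attention shape: [2 * batch * heads, E0, E0], heads = 8 (SD 1.x).

INT32_MAX = 2**31 - 1

def max_square_side(batch, heads=8):
    """Largest square resolution in px (multiple of 64) fitting the cap."""
    max_tokens = math.isqrt(INT32_MAX // (2 * batch * heads))  # E0 bound
    latent_side = math.isqrt(max_tokens)                       # square latent
    return (latent_side * 8) // 64 * 64                        # floor to 64 px

for bs in range(1, 12):
    print(bs, max_square_side(bs))
```

Under these assumptions batch 1 tops out at 832x832, batch 7 is the largest batch that still allows 512x512, and batch 8 drops below 512, which lines up with the limits reported in this thread.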