cumulo-autumn / StreamDiffusion

StreamDiffusion: A Pipeline-Level Solution for Real-Time Interactive Generation
Apache License 2.0

Anyone able to get TensorRT running with StreamDiffusion on an RTX 2050 (with 4GB VRAM)? #151

Open · mikecarnohan opened this issue 5 months ago

mikecarnohan commented 5 months ago

I have a build with everything needed for SD+TRT installed, but when I run SD with TensorRT enabled, I run into memory issues.

I believe these issues can be worked around by doing the ONNX conversion on CPU (per @yoinked-h on GitHub):

- Regular-to-ONNX conversion can be done on CPU.
- Use torch CPU and launch with `--skip-torch-cuda-test --no-half --precision full`.
- Temporarily remove the CUDA imports from `trt.py` in `/scripts`.
- In `export_onnx.py`, replace `devices.device` with `"cpu"`, set `devices.dtype` to `torch.float`, and remove `with devices.autocast():`.
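For what it's worth, those flags are stable-diffusion-webui launcher flags, so (assuming you're exporting from a webui checkout, which is my reading of the quote) the CPU-export launch would look roughly like this:

```shell
# Sketch, assuming a stable-diffusion-webui checkout with launch.py.
# Install the CPU-only torch wheel so the export never touches CUDA:
pip install torch --index-url https://download.pytorch.org/whl/cpu

# Then launch with the flags quoted above:
python launch.py --skip-torch-cuda-test --no-half --precision full
```

The `--no-half --precision full` pair matters because the CPU export path runs in float32; half precision on CPU either errors out or is painfully slow.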

But at this point, with a vanilla build of SD + TensorRT (10.0), I get the following error:

Using TensorRT...

Compiling TensorRT UNet...
This may take a moment...

Found cached model: engines\stabilityai/sd-turbo--lcm_lora-True--tiny_vae-True--max_batch-2--min_batch-2--mode-img2img\unet.engine.onnx
Found cached model: engines\stabilityai/sd-turbo--lcm_lora-True--tiny_vae-True--max_batch-2--min_batch-2--mode-img2img\unet.engine.opt.onnx
Building TensorRT engine for engines\stabilityai/sd-turbo--lcm_lora-True--tiny_vae-True--max_batch-2--min_batch-2--mode-img2img\unet.engine.opt.onnx: engines\stabilityai/sd-turbo--lcm_lora-True--tiny_vae-True--max_batch-2--min_batch-2--mode-img2img\unet.engine
[libprotobuf WARNING **************************************************************************\externals\protobuf\3.0.0\src\google\protobuf\io\coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING **************************************************************************\externals\protobuf\3.0.0\src\google\protobuf\io\coded_stream.cc:81] The total number of bytes read was 1733267037
[I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 64, 64), opt=(2, 4, 64, 64), max=(2, 4, 64, 64)).add('timestep', min=(2,), opt=(2,), max=(2,)).add('encoder_hidden_states', min=(2, 77, 1024), opt=(2, 77, 1024), max=(2, 77, 1024))]
[W] profileSharing0806 is on by default in TensorRT 10.0. This flag is deprecated and has no effect.
[I] Building engine with configuration:
    Flags                  | [FP16]
    Engine Capability      | EngineCapability.STANDARD
    Memory Pools           | [WORKSPACE: 4095.50 MiB, TACTIC_DRAM: 4095.50 MiB, TACTIC_SHARED_MEMORY: 1024.00 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.LAYER_NAMES_ONLY
    Preview Features       | [PROFILE_SHARING_0806]
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[W] Requested amount of GPU memory (1469054976 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[W] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 1469054976 detected for tactic 0x0000000000000000.
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[W] Requested amount of GPU memory (1458569216 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[W] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 1458569216 detected for tactic 0x0000000000000000.
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[W] Requested amount of GPU memory (1500512256 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[W] UNSUPPORTED_STATE: Skipping tactic 0 due to insufficient memory on requested size of 1500512256 detected for tactic 0x0000000000000000.
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[E] 2: [virtualMemoryBuffer.cpp::nvinfer1::StdVirtualMemoryBufferImpl::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[W] Requested amount of GPU memory (325058560 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
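For scale, the byte counts in the log work out to surprisingly round MiB figures, and they can be compared against the card's 4 GiB. This is just arithmetic on the numbers above, nothing more:

```python
# Tactic allocations the TensorRT builder skipped (bytes, from the log above)
requested = [1469054976, 1458569216, 1500512256, 325058560]

for r in requested:
    print(f"{r / 2**20:7.1f} MiB requested")   # 1401.0, 1391.0, 1431.0, 310.0

# The ONNX UNet protobuf read earlier in the log:
unet_bytes = 1733267037
print(f"UNet ONNX alone: {unet_bytes / 2**30:.2f} GiB of a 4 GiB card")  # ~1.61 GiB
```

So even the smallest skipped tactic (310 MiB) fails, which suggests the model weights plus builder scratch space have already eaten essentially all of the 4 GiB before tactic selection starts.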

If anyone has gotten further and can share how they got it working, it would be greatly appreciated. I'd like to use my pre-production laptop to write TouchDesigner code without being tethered to my production machine. Even though I don't need high frame rates for that, it would be nice to get past 2 fps (my frame rate without TensorRT), so I can tell which noise map and step scheduling settings will work best before I ship my project files.

mikecarnohan commented 5 months ago

To clarify, running SD without TensorRT works fine. SD-Turbo works fine. And other optimized (tensor-based) models work fine. The memory problem only comes up under SD + TensorRT builds.

yoinked-h commented 5 months ago

RTX 2050... ouch. Maybe not possible? [For context, I barely ran this on 6 GB of VRAM (a 1660).] AFAIK ONNX/TRT needs something like 4 GB (?) of free VRAM minimum, so depending on your setup it might literally be impossible.

[copypasted from touhouAI]:

for 4GB users, lmao it wont work
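One way to sanity-check the free-VRAM point before kicking off an engine build is to ask `nvidia-smi`. A sketch: the helper names are my own, and the 4096 MiB threshold is just the rough figure mentioned above, not a documented requirement:

```python
import subprocess

def parse_free_mib(smi_output: str) -> int:
    """Parse output of `nvidia-smi --query-gpu=memory.free
    --format=csv,noheader,nounits` (one MiB value per GPU line);
    returns the first GPU's free memory in MiB."""
    return int(smi_output.strip().splitlines()[0])

def free_vram_mib() -> int:
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits"],
        text=True)
    return parse_free_mib(out)

if __name__ == "__main__":
    free = free_vram_mib()
    # Rough threshold from the discussion above; a guess, not a spec.
    if free < 4096:
        print(f"Only {free} MiB free -- TensorRT engine build will likely OOM")
```

On a 4 GB card this check will essentially always fail once the OS and display have taken their cut, which matches the experience in this thread.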