Different FPS with same model and parameters.

The difference is around 5 FPS, tested with sd_turbo and a batch size of 1. Inspecting the TensorRT engines shows that the faster engine contains different layer information. The file sizes also differ: the slower engine is 2053166 KB and the faster engine is 2079951 KB, and the VAE engine also differs by about 100 KB. I got the faster result only once; since then, every engine I build runs at the slower speed. I even tried with a clean venv.

slower engine:

"Layers": [{
  "Name": "/conv_in/Cast",
  "LayerType": "NoOp",
  "Inputs": [
  {
    "Name": "sample",
    "Location": "Device",
    "Dimensions": [1,4,64,64],
    "Format/Datatype": "Row major linear FP32"
  }],
  "Outputs": [
  {
    "Name": "/conv_in/Cast_output_0",
    "Location": "Device",
    "Dimensions": [1,4,64,64],
    "Format/Datatype": "Row major linear FP32"
  }],
  "TacticValue": "0x0000000000000000",
  "StreamId": 0,
  "Metadata": ""
}

faster engine:

{"Layers": [{
  "Name": "/conv_in/Cast",
  "LayerType": "Reformat",
  "Inputs": [
  {
    "Name": "sample",
    "Location": "Device",
    "Dimensions": [1,4,64,64],
    "Format/Datatype": "Row major linear FP32"
  }],
  "Outputs": [
  {
    "Name": "/conv_in/Cast_output_0",
    "Location": "Device",
    "Dimensions": [1,4,64,64],
    "Format/Datatype": "Channel major FP16 format where channel % 8 == 0"
  }],
  "ParameterType": "Reformat",
  "Origin": "CAST",
  "TacticValue": "0x00000000000003e8",
  "StreamId": 0,
  "Metadata": ""
}
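
To find every layer that differs between the two builds, not only the first one, the two inspector dumps can be compared programmatically. Below is a minimal sketch, assuming both dumps are JSON objects with a top-level "Layers" list whose entries carry the "Name", "LayerType", "Outputs" and "TacticValue" fields shown above. The helper names (`index_layers`, `output_formats`, `diff_engines`) and the trimmed sample data are hypothetical and are not part of StreamDiffusion or TensorRT.

```python
import json


def index_layers(engine_info):
    """Map layer name -> layer dict, skipping entries that are not dicts."""
    return {
        layer["Name"]: layer
        for layer in engine_info.get("Layers", [])
        if isinstance(layer, dict) and "Name" in layer
    }


def output_formats(layer):
    """Collect the Format/Datatype string of every output tensor."""
    return [out.get("Format/Datatype") for out in layer.get("Outputs", [])]


def diff_engines(slow_info, fast_info):
    """Return (layer name, changes) for layers present in both dumps that differ."""
    slow, fast = index_layers(slow_info), index_layers(fast_info)
    diffs = []
    for name in sorted(slow.keys() & fast.keys()):
        a, b = slow[name], fast[name]
        changes = {}
        for key in ("LayerType", "TacticValue"):
            if a.get(key) != b.get(key):
                changes[key] = (a.get(key), b.get(key))
        if output_formats(a) != output_formats(b):
            changes["OutputFormats"] = (output_formats(a), output_formats(b))
        if changes:
            diffs.append((name, changes))
    return diffs


# Trimmed copies of the two excerpts from this report (assumed structure).
SLOW = {"Layers": [{
    "Name": "/conv_in/Cast",
    "LayerType": "NoOp",
    "Outputs": [{"Name": "/conv_in/Cast_output_0",
                 "Format/Datatype": "Row major linear FP32"}],
    "TacticValue": "0x0000000000000000",
}]}
FAST = {"Layers": [{
    "Name": "/conv_in/Cast",
    "LayerType": "Reformat",
    "Outputs": [{"Name": "/conv_in/Cast_output_0",
                 "Format/Datatype": "Channel major FP16 format where channel % 8 == 0"}],
    "TacticValue": "0x00000000000003e8",
}]}

if __name__ == "__main__":
    for name, changes in diff_engines(SLOW, FAST):
        print(name)
        print(json.dumps(changes, indent=2))
```

With the full dumps saved to disk, `SLOW` and `FAST` could be replaced by `json.load(open(path))` on each file. A long list of layers that changed from FP32 to FP16 formats would suggest that the two builds picked different precisions or tactics, rather than differing only in the first cast.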

cumulo-autumn / StreamDiffusion
