comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/

Error when combining more than 2 models in fp8 #4989

Closed GregorioBrc closed 2 months ago

GregorioBrc commented 2 months ago

Expected Behavior

Load the four models, merge the first two, then the other two, and finally merge those two results and save the final checkpoint.
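For context, this is a binary merge tree: each merge node conceptually computes a weighted average of two state dicts. A minimal sketch in plain PyTorch (the simple linear interpolation and the merge function are illustrative assumptions, not the exact node implementation):

import torch

def merge(sd_a: dict, sd_b: dict, ratio: float = 0.5) -> dict:
    # Linear interpolation of matching weights: ratio * A + (1 - ratio) * B.
    return {k: ratio * sd_a[k] + (1.0 - ratio) * sd_b[k] for k in sd_a}

# Dummy state dicts standing in for the four loaded checkpoints.
sds = [{"w": torch.randn(2, 2)} for _ in range(4)]
ab = merge(sds[0], sds[1])   # first pair
cd = merge(sds[2], sds[3])   # second pair
final = merge(ab, cd)        # merge of the two merges, then saved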

Actual Behavior

The four models are loaded and merged, but when execution reaches the Save Checkpoint node the error occurs (see attached screenshots).

Steps to Reproduce

Error.json

I am using Google Colab, and I start the program with these parameters: !python main.py --highvram --fp8_e4m3fn-text-enc --fp8_e4m3fn-unet

Debug Logs

Done
https://chest-carried-registrar-restrictions.trycloudflare.com
/content
Total VRAM 15102 MB, total RAM 12979 MB
pytorch version: 2.4.1+cu121
Set vram state to: HIGH_VRAM
Device: cuda:0 Tesla T4 : cudaMallocAsync
Using pytorch cross attention
[Prompt Server] web root: /content/ComUI/web
/usr/local/lib/python3.10/dist-packages/kornia/feature/lightglue.py:44: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @torch.cuda.amp.custom_fwd(cast_inputs=torch.float32)

Import times for custom nodes:
   0.0 seconds: /content/ComUI/custom_nodes/websocket_image_save.py

Starting server

To see the GUI go to: http://127.0.0.1:8188
fatal: not a git repository (or any of the parent directories): .git
Failed to get ComfyUI version: Command '['git', 'describe', '--tags']' returned non-zero exit status 128.
got prompt
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
model weight dtype torch.float8_e4m3fn, manual cast: torch.float16
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
loaded completely 0.0 2448.5241737365723 True
Requested to load SDXLClipModel
Loading 1 new model
!!! Exception during processing !!! "mul_cpu_reduced_float" not implemented for 'Float8_e4m3fn'
Traceback (most recent call last):
  File "/content/ComUI/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/content/ComUI/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/content/ComUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/content/ComUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
  File "/content/ComUI/CFUI_extras/nodes_model_merging.py", line 242, in save
    save_checkpoint(model, clip=clip, vae=vae, filename_prefix=filename_prefix, output_dir=self.output_dir, prompt=prompt, extra_pnginfo=extra_pnginfo)
  File "/content/ComUI/CFUI_extras/nodes_model_merging.py", line 222, in save_checkpoint
    CFUI.sd.save_checkpoint(output_checkpoint, model, clip, vae, clip_vision, metadata=metadata, extra_keys=extra_keys)
  File "/content/ComUI/CFUI/sd.py", line 680, in save_checkpoint
    load_models.append(clip.load_model())
  File "/content/ComUI/CFUI/sd.py", line 157, in load_model
    model_management.load_model_gpu(self.patcher)
  File "/content/ComUI/CFUI/model_management.py", line 559, in load_model_gpu
    return load_models_gpu([model])
  File "/content/ComUI/CFUI/model_management.py", line 545, in load_models_gpu
    cur_loaded_model = loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
  File "/content/ComUI/CFUI/model_management.py", line 327, in model_load
    raise e
  File "/content/ComUI/CFUI/model_management.py", line 323, in model_load
    self.real_model = self.model.patch_model(device_to=patch_model_to, lowvram_model_memory=lowvram_model_memory, load_weights=load_weights, force_patch_weights=force_patch_weights)
  File "/content/ComUI/CFUI/model_patcher.py", line 427, in patch_model
    self.load(device_to, lowvram_model_memory=lowvram_model_memory, force_patch_weights=force_patch_weights, full_load=full_load)
  File "/content/ComUI/CFUI/model_patcher.py", line 393, in load
    self.patch_weight_to_device(weight_key, device_to=device_to)
  File "/content/ComUI/CFUI/model_patcher.py", line 323, in patch_weight_to_device
    out_weight = CFUI.lora.calculate_weight(self.patches[key], temp_weight, key)
  File "/content/ComUI/CFUI/lora.py", line 417, in calculate_weight
    v = (calculate_weight(v[1:], v[0].clone(), key, intermediate_dtype=intermediate_dtype), )
  File "/content/ComUI/CFUI/lora.py", line 414, in calculate_weight
    weight *= strength_model
RuntimeError: "mul_cpu_reduced_float" not implemented for 'Float8_e4m3fn'

Prompt executed in 218.64 seconds
fatal: not a git repository (or any of the parent directories): .git
Failed to get ComfyUI version: Command '['git', 'describe', '--tags']' returned non-zero exit status 128.
got prompt
Requested to load SDXLClipModel
Loading 1 new model
!!! Exception during processing !!! "mul_cpu_reduced_float" not implemented for 'Float8_e4m3fn'
Traceback (most recent call last):
  File "/content/ComUI/execution.py", line 323, in execute
    output_data, output_ui, has_subgraph = get_output_data(obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/content/ComUI/execution.py", line 198, in get_output_data
    return_values = _map_node_over_list(obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb)
  File "/content/ComUI/execution.py", line 169, in _map_node_over_list
    process_inputs(input_dict, i)
  File "/content/ComUI/execution.py", line 158, in process_inputs
    results.append(getattr(obj, func)(**inputs))
  File "/content/ComUI/CFUI_extras/nodes_model_merging.py", line 242, in save
    save_checkpoint(model, clip=clip, vae=vae, filename_prefix=filename_prefix, output_dir=self.output_dir, prompt=prompt, extra_pnginfo=extra_pnginfo)
  File "/content/ComUI/CFUI_extras/nodes_model_merging.py", line 222, in save_checkpoint
    CFUI.sd.save_checkpoint(output_checkpoint, model, clip, vae, clip_vision, metadata=metadata, extra_keys=extra_keys)
  File "/content/ComUI/CFUI/sd.py", line 680, in save_checkpoint
    load_models.append(clip.load_model())
  File "/content/ComUI/CFUI/sd.py", line 157, in load_model
    model_management.load_model_gpu(self.patcher)
  File "/content/ComUI/CFUI/model_management.py", line 559, in load_model_gpu
    return load_models_gpu([model])
  File "/content/ComUI/CFUI/model_management.py", line 545, in load_models_gpu
    cur_loaded_model = loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
  File "/content/ComUI/CFUI/model_management.py", line 327, in model_load
    raise e
  File "/content/ComUI/CFUI/model_management.py", line 323, in model_load
    self.real_model = self.model.patch_model(device_to=patch_model_to, lowvram_model_memory=lowvram_model_memory, load_weights=load_weights, force_patch_weights=force_patch_weights)
  File "/content/ComUI/CFUI/model_patcher.py", line 427, in patch_model
    self.load(device_to, lowvram_model_memory=lowvram_model_memory, force_patch_weights=force_patch_weights, full_load=full_load)
  File "/content/ComUI/CFUI/model_patcher.py", line 393, in load
    self.patch_weight_to_device(weight_key, device_to=device_to)
  File "/content/ComUI/CFUI/model_patcher.py", line 323, in patch_weight_to_device
    out_weight = CFUI.lora.calculate_weight(self.patches[key], temp_weight, key)
  File "/content/ComUI/CFUI/lora.py", line 417, in calculate_weight
    v = (calculate_weight(v[1:], v[0].clone(), key, intermediate_dtype=intermediate_dtype), )
  File "/content/ComUI/CFUI/lora.py", line 414, in calculate_weight
    weight *= strength_model
RuntimeError: "mul_cpu_reduced_float" not implemented for 'Float8_e4m3fn'

Prompt executed in 1.03 seconds
fatal: not a git repository (or any of the parent directories): .git
Failed to get ComfyUI version: Command '['git', 'describe', '--tags']' returned non-zero exit status 128.

Stopped server
Exception ignored in atexit callback: <function dump_compile_times at 0x7fe070cf5750>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 335, in dump_compile_times
    log.info(compile_times(repr="str", aggregate=True))
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 322, in compile_times
    out += tabulate(rows, headers=("Function", "Runtimes (s)"))
  File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/utils.py", line 127, in tabulate
    import tabulate
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1002, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 945, in _find_spec
  File "<frozen importlib._bootstrap>", line 750, in find_spec
KeyboardInterrupt:

Other

No response
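The root cause is visible at the bottom of both tracebacks: weight *= strength_model in lora.py performs an in-place multiply on a CPU tensor that is still in torch.float8_e4m3fn, and PyTorch 2.4.x (the version in the log) has no fp8 multiply kernel on CPU. A minimal standalone reproduction, independent of ComfyUI:

import torch

# In-place scalar multiply on a CPU fp8 tensor, as in lora.py line 414.
weight = torch.randn(4).to(torch.float8_e4m3fn)
try:
    weight *= 0.5
except RuntimeError as e:
    # "mul_cpu_reduced_float" not implemented for 'Float8_e4m3fn'
    print(e)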

comfyanonymous commented 2 months ago

Should be fixed now.
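For anyone pinned to an older commit, the usual workaround pattern for this class of error is to do the patch arithmetic in an intermediate dtype and cast back afterwards. A hedged sketch (scale_weight is a hypothetical helper for illustration, not the actual fix that landed):

import torch

def scale_weight(weight: torch.Tensor, strength: float,
                 intermediate_dtype: torch.dtype = torch.float32) -> torch.Tensor:
    # fp8 tensors lack most CPU kernels, so upcast, multiply, and cast back.
    if weight.dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
        return (weight.to(intermediate_dtype) * strength).to(weight.dtype)
    return weight * strength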

Soupcheese commented 2 months ago

I had RuntimeError: "mul_cpu_reduced_float" not implemented for 'Float8_e4m3fn' as well. It was because I had changed some parameters in the ModelMergeFlux1 node; I changed everything back to default and it worked again.