comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Performance degradation when loading loras #5696

Open lackofdream opened 1 day ago

lackofdream commented 1 day ago

Expected Behavior

n/a

Actual Behavior

n/a

Steps to Reproduce

Debug Logs

n/a

Other

Recently I noticed an increase in CPU usage when loading LoRAs, especially when applying three or more. After some `git bisect` debugging I'm sure the cause is commit 67158994a4356d0ec54aaf3bbc5619c6c119f540.
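For anyone who wants to repeat this kind of bisect, here is a toy, self-contained sketch of the approach. The repo, file names, and the "slow" marker are all made up for illustration; in the real session the test step was running the same LoRA workflow at each bisect point and watching CPU usage, not grepping a file.

```shell
#!/bin/sh
# Toy bisect demo: build a throwaway repo where the "regression"
# lands in commit 4, then let git bisect run find it automatically.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email you@example.com
git config user.name you

for i in 1 2 3 4 5; do
    echo "change $i" > code.txt
    # pretend the regression lands in commit 4 and persists afterwards
    if [ "$i" -ge 4 ]; then echo slow > perf.txt; else echo fast > perf.txt; fi
    git add .
    git commit -qm "commit $i"
done

first=$(git rev-list --max-parents=0 HEAD)
git bisect start HEAD "$first" > /dev/null

# bisect run marks a commit good when the command exits 0;
# extract the hash from the "<hash> is the first bad commit" line
bad=$(git bisect run sh -c 'grep -q fast perf.txt' \
      | sed -n 's/^\(.*\) is the first bad commit$/\1/p')
git bisect reset > /dev/null
echo "first bad commit: $bad"
```

In the real case the per-commit check is manual (run the workflow, compare CPU profiles), so plain `git bisect good` / `git bisect bad` replaces the `git bisect run` step.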

Before: (image)

After, with the same workload: (image)

Tracing results show that a large share of CPU time is spent on this line: https://github.com/comfyanonymous/ComfyUI/blob/67158994a4356d0ec54aaf3bbc5619c6c119f540/comfy/model_management.py#L851

Before this commit, the cast_to_device function basically called Tensor.to(), which I think is faster than the copy_ method.
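My guess at why the copy_ path costs more (an assumption, not verified against the commit): Tensor.to() is documented to return self with no copy when the tensor already has the requested device and dtype, whereas allocating a destination tensor and calling copy_ always performs an allocation plus a full element copy. With --gpu-only the weights are already on the GPU, so the old path could often be a no-op. A toy stand-in (plain Python, not real torch) for that behavioral difference:

```python
class FakeTensor:
    """Minimal stand-in for a torch tensor; only models device placement."""
    def __init__(self, device, payload):
        self.device = device
        self.payload = payload

    def to(self, device):
        # Mirrors torch semantics: Tensor.to() returns self (no copy)
        # when the tensor is already on the requested device.
        if device == self.device:
            return self
        return FakeTensor(device, list(self.payload))

    def copy_(self, src):
        # copy_ always transfers element data, even device-to-device.
        self.payload = list(src.payload)
        return self


def cast_via_to(t, device):
    # old-style path: may return the input unchanged
    return t.to(device)


def cast_via_copy(t, device):
    # new-style path: fresh allocation + unconditional copy
    out = FakeTensor(device, [None] * len(t.payload))
    return out.copy_(t)


w = FakeTensor("cuda:0", [1.0, 2.0, 3.0])
assert cast_via_to(w, "cuda:0") is w        # no-op: same object back
assert cast_via_copy(w, "cuda:0") is not w  # always a new buffer + copy
```

If this is right, the regression would be most visible exactly in setups like mine, where source and destination device already match for every LoRA weight.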

This regression significantly affects workloads with multiple LoRAs, especially on systems where CPU resources are already constrained.

Edit: I'm using the --gpu-only option, in case that's related.