[X] I have searched the existing issues and checked the recent builds/commits
What would your feature do?
This is an attempt to speed up --lowvram by taking the model transfers out of the forward loop.
Model transfers are made asynchronous by creating a separate CUDA stream dedicated to moving the model and using CUDA events to synchronize back to the default stream.
A lookahead buffer lets the model transfers run ahead of the forward pass, so the GPU always has something to do in the meantime.
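To illustrate the control flow only, here is a minimal sketch in plain Python. It is not the prototype's code: a background thread stands in for the dedicated CUDA stream, `threading.Event` stands in for CUDA events, and a semaphore stands in for the lookahead buffer. The name `run_with_lookahead` and the idea of "loading" a layer by marking it resident are assumptions made for the example.

```python
import threading

def run_with_lookahead(layers, x, lookahead=2):
    """Run `layers` in order while a mover thread loads them ahead of use.

    Hypothetical stand-ins: the mover thread plays the copy stream, each
    threading.Event plays a CUDA event, `resident` plays VRAM contents.
    Returns the final output and the peak number of resident layers.
    """
    if lookahead < 1:
        raise ValueError("lookahead must be at least 1")
    n = len(layers)
    ready = [threading.Event() for _ in range(n)]  # one "event" per layer
    slots = threading.Semaphore(lookahead)         # free slots in the buffer
    resident = set()
    lock = threading.Lock()
    peak = 0

    def mover():
        nonlocal peak
        for i in range(n):
            slots.acquire()              # wait until the buffer has room
            with lock:
                resident.add(i)          # "copy" layer i to the device
                peak = max(peak, len(resident))
            ready[i].set()               # "record event": layer i is loaded

    thread = threading.Thread(target=mover, daemon=True)
    thread.start()
    for i, layer in enumerate(layers):
        ready[i].wait()                  # "wait event" before using layer i
        x = layer(x)
        with lock:
            resident.discard(i)          # evict the layer once it has run
        slots.release()                  # let the mover load another layer
    thread.join()
    return x, peak
```

The forward loop only ever waits on the event of the layer it is about to use, while the mover keeps loading up to `lookahead` layers ahead. In a real implementation the copies and the events would live on the GPU side rather than in Python threads.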
Proposed workflow
This is still a prototype, and not all of the original semantics are preserved.
CUDA streams and CUDA events are used, and they are CUDA-specific. I think IPEX has something similar, but DML has nothing comparable.
The size of the lookahead buffer is a tweakable setting. A larger buffer increases VRAM usage; a smaller buffer would probably make the forward pass a bit slower. The speedup gained from a larger buffer has a limit.
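The buffer-size trade-off can be seen in a toy timing model. This is a sketch under simplifying assumptions, not a measurement: every layer takes the same `transfer` and `compute` time, transfers are serialized on one copy stream, and a buffer slot is freed as soon as its layer finishes computing. The function name `pipeline_time` is made up for the example.

```python
def pipeline_time(n_layers, transfer, compute, lookahead):
    """Total time for a forward pass with a lookahead buffer (toy model).

    At most `lookahead` layers are resident at once, so layer i can start
    loading only after layer i - lookahead has finished computing.
    """
    if lookahead < 1:
        raise ValueError("lookahead must be at least 1")
    load_done = []  # time at which each layer's transfer completes
    comp_done = []  # time at which each layer's compute completes
    for i in range(n_layers):
        prev_load = load_done[i - 1] if i > 0 else 0
        slot_free = comp_done[i - lookahead] if i >= lookahead else 0
        load_done.append(max(prev_load, slot_free) + transfer)
        prev_comp = comp_done[i - 1] if i > 0 else 0
        comp_done.append(max(prev_comp, load_done[i]) + compute)
    return comp_done[-1]
```

In this model, 10 layers with `transfer=1` and `compute=2` take 30 time units with `lookahead=1` (fully serial) and 21 with `lookahead=2`. Raising the buffer further stays at 21, since the forward pass itself becomes the bottleneck, while each extra slot still costs one more resident layer of VRAM.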
Additional information
No response