chrismathew99 opened 8 months ago
The proposed feature aims to improve the performance of the `--lowvram` option by introducing asynchronous model moving, using a separate CUDA stream and a lookahead buffer. The goal is to minimize GPU idle time by moving each part of the model to the GPU ahead of its forward pass, allowing computation to proceed continuously. The lookahead buffer serves as a preloading zone for model states, and its size can be tuned to balance VRAM usage against generation speed.
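In PyTorch terms, the core overlap pattern looks roughly like the following (a minimal sketch of the idea, not code from the actual patch; `prefetch` and `consume` are illustrative names):

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for host-to-device moves

def prefetch(t: torch.Tensor):
    # Queue the copy on the side stream and record an event when it is done.
    # For the copy to truly overlap compute, the source tensor should live
    # in pinned memory (t.pin_memory()).
    done = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        gpu_t = t.to("cuda", non_blocking=True)
        done.record()
    return gpu_t, done

def consume(gpu_t: torch.Tensor, done: torch.cuda.Event) -> torch.Tensor:
    # The default (compute) stream waits on the event; the CPU never blocks.
    torch.cuda.current_stream().wait_event(done)
    return gpu_t * 2  # stand-in for the actual forward computation
```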
To implement this feature, the following steps should be taken across the various files:
- `modules/devices.py`:
  - Add an `AsyncModelMover` class to manage the dedicated CUDA stream and events (a sketch follows this list).
  - Instantiate `AsyncModelMover` and add functions to handle asynchronous model moving and synchronization.
  - Update the `torch_gc` function to synchronize the new stream.
- `modules/img2img.py`:
- `modules/lowvram.py`:
  - Modify the `send_me_to_gpu` function to use the new stream and events.
  - Modify the `setup_for_low_vram` function to preload models into the lookahead buffer.
- `modules/txt2img.py`:
  - Update the `StableDiffusionProcessingTxt2Img` class to handle asynchronous model moving and lookahead buffer integration.
- `modules/processing.py`:
  - Modify the `sample` method in the processing classes to handle asynchronous model moving and the lookahead buffer.
  - Update the `init` and `close` methods to manage the new asynchronous logic and resources.
- `modules/sd_models.py`:
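As a rough sketch of how the pieces above might fit together (hypothetical code following the shape described in the list, not the actual patch), `AsyncModelMover` owns the copy stream plus one event per in-flight module, and a `send_me_to_gpu`-style forward pre-hook waits on the event instead of performing a blocking `.to()`:

```python
import torch

class AsyncModelMover:
    """Sketch of the proposed class: one dedicated copy stream and one CUDA
    event per in-flight module, so compute waits only when it has to."""

    def __init__(self, device="cuda"):
        self.device = torch.device(device)
        self.stream = torch.cuda.Stream(device=self.device)
        self.events: dict[torch.nn.Module, torch.cuda.Event] = {}

    def move_async(self, module: torch.nn.Module) -> None:
        if module in self.events:
            return  # already queued
        event = torch.cuda.Event()
        with torch.cuda.stream(self.stream):
            # Parameters should sit in pinned CPU memory for real overlap.
            module.to(self.device, non_blocking=True)
            event.record()
        self.events[module] = event

    def wait_for(self, module: torch.nn.Module) -> None:
        event = self.events.pop(module, None)
        if event is not None:
            # Stream-level wait: the GPU orders the work, the CPU keeps going.
            torch.cuda.current_stream().wait_event(event)

    def synchronize(self) -> None:
        # For torch_gc: don't free memory while a copy is still in flight.
        self.stream.synchronize()

mover = AsyncModelMover()

def send_me_to_gpu(module, _args):
    # Forward pre-hook in the spirit of lowvram's send_me_to_gpu: instead of
    # a blocking module.to(device), just wait on the prefetch event.
    mover.wait_for(module)
```

The stream-level `wait_event` is the key detail: the CPU never blocks, so it can keep enqueueing both compute kernels and the next copies.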
By implementing these changes, the feature should provide a speedup for the `--lowvram` option by ensuring that the GPU always has work queued, reducing idle time and potentially improving overall generation speed.
**What would your feature do?**
This is an attempt to speed up `--lowvram` by taking the model moving out of the forward loop. The model moving is made asynchronous by creating a separate CUDA stream dedicated to moving the model, and using CUDA events to synchronize back to the default stream. A lookahead buffer zone is designed so that model moving stays ahead of the forward phase, meaning the GPU always has something to do in the meantime.
**Proposed workflow**
This is still a prototype, and not all of the original semantics are followed. CUDA streams and CUDA events are used; they are CUDA-specific. I think there are similar facilities on IPEX, but nothing comparable on DML. The size of the lookahead buffer is a tweakable setting: a larger buffer increases VRAM usage, while a smaller buffer would probably make the forward pass a bit slower. The generation speed gained from a larger buffer has a limit.
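To make the trade-off concrete, a lookahead loop built on the `AsyncModelMover` sketch above might look like this (again hypothetical; `run_blocks` is an illustrative name, and eviction back to the CPU is shown synchronously for brevity):

```python
lookahead_size = 2  # the tweakable setting: more VRAM vs. fewer stalls

def run_blocks(blocks, x, mover):
    # Keep up to `lookahead_size` upcoming blocks in flight on the copy
    # stream while the current block computes on the default stream.
    for i, block in enumerate(blocks):
        for upcoming in blocks[i : i + lookahead_size]:
            mover.move_async(upcoming)  # no-op if already queued
        mover.wait_for(block)           # compute stream waits on the copy
        x = block(x)
        block.to("cpu")                 # evict to keep VRAM bounded
    return x
```

Once the copy stream consistently runs ahead of compute, extra buffer capacity adds VRAM cost without further speedup, which is why the gain from a larger buffer saturates.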