
[WIP] Asynchronous model mover for lowvram #10

Open chrismathew99 opened 8 months ago

chrismathew99 commented 8 months ago

What would your feature do?

This is an attempt to speed up --lowvram by taking the model moving out of the forward loop. The model moving is made asynchronous by creating a separate CUDA stream dedicated to moving the model, and using a CUDA event to synchronize back to the default stream. A lookahead buffer zone is designed so that the model moving stays ahead of the forward pass, meaning the GPU always has something to do.
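
A minimal sketch of the core pattern (illustrative names; it uses PyTorch's stream/event API and assumes a CUDA device, and is not code from this PR):

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for weight transfers

def prefetch(module, device):
    """Start moving `module` to `device` on the copy stream; return an event
    the default stream can wait on before using the module."""
    event = torch.cuda.Event()
    with torch.cuda.stream(copy_stream):
        # non_blocking only overlaps if the host tensors are in pinned memory
        module.to(device, non_blocking=True)
        event.record()  # records on copy_stream, the current stream here
    return event

# later, on the default stream, just before the forward pass:
#   event.wait()  # the default stream waits for the transfer; the host does not block
```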

Proposed workflow

This is still a prototype, and not all of the original semantics are preserved. CUDA streams and CUDA events are used; they are CUDA-specific. I think there are similar primitives on IPEX, but nothing similar on DML. The size of the lookahead buffer is a tweakable setting: a larger buffer increases VRAM usage, while a smaller buffer would probably make the forward pass a bit slower. The speedup gained from a larger buffer plateaus beyond a certain size.
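
For example, the buffer size could be surfaced as a command-line option next to the existing low-VRAM flags. This is a hypothetical sketch; the flag name and default are illustrative, not part of this PR:

```python
# Hypothetical addition alongside the existing --lowvram/--medvram flags
# (modules/cmd_args.py in current webui layouts)
parser.add_argument(
    "--lowvram-lookahead",
    type=int,
    default=2,
    help="number of model blocks to prefetch onto the GPU ahead of the forward pass",
)
```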

codeautopilot[bot] commented 8 months ago

Potential solution

The proposed feature aims to improve the performance of the --lowvram option by introducing asynchronous model moving using a separate CUDA stream and lookahead buffer. The reasoning behind this solution is to minimize GPU idle time by ensuring that the model is moved to the GPU ahead of the forward pass, allowing for continuous computation. The lookahead buffer serves as a preloading zone for model states, which can be adjusted for a balance between VRAM usage and generation speed.

How to implement

To implement this feature, the following steps should be taken across the various files:

1. `modules/devices.py`:
   - Import the CUDA modules and define the `AsyncModelMover` class to manage the CUDA stream and events.
   - Instantiate the `AsyncModelMover` and add functions to handle asynchronous model moving and synchronization.
   - Update the `torch_gc` function to synchronize the new stream.
2. `modules/img2img.py`:
   - Integrate the asynchronous model mover by modifying the processing functions to use the new CUDA stream and events.
   - Implement the lookahead buffer within the image processing pipeline.
   - Ensure synchronization and error handling for the asynchronous operations.
3. `modules/lowvram.py`:
   - Create a dedicated CUDA stream and integrate CUDA events for synchronization.
   - Modify the `send_me_to_gpu` function to use the new stream and events.
   - Update the `setup_for_low_vram` function to preload models into the lookahead buffer.
4. `modules/txt2img.py`:
   - Modify the `StableDiffusionProcessingTxt2Img` class to handle asynchronous model moving and lookahead buffer integration.
   - Adjust the processing loop to initiate model moves ahead of time and wait for the completion event.
   - Add error handling and synchronization to ensure the correct model state is used for each forward pass.
5. `modules/processing.py`:
   - Modify the `sample` method in the processing classes to handle asynchronous model moving and the lookahead buffer.
   - Update the `init` and `close` methods to manage the new asynchronous logic and resources.
   - Test the changes to ensure improved performance and stability.
6. `modules/sd_models.py`:
   - Create a new CUDA stream for model moving and integrate CUDA events for synchronization.
   - Update model loading and unloading logic to work with the asynchronous stream and lookahead buffer.
   - Add configuration options for the lookahead buffer size and thoroughly test the new logic.

By implementing these changes, the feature should provide a speedup for the --lowvram option by ensuring that the GPU has continuous work, thus reducing idle time and potentially improving overall generation speed.


Files used for this task:

Changes on modules/devices.py

To implement the asynchronous model moving logic using CUDA streams and events in the `modules/devices.py` file, we need to add functionality to manage the new CUDA stream for model moving and ensure synchronization with the default stream. Here's a practical proposal to modify the `modules/devices.py` file:

1. Import the necessary CUDA modules from PyTorch:

```python
import torch  # needed for torch.cuda.stream below
from torch.cuda import Stream, Event
```

2. Define a new class to manage the asynchronous model moving stream and its synchronization events:

```python
class AsyncModelMover:
    def __init__(self):
        self.stream = Stream()
        self.events = {}

    def record_event(self, event_key):
        # record a completion marker on the mover stream
        event = Event(blocking=False)
        event.record(self.stream)
        self.events[event_key] = event

    def wait_for_event(self, event_key):
        # make the current (default) stream wait for the recorded event
        if event_key in self.events:
            self.events[event_key].wait()

    def synchronize(self):
        # block the host until all queued moves have finished
        self.stream.synchronize()
```

3. Instantiate the `AsyncModelMover` class globally within the file:

```python
async_model_mover = AsyncModelMover()
```

4. Add functions to manage the asynchronous stream and events:

```python
def move_model_to_device_async(model, device):
    with torch.cuda.stream(async_model_mover.stream):
        # non_blocking is required for the copy to actually overlap;
        # it only takes effect for pinned host memory
        model.to(device, non_blocking=True)

def record_model_move_event(event_key):
    async_model_mover.record_event(event_key)

def wait_for_model_move_event(event_key):
    async_model_mover.wait_for_event(event_key)

def synchronize_model_move():
    async_model_mover.synchronize()
```

5. Update the `torch_gc` function to also synchronize the asynchronous model mover stream:

```python
def torch_gc():
    # ... existing code ...
    # Synchronize the asynchronous model mover stream
    async_model_mover.synchronize()
```

6. Ensure that any other functions that interact with the device or perform model moving operations are updated to use the new asynchronous stream and synchronization methods where appropriate.

By making these changes, we introduce an asynchronous model moving mechanism that can be utilized by other parts of the codebase to improve performance when using the `--lowvram` option. The `AsyncModelMover` class manages a dedicated CUDA stream for moving models and uses CUDA events to synchronize with the default stream, ensuring that the GPU can continue processing without unnecessary stalls.
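
For illustration, a caller could use these helpers as follows (the helper names are the ones proposed above, not existing APIs in the codebase):

```python
# enqueue the copy on the side stream, tag its completion, then make the
# default stream wait just before the weights are needed
move_model_to_device_async(model, device)
record_model_move_event("unet")
# ... unrelated GPU work can proceed here ...
wait_for_model_move_event("unet")
```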
Changes on modules/img2img.py

The `img2img.py` module is responsible for handling image-to-image transformations, including various modes such as sketching, inpainting, and batch processing. To integrate the asynchronous model mover and lookahead buffer mechanism, you would need to modify the processing functions to ensure they can operate with the new asynchronous model moving logic. Here's a practical proposal for integrating the asynchronous model mover into the `img2img.py` module:

1. **Identify Points of Model Usage**: Determine where in the `img2img.py` code the model is being loaded and used. This typically happens during the call to `process_images(p)` or `modules.scripts.scripts_img2img.run(p, *args)`.
2. **Asynchronous Model Moving**: Introduce a new function or modify existing ones to move the model to and from the GPU asynchronously. This will involve creating a separate CUDA stream for model moving and using CUDA events to synchronize with the default stream.
3. **Lookahead Buffer Integration**: Implement a lookahead buffer that preloads the model onto the GPU before it is needed for the next batch of image processing. This buffer should be large enough to ensure that the GPU is not idle waiting for the model, but not so large that it significantly increases VRAM usage.
4. **Synchronization**: Ensure that the image processing pipeline waits for the model to be fully loaded on the GPU before starting the forward pass. This can be done by waiting on the CUDA event that signals the completion of the model moving.
5. **Error Handling**: Add error handling for cases where the asynchronous model moving fails or takes longer than expected, which could cause the forward pass to attempt to start without the model being ready.
6. **Testing**: Thoroughly test the new asynchronous model moving logic with various batch sizes, image sizes, and model complexities to ensure that it works correctly and provides the expected speedup.
7. **Documentation**: Update the documentation within the `img2img.py` module to explain how the asynchronous model mover works and any new parameters or settings introduced.
8. **User Feedback**: Consider adding logging or progress indicators to inform the user about the status of the model moving and processing, especially since these operations are now asynchronous.
9. **Compatibility**: Ensure that the changes made for asynchronous model moving are compatible with the rest of the codebase, including any scripts or additional modules that rely on `img2img.py`.
10. **Performance Metrics**: After implementation, measure the performance improvements in terms of processing speed and VRAM usage to validate the effectiveness of the asynchronous model mover and lookahead buffer.

Remember that this is a high-level proposal, and the actual implementation will require a detailed understanding of the existing codebase and careful coding to ensure that the asynchronous operations are correctly managed and synchronized.
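
A minimal sketch of steps 2 and 4, assuming the `devices.py` helpers proposed earlier in this thread (the helper and function names here are part of this proposal, not the existing codebase):

```python
from modules import devices
from modules.processing import process_images

def process_with_async_mover(p, event_key="img2img-model"):
    # block the default stream until the prefetched weights are resident
    devices.wait_for_model_move_event(event_key)
    processed = process_images(p)      # existing pipeline call
    devices.synchronize_model_move()   # drain the copy stream before returning
    return processed
```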
Changes on modules/lowvram.py

To implement the asynchronous model moving logic using CUDA streams and events, and integrate the lookahead buffer mechanism in the `modules/lowvram.py` file, we need to modify the `setup_for_low_vram` function and potentially other parts of the module. Here's a practical proposal for how to approach this task:

1. **Create a dedicated CUDA stream for model moving**: We will need to create a new CUDA stream that will be responsible for moving the model between the CPU and GPU. This stream will be used to perform the model moving operations asynchronously.
2. **Integrate CUDA events for synchronization**: To ensure that the model is ready on the GPU when needed, we will use CUDA events to synchronize between the model moving stream and the default stream (which is used for the forward pass).
3. **Implement the lookahead buffer**: The lookahead buffer is a mechanism that will allow us to move the model to the GPU before it is needed for the forward pass. This buffer will be a queue of model states that are pre-loaded onto the GPU.
4. **Modify the `send_me_to_gpu` function**: The `send_me_to_gpu` function will need to be updated to use the new CUDA stream and events for asynchronous execution. It should also interact with the lookahead buffer to ensure that the correct model state is loaded when needed.
5. **Update the model moving logic**: The logic that moves the model between the CPU and GPU will need to be updated to work with the new asynchronous stream and lookahead buffer. This includes handling the pre-loading of model states into the buffer and ensuring that the correct state is used for each forward pass.
6. **Test and debug**: After implementing these changes, thorough testing and debugging will be necessary to ensure that the new asynchronous model moving logic works correctly and provides the expected performance improvements.

Here's a rough outline of the code changes that might be needed:

```python
# At the top of the file, import the necessary CUDA functions
import torch.cuda as cuda

# Create a dedicated CUDA stream for model moving
model_moving_stream = cuda.Stream()

# Create a lookahead buffer (this is a conceptual example)
lookahead_buffer = []

# Modify the send_me_to_gpu function to use the new stream and events
def send_me_to_gpu(module, _):
    global module_in_gpu

    module = parents.get(module, module)

    if module_in_gpu == module:
        return

    # Use the model moving stream for asynchronous execution
    with cuda.stream(model_moving_stream):
        if module_in_gpu is not None:
            # Move the previous module to CPU asynchronously
            module_in_gpu.to(cpu, non_blocking=True)

        # Move the current module to GPU asynchronously
        module.to(devices.device, non_blocking=True)

    module_in_gpu = module

    # Synchronize with the default stream using an event
    event = cuda.Event()
    model_moving_stream.record_event(event)
    event.wait(cuda.default_stream())

# Update the setup_for_low_vram function to preload models into the lookahead buffer
# and to use the new send_me_to_gpu function
# ...

# Additional logic for managing the lookahead buffer and ensuring the correct model
# state is used for each forward pass will also be needed.
# ...
```

Please note that this is a high-level proposal and the actual implementation may require additional considerations, such as error handling, managing the size of the lookahead buffer, and ensuring compatibility with the rest of the system. Additionally, thorough testing is crucial to validate the functionality and performance of the asynchronous model moving logic.
Changes on modules/txt2img.py

To ensure that the `txt2img` function in the `modules/txt2img.py` file works with the asynchronous model mover and lookahead buffer, the following steps should be taken:

1. **Integrate Asynchronous Model Moving Logic:**
   - Modify the `StableDiffusionProcessingTxt2Img` class to accept a new parameter that represents the asynchronous model mover. This mover should be responsible for handling the model's movement between CPU and GPU asynchronously.
   - Ensure that the mover is initialized with the correct CUDA stream and that it is capable of using CUDA events for synchronization.
2. **Lookahead Buffer Integration:**
   - Implement a lookahead buffer within the `StableDiffusionProcessingTxt2Img` class. This buffer will store a certain number of steps ahead of the current processing step, allowing the GPU to work continuously without waiting for the model to be moved.
   - The buffer size should be configurable, and the implementation should take into account the trade-off between VRAM usage and generation speed.
3. **Processing Loop Modifications:**
   - Adjust the processing loop within the `process_images` function (or wherever the main generation loop is located) to work with the asynchronous model mover. This includes initiating model moves ahead of time and waiting for the CUDA event that signals the completion of the move when necessary.
   - Ensure that the loop can handle the lookahead buffer by pre-fetching the required model states into the buffer and using them as needed.
4. **Error Handling and Synchronization:**
   - Add error handling for any asynchronous operations to ensure that any failures in model moving are caught and handled gracefully.
   - Ensure proper synchronization at the end of the processing to make sure that all asynchronous operations have completed before the function returns the results.
5. **Testing and Validation:**
   - After implementing the changes, thoroughly test the `txt2img` function to ensure that it works correctly with the new asynchronous model mover and lookahead buffer.
   - Validate that the generation speed has improved without compromising the quality of the generated images.
6. **Documentation and Comments:**
   - Update the documentation within the `txt2img.py` file to reflect the new asynchronous behavior and the purpose of the lookahead buffer.
   - Add comments throughout the modified code to explain the logic and the flow of asynchronous operations for future maintainers.

By following these steps, the `txt2img` function should be able to leverage the benefits of asynchronous model moving and the lookahead buffer to improve performance in low VRAM scenarios.
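
To make the loop-level idea in steps 2 and 3 concrete, here is a self-contained sketch of a forward loop with a lookahead prefetch. All names are illustrative; `blocks` stands in for whatever sequence of modules the generation loop walks through, and error handling is omitted:

```python
import collections
import torch

def forward_with_lookahead(blocks, x, lookahead=2):
    """Run `blocks` in order while prefetching the next `lookahead` blocks
    onto the GPU on a side stream. Sketch only; assumes lookahead >= 1 and
    that `blocks` is a plain list of nn.Modules."""
    copy_stream = torch.cuda.Stream()
    pending = collections.deque()

    def enqueue(block):
        event = torch.cuda.Event()
        with torch.cuda.stream(copy_stream):
            block.to("cuda", non_blocking=True)
            event.record(copy_stream)
        pending.append((block, event))

    for block in blocks[:lookahead]:           # warm the buffer
        enqueue(block)
    for i in range(len(blocks)):
        block, event = pending.popleft()
        event.wait()                           # default stream waits for the copy
        x = block(x)
        if i + lookahead < len(blocks):
            enqueue(blocks[i + lookahead])     # keep the buffer full
        block.to("cpu")                        # evict after use (simplified; a real
                                               # implementation would evict asynchronously)
    return x
```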
Changes on modules/processing.py

The `modules/processing.py` file contains the core logic for image processing in the Stable Diffusion model. It includes classes and functions for handling the image generation pipeline, including color correction, overlay application, mask creation, and the main processing loop for both text-to-image (`StableDiffusionProcessingTxt2Img`) and image-to-image (`StableDiffusionProcessingImg2Img`) generation.

To integrate the asynchronous model moving and lookahead buffer mechanism, you would need to modify the sampling and processing functions within the `StableDiffusionProcessingTxt2Img` and `StableDiffusionProcessingImg2Img` classes. Specifically, you would need to:

1. Modify the `sample` method in both classes to handle asynchronous model moving. This would involve creating a separate CUDA stream for model moving and using CUDA events to synchronize with the default stream.
2. Introduce a lookahead buffer within the sampling method. This buffer would store a certain number of samples ahead of time, allowing the GPU to continue working while the model is being moved asynchronously.
3. Ensure proper synchronization between the CUDA streams to prevent race conditions and ensure that the model is correctly moved and available for the forward pass when needed.
4. Update the `init` method to accommodate the new asynchronous logic and buffer mechanism, ensuring that all necessary resources are initialized correctly.
5. Adjust the `close` method to properly clean up the CUDA streams and any other resources related to the asynchronous model moving.
6. Throughout the code, ensure that any device-specific operations (such as tensor creation and manipulation) are performed on the correct CUDA stream.
7. Test the changes thoroughly to ensure that the asynchronous model moving and lookahead buffer mechanism improve performance without introducing any new issues or regressions.

It's important to note that these changes would be quite complex and require a deep understanding of CUDA, PyTorch's stream and event system, and the Stable Diffusion model's architecture. Additionally, thorough testing would be necessary to ensure the stability and correctness of the implementation.
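
As a sketch of where those hooks could live, here is an excerpt showing only the async-related additions; the `devices.synchronize_model_move` helper comes from the `devices.py` proposal above and is hypothetical, as is the event attribute name:

```python
import torch
from modules import devices  # per this proposal, devices would own the copy stream

class StableDiffusionProcessingTxt2Img:
    # excerpt: only the async-related additions are shown
    def init(self, all_prompts, all_seeds, all_subseeds):
        # event that the mover records once the needed weights are resident;
        # waiting on an unrecorded event is a no-op
        self.model_move_done = torch.cuda.Event()

    def sample(self, *args, **kwargs):
        self.model_move_done.wait()       # default stream waits for the prefetch
        # ... existing sampling logic ...

    def close(self):
        devices.synchronize_model_move()  # drain the copy stream before teardown
        # ... existing cleanup ...
```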
Changes on modules/sd_models.py

To implement the asynchronous model moving logic using CUDA streams and events, and integrate the lookahead buffer mechanism in the `modules/sd_models.py` file, you would need to modify several parts of the code. Here's a practical proposal for how to approach this task:

1. **Create a new CUDA stream for model moving**: You will need to create a new CUDA stream dedicated to moving the model between the CPU and GPU. This can be done using `torch.cuda.Stream()`.
2. **Integrate CUDA events for synchronization**: CUDA events will be used to synchronize the default stream with the model moving stream. You can create an event using `torch.cuda.Event()` and then wait for it to complete using `.wait()`.
3. **Implement the lookahead buffer**: The lookahead buffer is a mechanism that allows the asynchronous stream to move the model ahead of time before it's needed by the forward pass. You will need to design a buffer that can hold a certain number of model states and manage the logic for preloading and using these states.
4. **Modify the model loading and unloading logic**: The existing functions for loading and unloading the model, such as `load_model_weights`, `get_checkpoint_state_dict`, `send_model_to_device`, and `send_model_to_cpu`, will need to be updated to work with the new asynchronous stream and lookahead buffer.
5. **Ensure proper synchronization**: Before any forward pass, you must ensure that the model has been fully moved to the GPU and is ready for computation. This may involve waiting on a CUDA event that signals the completion of the model move.
6. **Handle errors and exceptions**: Ensure that any errors during the asynchronous operations are properly caught and handled. This includes handling cases where the model moving is not faster than the forward pass and the GPU ends up idle.
7. **Configuration and testing**: Add configuration options for the lookahead buffer size and other relevant settings. Thoroughly test the new asynchronous model moving logic to ensure it works correctly and provides the expected speedup.

Here's a rough sketch of how some of these changes might look in code:

```python
# At the beginning of the file, import necessary modules
import collections
import torch.cuda

# Create a new CUDA stream for model moving
model_moving_stream = torch.cuda.Stream()

# Create an event for synchronization
model_moved_event = torch.cuda.Event()

# Modify the send_model_to_device function to use the new stream
def send_model_to_device(m):
    with torch.cuda.stream(model_moving_stream):
        m.to(shared.device)
        # Record an event when the model move is complete
        model_moved_event.record()

# Modify the load_model_weights function to wait for the model to be moved
def load_model_weights(model, checkpoint_info: CheckpointInfo, state_dict, timer):
    # ... existing code ...
    # Before applying weights, make the current stream wait for the model move
    model_moved_event.wait()
    # ... existing code ...

# Implement the lookahead buffer mechanism
# This is a simplified example and would need to be integrated into the existing logic
lookahead_buffer = collections.deque(maxlen=lookahead_buffer_size)  # size comes from configuration

def preload_model_states():
    # Logic to asynchronously load future model states into the lookahead buffer
    pass

def get_next_model_state():
    # Wait for the next model state to be ready and return it
    model_moved_event.wait()
    return lookahead_buffer.popleft()
```

Please note that this is a high-level overview and not complete code. You will need to carefully integrate these changes into the existing codebase, ensuring that all interactions with the model are properly synchronized and that the lookahead buffer is managed correctly. Additionally, you will need to handle the cleanup of CUDA streams and events to avoid resource leaks.