Closed · tazlin closed this 1 week ago
@CodiumAI-Agent /describe
feat: latest torch/comfyui; perf improvements; fix: SSL cert issues

New Features/Updates

- Added a `very_fast_disk_mode` configuration option for concurrent model loading. Defaults to `false`. Set `very_fast_disk_mode: false` to only load one model at a time when it is being explicitly preloaded. There are some cases where it still might attempt to load more than one, but it should happen far less often.

Fixes and Improvements

- Improvements to `high_performance_mode`.
- Improvements for `max_threads` values greater than one: `max_threads: 2` and a bit of tuning, plus `high_performance_mode`, if you have a xx90 card. Note that `high_memory_mode` can still lead to additional instability with threads at 2. Consider `max_threads: 2` in SD1.5-only setups without controlnets/post-processing or in other conservative configurations. (See the config sketch after these notes.)
- Images are now encoded from the `rawpng` data directly, reducing redundant operations; the previous approach of `PIL.Image.open(...)` was highly inefficient, especially for very large images.
- Pinned `certifi` to resolve certificate resolution issues.
- Fixed an issue where `download_models.py` would not exit if the compvis models failed to download. This would cause the worker to crash unexpectedly, as it expects the models to be available on worker start.
- Added docker `compose` support (https://github.com/Haidra-Org/horde-worker-reGen/pull/328).
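To make the tuning advice above concrete, here is a minimal sketch of the relevant `bridgeData.yaml` fragment. The four performance options are named in the notes above; `allow_controlnet` and `allow_post_processing` are assumed key names standing in for the "no controlnets/post-processing" advice, and the values shown are illustrative, not a recommended production config:

# Illustrative bridgeData.yaml fragment -- option names from the notes above;
# allow_controlnet/allow_post_processing are assumed, values are examples only.
max_threads: 2                # >1 is described as viable with tuning
high_performance_mode: true   # suggested above for xx90-class cards
high_memory_mode: false       # the notes warn this adds instability at 2 threads
very_fast_disk_mode: false    # default; load one model at a time during preloads
allow_controlnet: false       # assumed key; matches the SD1.5-only, conservative advice
allow_post_processing: false  # assumed key; see above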
PR Type: Enhancement, Bug fix, Documentation, Tests
Description:

- Added `very_fast_disk_mode` configuration option for concurrent model loading.
- Pinned `certifi` to resolve certificate issues.

Relevant files:

| Category | Files |
| --- | --- |
| Enhancement | 11 files |
| Tests | 1 file |
| Documentation | 1 file |
| Dependencies | |
💡 PR-Agent usage: Comment `/help "your question"` on any pull request to receive relevant information.
The docker instructions aren't working for me (Arch Linux / nvidia GPU)
git clone --sparse --branch raw-png https://github.com/Haidra-Org/horde-worker-reGen.git horde-worker-reGen-png
cd horde-worker-reGen-png/
git sparse-checkout set --no-cone Dockerfiles /bridgeData_template.yaml
docker compose -f Dockerfiles/compose.cuda.yaml build --pull
docker compose -f Dockerfiles/compose.cuda.yaml up -dV
reGen | [notice] A new release of pip is available: 24.0 -> 24.3.1
reGen | [notice] To update, run: pip install --upgrade pip
reGen | 2024-10-30 18:40:57.711 | DEBUG | horde_worker_regen.load_env_vars:load_env_vars_from_config:68 - Using default AI Horde URL.
reGen | 2024-10-30 18:40:57.740 | DEBUG | horde_sdk:_dev_env_var_warnings:42 - AIWORKER_CACHE_HOME is ./models/.
reGen | 2024-10-30 18:40:59.707 | DEBUG | horde_model_reference.legacy.classes.legacy_converters:write_out_records:554 - Converted database written to: /horde-worker-reGen/models/horde_model_reference/stable_diffusion.json
reGen | 2024-10-30 18:41:00.050 | DEBUG | horde_model_reference.legacy.classes.legacy_converters:write_out_records:554 - Converted database written to: /horde-worker-reGen/models/horde_model_reference/stable_diffusion.json
reGen | 2024-10-30 18:41:00.061 | WARNING | horde_worker_regen.bridge_data.data_model:validate_performance_modes:162 - High memory mode is enabled. You may experience performance issues with more than one thread.
reGen | 2024-10-30 18:41:00.061 | WARNING | horde_worker_regen.bridge_data.data_model:validate_performance_modes:167 - Please let us know if `unload_models_from_vram_often` improves or degrades performance with `high_memory_mode` enabled.
reGen | 2024-10-30 18:41:01.056 | WARNING | horde_model_reference.model_reference_records:validator_is_style_known:132 - Unknown style control_qr for model control_qr
reGen | 2024-10-30 18:41:01.056 | WARNING | horde_model_reference.model_reference_records:validator_is_style_known:132 - Unknown style control_qr_xl for model control_qr_xl
reGen | 2024-10-30 18:41:01.061 | DEBUG | horde_sdk.ai_horde_worker.model_meta:remove_large_models:155 - Removing cascade models: {'Stable Cascade 1.0'}
reGen | 2024-10-30 18:41:01.061 | DEBUG | horde_sdk.ai_horde_worker.model_meta:remove_large_models:156 - Removing flux models: {'Flux.1-Schnell fp16 (Compact)', 'Flux.1-Schnell fp8 (Compact)'}
reGen | /horde-worker-reGen/venv/lib/python3.11/site-packages/transformers/utils/hub.py:128: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
reGen | warnings.warn(
reGen | 2024-10-30 18:41:02.834 | INFO | horde_safety.deep_danbooru_model:download_deep_danbooru_model:53 - Downloading DeepDanbooru model (~614 mb) to models/clip_blip/model-resnet_custom_v3.pt.
models/clip_blip/model-resnet_custom_v3.pt: 0% 0.00/644M [00:00<?, ?iB/s]
2024-10-30 18:41:03.458 | INFO | horde_safety.deep_danbooru_model:download_deep_danbooru_model:63 - Model already downloaded.
reGen | 2024-10-30 18:41:03.458 | INFO | horde_safety.deep_danbooru_model:verify_deep_danbooru_model_hash:30 - Verifying SHA256 hash of downloaded file.
models/clip_blip/model-resnet_custom_v3.pt: 0% 0.00/644M [00:00<?, ?iB/s]
reGen | Loading CLIP model ViT-L-14/openai...
reGen | /horde-worker-reGen/venv/lib/python3.11/site-packages/open_clip/factory.py:372: UserWarning: These pretrained weights were trained with QuickGELU activation but the model config does not have that enabled. Consider using a model config with a "-quickgelu" suffix or enable with a flag.
reGen | warnings.warn(
reGen | Loaded CLIP model and data in 2.94 seconds.
reGen | 2024-10-30 18:41:06.832 | INFO | hordelib.comfy_horde:do_comfy_import:215 - Forcing normal vram mode
reGen | Traceback (most recent call last):
reGen | File "/horde-worker-reGen/download_models.py", line 25, in <module>
reGen | download_all_models(
reGen | File "/horde-worker-reGen/horde_worker_regen/download_models.py", line 58, in download_all_models
reGen | hordelib.initialise()
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/initialisation.py", line 81, in initialise
reGen | hordelib.comfy_horde.do_comfy_import(
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/comfy_horde.py", line 229, in do_comfy_import
reGen | import execution
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/execution.py", line 13, in <module>
reGen | import nodes
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/nodes.py", line 21, in <module>
reGen | import comfy.diffusers_load
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/diffusers_load.py", line 3, in <module>
reGen | import comfy.sd
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/sd.py", line 5, in <module>
reGen | from comfy import model_management
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/model_management.py", line 143, in <module>
reGen | total_vram = get_total_memory(get_torch_device()) / (1024 * 1024)
reGen | ^^^^^^^^^^^^^^^^^^
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/hordelib/_comfyui/comfy/model_management.py", line 112, in get_torch_device
reGen | return torch.device(torch.cuda.current_device())
reGen | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 778, in current_device
reGen | _lazy_init()
reGen | File "/horde-worker-reGen/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
reGen | torch._C._cuda_init()
reGen | RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
The docker instructions aren't working for me (Arch Linux / nvidia GPU)
git clone --sparse --branch raw-png https://github.com/Haidra-Org/horde-worker-reGen.git horde-worker-reGen-png
cd horde-worker-reGen-png/
git sparse-checkout set --no-cone Dockerfiles /bridgeData_template.yaml
docker compose -f Dockerfiles/compose.cuda.yaml build --pull
docker compose -f Dockerfiles/compose.cuda.yaml up -dV
Do you have your system set up to make cuda work at all?
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Ironically, getting NVIDIA to work inside Docker is not as painless as with AMD, due to their custom kernel stuff: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
I'm not sure what is and isn't required, since I haven't tested NVIDIA GPUs on Linux for a while, but you might need (some portion of) the CUDA tooling installed locally.
The docker instructions aren't working for me (Arch Linux / nvidia GPU)
git clone --sparse --branch raw-png https://github.com/Haidra-Org/horde-worker-reGen.git horde-worker-reGen-png
cd horde-worker-reGen-png/
git sparse-checkout set --no-cone Dockerfiles /bridgeData_template.yaml
docker compose -f Dockerfiles/compose.cuda.yaml build --pull
docker compose -f Dockerfiles/compose.cuda.yaml up -dV
Do you have your system set up to make cuda work at all?
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
Ironically, getting NVIDIA to work inside Docker is not as painless as with AMD, due to their custom kernel stuff: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/sample-workload.html
Yes. In fact, I created my own Dockerfile before I knew this branch existed, and it's running fine on my system as we speak. So I'm also stumped. I can dive a bit more into comparing the two containers to figure out what's going on.
docker run --rm --gpus all ubuntu nvidia-smi --query-gpu=name --format=csv,noheader
NVIDIA GeForce RTX 4090
Yes. In fact, I created my own Dockerfile before I knew this branch existed, and it's running fine on my system as we speak. So I'm also stumped. I can dive a bit more into comparing the two containers to figure out what's going on.
Please do. I haven't had much to do with the creation of the `Dockerfile.cuda`, and @tazlin found it to be working, iirc. But the `compose.cuda.yaml` is a complete blind shot based on what worked for AMD and what I found online.
There might very well be a few issues with that, especially around exposing the GPU to the container.
There might very well be a few issues with that, especially around exposing the GPU to the container.
Good call. I compared the two `docker-compose.yml` files and found that the GPU configurations were ever so slightly different. With `count: all` added here, the error is now gone:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          capabilities: [gpu]
          count: all
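For context, a sketch of where that block sits in a complete `compose.cuda.yaml` service definition. Only the `deploy` section is confirmed by this thread; the service name, image tag, and volume mapping are placeholders:

# Hypothetical compose.cuda.yaml excerpt; everything except the deploy
# block is a placeholder for illustration.
services:
  regen:
    image: horde-worker-regen:cuda
    volumes:
      - ./bridgeData.yaml:/horde-worker-reGen/bridgeData.yaml
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
              count: all   # the missing setting; without it the container saw no NVIDIA driver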
I think you can just create a small separate PR to be merged into `raw-png` (not `main`). This wouldn't fit anything I have open, it shouldn't conflict with much else either, and you should be the one credited for fixing what was broken.
I think you can just create a small separate PR to be merged into `raw-png` (not `main`). This wouldn't fit anything I have open, it shouldn't conflict with much else either, and you should be the one credited for fixing what was broken.
Done. https://github.com/Haidra-Org/horde-worker-reGen/pull/334
@CodiumAI-Agent /review
Here are some key observations to aid the review process:
**🎫 Ticket compliance analysis 🔶**

**[333](https://github.com/Haidra-Org/horde-worker-reGen/issues/333) - Partially compliant**

Fully compliant requirements:
- Update PyTorch version to 2.5.0 without breaking older setups.
- Skip installing flash_attn on compatible cards if `FLASH_ATTENTION_USE_TRITON_ROCM=FALSE`.

Not compliant requirements:
- Test 256 head dimensions for potential use in FLUX.1.
- Test if Triton makes the use of flash_attn possible on older RDNA cards.

**[334](https://github.com/Haidra-Org/horde-worker-reGen/issues/334) - Fully compliant**

Fully compliant requirements:
- Ensure the `count: all` setting is included to prevent CUDA unavailability.

**[335](https://github.com/Haidra-Org/horde-worker-reGen/issues/335) - Fully compliant**

Fully compliant requirements:
- Use SIGINT to stop the docker container allowing graceful shutdown.
⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪

🧪 No relevant tests

🔒 No security concerns identified

⚡ Recommended focus areas for review

Possible Bug: The method `on_process_ending` is introduced, replacing `on_process_ended`. Ensure that this change is reflected everywhere in the codebase and that it does not introduce any new issues.

Performance Issue: The `sdpa_hijack` function now supports 256 head dimensions. Performance implications of this change should be reviewed, especially under different configurations.
New Features/Updates

- Added a `very_fast_disk_mode` configuration option for concurrent model loading. Defaults to `false`. Set `very_fast_disk_mode: false` to only load one model at a time when it is being explicitly preloaded. There are some cases where it still might attempt to load more than one, but it should happen far less often.

Fixes and Improvements

- `flash_attn` is now skipped on compatible cards when `FLASH_ATTENTION_USE_TRITON_ROCM=FALSE` (per issue 333; see the sketch after these notes).
- Improvements to `high_performance_mode`.
- Improvements for `max_threads` values greater than one: `max_threads: 2` and a bit of tuning, plus `high_performance_mode`, if you have a xx90 card. Note that `high_memory_mode` can still lead to additional instability with threads at 2. Consider `max_threads: 2` in SD1.5-only setups without controlnets/post-processing or in other conservative configurations.
- Images are now encoded from the `rawpng` data directly, reducing redundant operations; the previous approach of `PIL.Image.open(...)` was highly inefficient, especially for very large images.
- Pinned `certifi` to resolve certificate resolution issues.
- Fixed an issue where `download_models.py` would not exit if the compvis models failed to download. This would cause the worker to crash unexpectedly, as it expects the models to be available on worker start.
- See `Dockerfiles/README.md` for information on configuring these images.

Developer changes

- feat: add ROCm and CUDA Dockerfiles with entrypoint and setup scripts
- docker `compose` support (https://github.com/Haidra-Org/horde-worker-reGen/pull/328)

Linked issues: 333, 334, 335
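As referenced from the `flash_attn` note above, here is an illustrative sketch of setting that environment variable for a ROCm worker via compose. The variable name comes from issue 333; the service name is a placeholder, and the thread does not say whether the flag is read at image build time or at container runtime:

# Illustrative only: variable name from issue 333; the service name and the
# choice to set it at runtime (vs. build time) are assumptions.
services:
  regen:
    environment:
      - FLASH_ATTENTION_USE_TRITON_ROCM=FALSE   # skips flash_attn per issue 333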