InstantID workflow broken in the last few hours

tomryanx commented 1 week ago

I've run my InstantID ComfyUI workflow many thousands of times over the last ~6 months. A couple of hours ago it stopped working with this error:

====================================
Running workflow
ComfyUI error: 500 Internal Server Error
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.6/lib/python3.10/site-packages/cog/server/worker.py", line 461, in _handle_predict_error
    yield
  File "/root/.pyenv/versions/3.10.6/lib/python3.10/site-packages/cog/server/worker.py", line 411, in _predict
    result = predict(**payload)
  File "/src/predict.py", line 95, in predict
    self.comfyUI.run_workflow(wf)
  File "/src/helpers/comfyui.py", line 259, in run_workflow
    prompt_id = self.queue_prompt(workflow)
  File "/src/helpers/comfyui.py", line 194, in queue_prompt
    raise Exception(

I'm guessing this may be related to the changes in custom nodes around 14h ago. Below are the full logs. I'm not very familiar with the log format, but it seems to bail instead of fetching ip-adapter-plus-face_sdxl_vit-h.safetensors.

All runs before 2024-10-23T23:51:52.631992Z worked; all runs after that time fail.
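For anyone trying to pin down a similar cutoff from their own run history, here is a minimal sketch. It assumes the runs are available as (ISO 8601 timestamp, status) pairs with statuses named "succeeded" and "failed". The `find_cutoff` helper and the sample records are hypothetical and are not part of cog-comfyui.

```python
from datetime import datetime


def parse_ts(ts):
    # datetime.fromisoformat on Python 3.10 rejects a trailing "Z",
    # so rewrite it as an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def find_cutoff(runs):
    """Return (last_success, first_failure, clean) for (timestamp, status) pairs.

    `clean` is True only when every success precedes every failure,
    i.e. the breakage has a single sharp cutoff.
    """
    successes = [parse_ts(ts) for ts, status in runs if status == "succeeded"]
    failures = [parse_ts(ts) for ts, status in runs if status == "failed"]
    last_ok = max(successes) if successes else None
    first_bad = min(failures) if failures else None
    clean = last_ok is not None and first_bad is not None and last_ok < first_bad
    return last_ok, first_bad, clean


# Hypothetical run history (made-up timestamps, not taken from the logs).
sample_runs = [
    ("2024-10-23T23:40:00Z", "succeeded"),
    ("2024-10-24T00:05:00Z", "failed"),
    ("2024-10-23T23:50:00Z", "succeeded"),
    ("2024-10-23T23:55:00Z", "failed"),
]

last_ok, first_bad, clean = find_cutoff(sample_runs)
print(last_ok, first_bad, clean)
```

A clean split like this points toward a single change on the platform side; interleaved successes and failures would point toward something intermittent instead.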

Checking inputs
Downloading <snip> to /tmp/inputs/m2mql5xx.jpeg
✅ /tmp/inputs/m2mql5xx.jpeg
Downloading <snip> to /tmp/inputs/kahlo-pose.jpg
✅ /tmp/inputs/kahlo-pose.jpg
Downloading <snip> to /tmp/inputs/kahlo-style.jpg
✅ /tmp/inputs/kahlo-style.jpg
====================================
Checking weights
⏳ Downloading instantid-ip-adapter.bin to ComfyUI/models/instantid
⌛️ Downloaded instantid-ip-adapter.bin in 1.98s, size: 1612.79MB
✅ instantid-ip-adapter.bin
⏳ Downloading antelopev2 to ComfyUI/models/insightface
⌛️ Downloaded antelopev2 in 0.66s
✅ antelopev2
⏳ Downloading dreamshaperXL_lightningDPMSDE.safetensors to ComfyUI/models/checkpoints
⌛️ Downloaded dreamshaperXL_lightningDPMSDE.safetensors in 6.94s, size: 6617.76MB
✅ dreamshaperXL_lightningDPMSDE.safetensors
⏳ Downloading controlnet-depth-sdxl-1.0.fp16.safetensors to ComfyUI/models/controlnet
⌛️ Downloaded controlnet-depth-sdxl-1.0.fp16.safetensors in 2.74s, size: 2386.23MB
✅ controlnet-depth-sdxl-1.0.fp16.safetensors
⏳ Downloading IPAdapter_image_encoder_sd15.safetensors to ComfyUI/models/clip_vision
⌛️ Downloaded IPAdapter_image_encoder_sd15.safetensors in 2.46s, size: 2411.24MB
✅ IPAdapter_image_encoder_sd15.safetensors
⏳ Downloading ip-adapter-plus_sdxl_vit-h.safetensors to ComfyUI/models/ipadapter
⌛️ Downloaded ip-adapter-plus_sdxl_vit-h.safetensors in 0.95s, size: 808.26MB
✅ ip-adapter-plus_sdxl_vit-h.safetensors
⏳ Downloading instantid-controlnet.safetensors to ComfyUI/models/controlnet
⌛️ Downloaded instantid-controlnet.safetensors in 2.46s, size: 2386.23MB
✅ instantid-controlnet.safetensors
⏳ Downloading ip-adapter-plus-face_sdxl_vit-h.safetensors to ComfyUI/models/ipadapter
⌛️ Downloaded ip-adapter-plus-face_sdxl_vit-h.safetensors in 0.90s, size: 808.26MB
✅ ip-adapter-plus-face_sdxl_vit-h.safetensors
====================================
Running workflow
ComfyUI error: 500 Internal Server Error
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.10.6/lib/python3.10/site-packages/cog/server/worker.py", line 461, in _handle_predict_error
    yield
  File "/root/.pyenv/versions/3.10.6/lib/python3.10/site-packages/cog/server/worker.py", line 411, in _predict
    result = predict(**payload)
  File "/src/predict.py", line 95, in predict
    self.comfyUI.run_workflow(wf)
  File "/src/helpers/comfyui.py", line 259, in run_workflow
    prompt_id = self.queue_prompt(workflow)
  File "/src/helpers/comfyui.py", line 194, in queue_prompt
    raise Exception(
Exception: ComfyUI Error – Your workflow could not be run. This usually happens if you’re trying to use an unsupported node. Check the logs for 'KeyError: ' details, and go to https://github.com/fofr/cog-comfyui to see the list of supported custom nodes.
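The error message suggests checking the logs for 'KeyError: ' details, and the logs pasted above contain no such line. A minimal sketch of that check follows, assuming the log is available as plain text. The `find_key_errors` helper and the sample log are hypothetical.

```python
import re

# Match only lines that START with "KeyError:", so the generic hint in the
# ComfyUI error message (which merely mentions 'KeyError: ') is not counted.
KEY_ERROR_RE = re.compile(r"^\s*KeyError:\s*(.+)$")


def find_key_errors(log_text):
    """Return the key named on each 'KeyError: ...' line, quotes stripped."""
    found = []
    for line in log_text.splitlines():
        m = KEY_ERROR_RE.match(line)
        if m:
            found.append(m.group(1).strip().strip("'\""))
    return found


# Hypothetical log: the KeyError line is invented for illustration.
sample_log = "\n".join([
    "Running workflow",
    "KeyError: 'HypotheticalNode'",
    "Exception: ComfyUI Error. Check the logs for 'KeyError: ' details.",
])

print(find_key_errors(sample_log))
```

An empty result, as with the logs in this thread, suggests the failure is not an unsupported-node problem despite what the generic message says.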

a29paul commented 1 week ago

I'm getting no outputs now. It started around 12:15 AM UTC. This breaking change is going to cause a significant loss of revenue for us, unfortunately. That said, I haven't pinpointed the issue, so there's nobody to point fingers at yet.

tomryanx commented 1 week ago

@a29paul can you paste logs?

a29paul commented 1 week ago

I should be getting an image as output. Instead, only 4 nodes run, which is only ~15% of the workflow, and there's no output. These are my logs. I'm trying to find what caused this. The first occurrence happened at exactly 12:13 AM UTC.


Random seed set to: 917424458
Checking inputs
✅ /tmp/inputs/image.png
====================================
Checking weights
✅ bert-base-uncased exists in ComfyUI/models/bert-base-uncased
✅ groundingdino_swinb_cogcoor.pth exists in ComfyUI/models/grounding-dino
✅ Juggernaut_X_RunDiffusion.safetensors exists in ComfyUI/models/checkpoints
✅ sam_vit_h_4b8939.pth exists in ComfyUI/models/sams
✅ natural_glam_lora.safetensors exists in ComfyUI/models/loras
====================================
Running workflow
Executing node 2, title: Load Image, class type: LoadImage
Executing node 256, title: Image scale to side, class type: DF_Image_scale_to_side
Executing node 364, title: MediaPipe Face Mesh, class type: MediaPipe-FaceMeshPreprocessor
Executing node 367, title: MediaPipe FaceMesh to SEGS, class type: MediaPipeFaceMeshToSEGS
outputs:  {}
====================================
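To see which nodes never ran, the "Executing node" lines can be compared against the workflow JSON. This is a minimal sketch, assuming the workflow is in ComfyUI's API format (a dict keyed by node id). The helper names are hypothetical, and the sample workflow is a stand-in with one invented extra node (id "999"), not the real workflow.

```python
import re

NODE_RE = re.compile(r"^Executing node ([^,]+), title: (.*), class type: (.+)$")


def executed_nodes(log_text):
    """Parse 'Executing node ...' log lines into dicts, in log order."""
    nodes = []
    for line in log_text.splitlines():
        m = NODE_RE.match(line.strip())
        if m:
            node_id, title, class_type = m.groups()
            nodes.append({"id": node_id, "title": title, "class_type": class_type})
    return nodes


def unexecuted(workflow, nodes):
    """Return ids present in the workflow dict that never appear in the log."""
    ran = {n["id"] for n in nodes}
    return sorted(set(workflow) - ran)


# The four log lines are copied from the logs above.
sample_log = """\
Executing node 2, title: Load Image, class type: LoadImage
Executing node 256, title: Image scale to side, class type: DF_Image_scale_to_side
Executing node 364, title: MediaPipe Face Mesh, class type: MediaPipe-FaceMeshPreprocessor
Executing node 367, title: MediaPipe FaceMesh to SEGS, class type: MediaPipeFaceMeshToSEGS
outputs:  {}
"""

# Hypothetical stand-in workflow: node "999" is invented for illustration.
sample_workflow = {
    "2": {"class_type": "LoadImage"},
    "256": {"class_type": "DF_Image_scale_to_side"},
    "364": {"class_type": "MediaPipe-FaceMeshPreprocessor"},
    "367": {"class_type": "MediaPipeFaceMeshToSEGS"},
    "999": {"class_type": "SaveImage"},
}

nodes = executed_nodes(sample_log)
print(unexecuted(sample_workflow, nodes))
```

The first node in the unexecuted list that depends on the last executed node is usually the place to start looking.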

a29paul commented 1 week ago

I just updated all our dependencies and it resolved the issue. We hadn't updated anything in a while, so I figured an update would help. It might help you out too in this case, @tomryanx.

tomryanx commented 1 week ago

I think there's a clue here for me: https://github.com/cubiq/ComfyUI_IPAdapter_plus/blob/main/NODES.md#main-ipadapter-apply-nodes

@a29paul how did you update dependencies?

nickstenning commented 1 week ago

I think this is our fault. I'm pretty sure we released a broken version of cog to production. We've just rolled this change back.

nickstenning commented 1 week ago

For future reference: https://replicatestatus.com/incident/449862

Sorry for the disruption, folks. We know we've got stuff to fix here and will be working on it.

fofr / cog-comfyui

InstantID workflow broken in the last few hours #191