Runpod deployment is getting stuck and pods staying in throttled state

ai-dock / comfyui

ComfyUI docker images for use in GPU cloud and local environments. Includes AI-Dock base for authentication and improved user experience.

Runpod deployment is getting stuck and pods staying in throttled state #16

Closed by berkelmas 9 months ago

berkelmas commented 11 months ago

After following the guidance in the issue below, I forked the repository, changed IMAGE_BASE to ghcr.io/ai-dock/jupyter-pytorch:2.1.1-py3.10-cuda-11.8.0-base-22.04, added my own models and custom nodes to COPY_ROOT_EXTRA, and triggered the GitHub pipeline to build the Docker images. I then used the https://github.com/berkorg/comfy-docker/pkgs/container/comfy-docker/156936256?tag=pytorch-2.0.1-py3.10-cuda-11.8.0-base-22.04 image to create a new template in RunPod.

But in RunPod the endpoint cannot initialize and does not log anything. It stays in the state below:

Can you please help me out?

robballantyne commented 11 months ago

Can you show me exactly what you have in your template settings? I don't need the AWS stuff so please exclude any API keys.

Are the runners correctly pulling the docker image and do you have SERVERLESS=true?

berkelmas commented 11 months ago

Yeah, sure. The above are my template settings and env variables. (The AWS credentials are further down, so I did not include them in the screenshot.) @robballantyne, this is my forked repo:

https://github.com/berkorg/comfyui

and this is the outputted image that I am using:

https://github.com/berkorg/comfyui/pkgs/container/comfyui/156948877?tag=pytorch-2.1.1-py3.10-cuda-11.8.0-base-22.04

berkelmas commented 11 months ago

It is up now. Maybe there was an outage. But I am getting the error below from the endpoint:

{
    "delayTime": 5561,
    "error": "{'14': {'errors': [{'type': 'value_not_in_list', 'message': 'Value not in list', 'details': \"ckpt_name: 'sd_xl_base_1.0.safetensors' not in []\", 'extra_info': {'input_name': 'ckpt_name', 'input_config': [[]], 'received_value': 'sd_xl_base_1.0.safetensors'}}], 'dependent_outputs': ['9'], 'class_type': 'CheckpointLoaderSimple'}}",
    "executionTime": 1155,
    "id": "sync-95682cbd-9d22-41e1-bc8e-26366a4ec961-e1",
    "status": "FAILED"
}

It says sd_xl_base_1.0.safetensors is not in the list, but I have committed that change in my PR and added it to the CHECKPOINT_MODELS tuple: https://github.com/berkorg/comfyui/commit/94c0f0560f9f2f337ed937d2353e3297737a0444

robballantyne commented 11 months ago

I have checked your fork and I noticed you have trailing commas in the bash arrays inside https://github.com/berkorg/comfyui/blob/main/build/COPY_ROOT_EXTRA/opt/ai-dock/bin/build/layer1/init.sh

Bash arrays are delimited by whitespace rather than a comma. Please remove the commas and re-build.
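To illustrate why the commas matter, here is a minimal sketch. The array names and example.com URLs are hypothetical and are not taken from the fork's init.sh. In Bash, a comma written directly after a quoted string becomes part of that array element, so every URL ends up with a stray trailing comma.

```shell
#!/usr/bin/env bash
# Hypothetical URLs, for illustration only.

# Wrong: each comma is glued onto the element it follows.
with_commas=(
    "https://example.com/a.safetensors",
    "https://example.com/b.safetensors",
)

# Right: elements are separated by whitespace (spaces or newlines) only.
without_commas=(
    "https://example.com/a.safetensors"
    "https://example.com/b.safetensors"
)

# Both arrays hold two elements, but the first array's values are corrupted.
printf 'with commas:    %s\n' "${with_commas[@]}"
printf 'without commas: %s\n' "${without_commas[@]}"
```

A download loop that iterates over the first array would request URLs ending in a comma, which is one plausible way for the build to end up with no checkpoints on disk.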

berkelmas commented 11 months ago

Ah, I see. I have fixed that and deployed a new endpoint on RunPod. But do you think it is normal that the workers are trying to fetch the checkpoint? Shouldn't the checkpoints already exist in the Docker image?

robballantyne commented 11 months ago

The worker is only fetching the image to create a container. It only happens when the worker is created and not on each request.

Although my docker images fetch models when running in normal mode, it never happens in serverless mode. If the model is not present in either the container or attached storage, the request will simply fail as you saw above.
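Since serverless mode never downloads models, it can help to confirm that the expected files are baked into the image before deploying. The following is a rough sketch only: `missing_models` is a hypothetical helper, not part of the ai-dock images, and the checkpoint directory in the usage comment is an assumption that should be checked against the actual image layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which expected model files are missing.
# Usage: missing_models <checkpoint_dir> <file>...
# Returns 0 if every file exists, 1 if at least one is missing.
missing_models() {
    local dir="$1"
    shift
    local name status=0
    for name in "$@"; do
        if [ ! -f "${dir}/${name}" ]; then
            printf 'missing: %s\n' "${dir}/${name}" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call (the directory path is an assumption, adjust to your image):
#   missing_models /opt/ComfyUI/models/checkpoints sd_xl_base_1.0.safetensors
```

Running a check like this inside a container started from the built image would show whether the checkpoint list is empty, which is what the `not in []` part of the error above indicates.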

berkelmas commented 11 months ago

Ah, okay. So this is just a one-time thing during the initial container creation for serverless endpoints 👍