No error message for container with no startup command

drewpca commented 3 years ago

This was a mysterious exit from skaffold dev:

time="2021-04-13T22:06:25-07:00" level=warning msg="starting gRPC server on port 50053. (50051 is already in use)"
time="2021-04-13T22:06:25-07:00" level=warning msg="starting gRPC HTTP server on port 50054. (50052 is already in use)"
Listing files to watch...
 - bang5:5000/fingerprint_unlocker_image
Generating tags...
 - bang5:5000/fingerprint_unlocker_image -> bang5:5000/fingerprint_unlocker_image:latest
Some taggers failed. Rerun with -vdebug for errors.
Checking cache...
 - bang5:5000/fingerprint_unlocker_image: Found Remotely
Starting test...
Tags used in deployment:
 - bang5:5000/fingerprint_unlocker_image -> bang5:5000/fingerprint_unlocker_image:latest@sha256:8cda349a1d25e9bf7be0eeead25acad7b3f0063508df85c5aa96da670dd93ca5
Starting deploy...
 - service/fingerprint-unlocker created
 - deployment.apps/fingerprint-unlocker created
Waiting for deployments to stabilize...
 - deployment/fingerprint-unlocker: container fingerprint-unlocker is backing off waiting to restart
    - pod/fingerprint-unlocker-d9df9b8f8-77xf2: container fingerprint-unlocker is backing off waiting to restart
 - deployment/fingerprint-unlocker failed. Error: container fingerprint-unlocker is backing off waiting to restart.
Cleaning up...
 - service "fingerprint-unlocker" deleted
 - deployment.apps "fingerprint-unlocker" deleted
exiting dev mode because first deploy failed: 1/1 deployment(s) failed

I don't know a way to read the k8s logs after skaffold has deleted the deployment. So I tried with vdebug:

time="2021-04-13T22:06:45-07:00" level=debug msg="found open port: 50053"
time="2021-04-13T22:06:45-07:00" level=warning msg="starting gRPC server on port 50053. (50051 is already in use)"
time="2021-04-13T22:06:45-07:00" level=debug msg="found open port: 50054"
time="2021-04-13T22:06:45-07:00" level=warning msg="starting gRPC HTTP server on port 50054. (50052 is already in use)"
time="2021-04-13T22:06:45-07:00" level=info msg="Skaffold &{Version:v1.21.0 ConfigVersion:skaffold/v2beta13 GitVersion: GitCommit:4830337932d53185445812d29f078667b3b74fa5 BuildDate:2021-03-19T01:48:43Z GoVersion:go1.14.14 Compiler:gc Platform:linux/amd64}"
time="2021-04-13T22:06:45-07:00" level=info msg="Loaded Skaffold defaults from \"/home/drewp/.skaffold/config\""
time="2021-04-13T22:06:45-07:00" level=debug msg="parsed 1 configs from configuration file /my/proj/homeauto/espNode/desk/skaffold.yaml"
time="2021-04-13T22:06:45-07:00" level=debug msg="found config for context \"default\""
time="2021-04-13T22:06:45-07:00" level=debug msg="Defaulting build type to local build"
time="2021-04-13T22:06:45-07:00" level=info msg="Using kubectl context: default"
time="2021-04-13T22:06:45-07:00" level=info msg="Using insecure-registries=[bang5:5000] from config"
time="2021-04-13T22:06:45-07:00" level=debug msg="setting Docker user agent to skaffold-v1.21.0"
time="2021-04-13T22:06:45-07:00" level=debug msg="Using builder: local"
time="2021-04-13T22:06:45-07:00" level=debug msg="push value not present, defaulting to true because cluster.PushImages is true"
Listing files to watch...
 - bang5:5000/fingerprint_unlocker_image
time="2021-04-13T22:06:45-07:00" level=debug msg="Found dependencies for dockerfile: [{requirements.txt /opt true} {admin.py /opt true} {fingerprint_image.py /opt true} {unlocker.py /opt true}]"
time="2021-04-13T22:06:45-07:00" level=info msg="List generated in 14.32333ms"
Generating tags...
 - bang5:5000/fingerprint_unlocker_image -> time="2021-04-13T22:06:45-07:00" level=debug msg="could not parse date \"\""
time="2021-04-13T22:06:45-07:00" level=debug msg="Running command: [git describe --tags --always]"
time="2021-04-13T22:06:45-07:00" level=debug msg="generating tag: unable to find git commit: running [git describe --tags --always]\n - stdout: \"\"\n - stderr: \"fatal: not a git repository (or any parent up to mount point /my)\\nStopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).\\n\"\n - cause: exit status 128"
time="2021-04-13T22:06:45-07:00" level=debug msg="Using a fall-back tagger"
bang5:5000/fingerprint_unlocker_image:latest
Some taggers failed. Rerun with -vdebug for errors.
time="2021-04-13T22:06:45-07:00" level=info msg="Tags generated in 17.958451ms"
Checking cache...
time="2021-04-13T22:06:45-07:00" level=debug msg="Found dependencies for dockerfile: [{requirements.txt /opt true} {admin.py /opt true} {fingerprint_image.py /opt true} {unlocker.py /opt true}]"
time="2021-04-13T22:06:45-07:00" level=debug msg="push value not present, defaulting to true because cluster.PushImages is true"
 - bang5:5000/fingerprint_unlocker_image: time="2021-04-13T22:06:45-07:00" level=debug msg="push value not present, defaulting to true because cluster.PushImages is true"
Found Remotely
time="2021-04-13T22:06:45-07:00" level=debug msg="push value not present, defaulting to true because cluster.PushImages is true"
time="2021-04-13T22:06:45-07:00" level=info msg="Cache check completed in 45.905996ms"
Starting test...
time="2021-04-13T22:06:45-07:00" level=info msg="Test completed in 30.208µs"
Tags used in deployment:
 - bang5:5000/fingerprint_unlocker_image -> bang5:5000/fingerprint_unlocker_image:latest@sha256:8cda349a1d25e9bf7be0eeead25acad7b3f0063508df85c5aa96da670dd93ca5
time="2021-04-13T22:06:45-07:00" level=debug msg="push value not present, defaulting to true because cluster.PushImages is true"
time="2021-04-13T22:06:45-07:00" level=debug msg="getting client config for kubeContext: ``"
Starting deploy...
time="2021-04-13T22:06:45-07:00" level=debug msg="Running command: [kubectl version --client -ojson]"
time="2021-04-13T22:06:46-07:00" level=debug msg="Command output: [{\n  \"clientVersion\": {\n    \"major\": \"1\",\n    \"minor\": \"18\",\n    \"gitVersion\": \"v1.18.10+k3s1\",\n    \"gitCommit\": \"6fa9730656bfc62993cb5a5d8e1be9924bedf021\",\n    \"gitTreeState\": \"clean\",\n    \"buildDate\": \"2020-10-16T00:08:42Z\",\n    \"goVersion\": \"go1.13.15\",\n    \"compiler\": \"gc\",\n    \"platform\": \"linux/amd64\"\n  }\n}\n]"
time="2021-04-13T22:06:46-07:00" level=debug msg="Running command: [kubectl --context default create --dry-run=client -oyaml -f /my/proj/homeauto/espNode/desk/deploy.yaml]"
time="2021-04-13T22:06:46-07:00" level=debug msg="Command output: [apiVersion: v1\nkind: Service\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  ports:\n  - port: 10000\n    protocol: TCP\n  selector:\n    app: fingerprint-unlocker\n---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: fingerprint-unlocker\n  template:\n    metadata:\n      labels:\n        app: fingerprint-unlocker\n    spec:\n      containers:\n      - image: bang5:5000/fingerprint_unlocker_image\n        name: fingerprint-unlocker\n]"
time="2021-04-13T22:06:46-07:00" level=debug msg="manifests with tagged images: apiVersion: v1\nkind: Service\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  ports:\n  - port: 10000\n    protocol: TCP\n  selector:\n    app: fingerprint-unlocker\n---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: fingerprint-unlocker\n  template:\n    metadata:\n      labels:\n        app: fingerprint-unlocker\n    spec:\n      containers:\n      - image: bang5:5000/fingerprint_unlocker_image:latest@sha256:8cda349a1d25e9bf7be0eeead25acad7b3f0063508df85c5aa96da670dd93ca5\n        name: fingerprint-unlocker"
time="2021-04-13T22:06:46-07:00" level=debug msg="manifests with labels apiVersion: v1\nkind: Service\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n    app.kubernetes.io/managed-by: skaffold\n    skaffold.dev/run-id: 9eebd02e-7450-4f3a-b4b2-afdc78e1250d\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  ports:\n  - port: 10000\n    protocol: TCP\n  selector:\n    app: fingerprint-unlocker\n---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n    app.kubernetes.io/managed-by: skaffold\n    skaffold.dev/run-id: 9eebd02e-7450-4f3a-b4b2-afdc78e1250d\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: fingerprint-unlocker\n  template:\n    metadata:\n      labels:\n        app: fingerprint-unlocker\n        app.kubernetes.io/managed-by: skaffold\n        skaffold.dev/run-id: 9eebd02e-7450-4f3a-b4b2-afdc78e1250d\n    spec:\n      containers:\n      - image: bang5:5000/fingerprint_unlocker_image:latest@sha256:8cda349a1d25e9bf7be0eeead25acad7b3f0063508df85c5aa96da670dd93ca5\n        name: fingerprint-unlocker"
time="2021-04-13T22:06:46-07:00" level=debug msg="Running command: [kubectl --context default get -f - --ignore-not-found -ojson]"
time="2021-04-13T22:06:46-07:00" level=debug msg="Command output: []"
time="2021-04-13T22:06:46-07:00" level=debug msg="2 manifests to deploy. 2 are updated or new"
time="2021-04-13T22:06:46-07:00" level=debug msg="Running command: [kubectl --context default apply -f -]"
 - service/fingerprint-unlocker created
 - deployment.apps/fingerprint-unlocker created
time="2021-04-13T22:06:47-07:00" level=info msg="Deploy completed in 1.318 second"
Waiting for deployments to stabilize...
time="2021-04-13T22:06:47-07:00" level=debug msg="getting client config for kubeContext: ``"
time="2021-04-13T22:06:47-07:00" level=debug msg="checking status deployment/fingerprint-unlocker"
time="2021-04-13T22:06:48-07:00" level=debug msg="Running command: [kubectl --context default rollout status deployment fingerprint-unlocker --namespace default --watch=false]"
time="2021-04-13T22:06:48-07:00" level=debug msg="Command output: [Waiting for deployment \"fingerprint-unlocker\" rollout to finish: 0 of 1 updated replicas are available...\n]"
time="2021-04-13T22:06:48-07:00" level=debug msg="Pod \"fingerprint-unlocker-84f4c7b468-4xrnn\" scheduled: checking container statuses"
time="2021-04-13T22:06:49-07:00" level=debug msg="Running command: [kubectl --context default rollout status deployment fingerprint-unlocker --namespace default --watch=false]"
time="2021-04-13T22:06:50-07:00" level=debug msg="Command output: [Waiting for deployment \"fingerprint-unlocker\" rollout to finish: 0 of 1 updated replicas are available...\n]"
time="2021-04-13T22:06:50-07:00" level=debug msg="Pod \"fingerprint-unlocker-84f4c7b468-4xrnn\" scheduled: checking container statuses"
time="2021-04-13T22:06:50-07:00" level=debug msg="Fetching events for pod \"fingerprint-unlocker-84f4c7b468-4xrnn\""
time="2021-04-13T22:06:51-07:00" level=debug msg="Running command: [kubectl --context default rollout status deployment fingerprint-unlocker --namespace default --watch=false]"
time="2021-04-13T22:06:51-07:00" level=debug msg="Command output: [Waiting for deployment \"fingerprint-unlocker\" rollout to finish: 0 of 1 updated replicas are available...\n]"
time="2021-04-13T22:06:51-07:00" level=debug msg="Pod \"fingerprint-unlocker-84f4c7b468-4xrnn\" scheduled: checking container statuses"
time="2021-04-13T22:06:51-07:00" level=debug msg="Fetching logs for container fingerprint-unlocker-84f4c7b468-4xrnn/fingerprint-unlocker"
 - deployment/fingerprint-unlocker: container fingerprint-unlocker is backing off waiting to restart
    - pod/fingerprint-unlocker-84f4c7b468-4xrnn: container fingerprint-unlocker is backing off waiting to restart
 - deployment/fingerprint-unlocker failed. Error: container fingerprint-unlocker is backing off waiting to restart.
Cleaning up...
time="2021-04-13T22:06:51-07:00" level=debug msg="Running command: [kubectl --context default create --dry-run=client -oyaml -f /my/proj/homeauto/espNode/desk/deploy.yaml]"
time="2021-04-13T22:06:52-07:00" level=debug msg="Command output: [apiVersion: v1\nkind: Service\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  ports:\n  - port: 10000\n    protocol: TCP\n  selector:\n    app: fingerprint-unlocker\n---\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n  labels:\n    app: fingerprint-unlocker\n  name: fingerprint-unlocker\n  namespace: default\nspec:\n  replicas: 1\n  selector:\n    matchLabels:\n      app: fingerprint-unlocker\n  template:\n    metadata:\n      labels:\n        app: fingerprint-unlocker\n    spec:\n      containers:\n      - image: bang5:5000/fingerprint_unlocker_image\n        name: fingerprint-unlocker\n]"
time="2021-04-13T22:06:52-07:00" level=debug msg="Running command: [kubectl --context default delete --ignore-not-found=true -f -]"
 - service "fingerprint-unlocker" deleted
 - deployment.apps "fingerprint-unlocker" deleted
time="2021-04-13T22:06:52-07:00" level=info msg="Cleanup completed in 670.45783ms"
exiting dev mode because first deploy failed: 1/1 deployment(s) failed
time="2021-04-13T22:06:52-07:00" level=debug msg="exporting metrics"
time="2021-04-13T22:06:53-07:00" level=debug msg="metrics uploading complete in 1.033947268s"

This is still mysterious.

The issue was that I had no entrypoint/command line in Dockerfile nor any command set in deploy.yaml. Is it in-scope for skaffold to detect that? If not, is it reasonable for skaf to leave the bad deployment up for me to inspect, like "this didn't startup once, and the problem will be logged on the k8s side. Run kubectl logs ____ to see the error."

Information

Skaffold version: 1.21.0
Operating system: ubuntu
Installed via: wget

gsquared94 commented 3 years ago

I think if you run skaffold with --cleanup=false flag then it won't clean up it's deployments.

MarlonGamez commented 3 years ago

@drewpca was using --cleanup=false able to help you with this issue? Do you think we should special case this situation and add helpful tips to our output for this?

drewpca commented 3 years ago

I already knew the fix, so I wasn't trying to diagnose my setup more. I'd rather focus on how skaffold can help me follow the chain of errors until I got to the right logs.

Here's a separate mistake I made the other day: my Dockerfile had CMD ["/opt/entrypoint.sh"] but entrypoint.sh was not chmod executable. skaffold -v debug showed me this:

DEBU[0011] Unknown waiting reason for container "nginx2": {&ContainerStateWaiting{Reason:RunContainerError,Message:failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"/opt/entrypoint.sh\": permission denied": unknown,} nil nil}

but without -vdebug, the problem was again mysterious. Is it straightforward to make that one log line visible when it's a "starting container process caused..." type? I would think that's always going to be important to debugging.

I'd like a general policy that if there's an error with startup (like there was in both of these cases), skaffold either 1) shows that error, or 2) exits with advice on where to look next, which might be "rerun with -vdebug" or "rerun with --cleanup=false then look at kubectl logs".

tejal29 commented 3 years ago

Thanks for filing this issue. Sorry status check did not surface this issue.

briandealwis commented 2 years ago

My question for @tejal29: we currently only treat RunContainerError as an error when the message indicates a docker failure: https://github.com/GoogleContainerTools/skaffold/blob/aea70c4ca76e77f9c32929514e0685e33912877d/pkg/diag/validator/pod.go#L382-L386 https://github.com/GoogleContainerTools/skaffold/blob/aea70c4ca76e77f9c32929514e0685e33912877d/pkg/diag/validator/pod.go#L60 https://github.com/GoogleContainerTools/skaffold/blob/aea70c4ca76e77f9c32929514e0685e33912877d/pkg/diag/validator/pod.go#L43

Why not just flag all RunContainerErrors as an error

    case runContainerError:
        return proto.StatusCode_STATUSCHECK_RUN_CONTAINER_ERR, nil, fmt.Errorf("container %s in error: %s", c.Name, c.State.Waiting.Message)
    }

Or perhaps we can add more to the errorPrefix regexp:

    errorPrefix         = `(?P<Prefix>.*)(?P<DaemonLog>Error response from daemon\:|starting container process caused )(?P<Error>.*)`

tejal29 commented 2 years ago

Sounds good. When this RE was added it was catered/specific towards cloud run. I have no issues changing this to be more generic.

GoogleContainerTools / skaffold

No error message for container with no startup command #5691

Information