DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0

[BUG] serverless-init agent swallows any errors thrown by underlying code #21879

Closed ajmayes closed 4 months ago

ajmayes commented 10 months ago

Agent Environment serverless-init:1

Describe what happened: The JAR kicked off by the bash startup script exits with code 1, but serverless-init exits with code 0.

Describe what you expected: I would expect serverless-init to exit with code 1, to indicate that the application run was unsuccessful. This is important for Google Cloud Run to know whether the Job in this specific instance should be re-run.

Steps to reproduce the issue:

  1. Make simple Spring Boot application that will fail on startup.

  2. Create startup script like this one:

    #!/bin/bash

    exec java ${JAVA_OPTS:-} -jar /opt/application/app.jar

  3. Use Dockerfile like the following to initialize:

    FROM azul/zulu-openjdk:17-latest

    RUN groupadd application && useradd -g application application

    COPY --from=gcr.io/datadoghq/serverless-init:1 /datadog-init /app/datadog-init
    ADD https://dtdg.co/latest-java-tracer /dd_tracer/java/dd-java-agent.jar

    RUN chown -R application /app/datadog-init /dd_tracer/java/dd-java-agent.jar

    USER application

    COPY ./build/libs/cloud-run-task-example-0.0.1-SNAPSHOT.jar /opt/application/app.jar
    COPY startApp.sh /opt/application/startApp.sh

    ENTRYPOINT ["/app/datadog-init"]
    CMD ["/opt/application/startApp.sh"]

  4. Notice that the application exits with exit code 1, but the Docker container exits with exit code 0, i.e. the serverless agent captures the exit code but does nothing with it:

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Error exiting: exit status 1

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | INFO | Triggering a flush in the logs-agent

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Flush in the logs-agent done.

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | finished flushing

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Received a Flush trigger

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Demultiplexer: sendIterableSeries: start sending iterable series to the serializer

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | The payload was not too big, returning the full payload

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | SyncForwarder has flushed 1 transactions

2024-01-05 01:48:23 UTC | SERVERLESS_INIT | DEBUG | Demultiplexer: sendIterableSeries: stop routine

This can also be seen in this code: initcontainer.Run()
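For anyone reproducing this locally, a minimal sketch of how the mismatch could be observed. The helper name `report_exit_code` and the image tag in the usage comment are assumptions for illustration, not part of the original report.

```shell
#!/bin/bash

# Runs the given command and prints the exit code the caller observes.
# report_exit_code is a hypothetical helper, not part of serverless-init.
report_exit_code() {
    local status=0
    "$@" || status=$?
    echo "exit code: ${status}"
}

# Hypothetical usage against a locally built image (the tag is an assumption):
#   report_exit_code docker run --rm cloud-run-task-example
# With the behavior described above, this would print "exit code: 0"
# even though the wrapped JAR exited with 1.
```

Running the startup script directly, without datadog-init as the entrypoint, should report the application's real exit code for comparison.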

Additional environment details (Operating System, Cloud provider, etc): Google Cloud Run, Java (Spring Boot)

tjosgood commented 9 months ago

Same problem, exactly as described. It means that Cloud Run jobs always report success even when the task failed.

sjmunoz commented 7 months ago

same problem, exactly as described

ilkerc commented 7 months ago

Same issue, looking for workarounds

alexgallotta commented 7 months ago

A similar issue was addressed by a bug fix in version 1.1.2 ("Fixes propagation of OS signals"):

https://hub.docker.com/r/datadog/serverless-init

Can you try updating to that version or later and check whether it still happens?

ajmayes commented 7 months ago

It looks like that fix went out 5 months ago. I've been using latest this entire time and the issue is still there.

alexgallotta commented 7 months ago

Thanks for confirming, we will add this to our issue list and look into it as soon as possible!

ilkerc commented 6 months ago

Maybe this can motivate;

Waking up every day, checking that issue's state,
One of these days, D'dog will fix my fate.
Hoping this verse I penned will accelerate,
Before my code turns into something we all hate.

crea1 commented 5 months ago

I reached out to Datadog support and created a ticket for this. The response was that they don't officially support Cloud Run Jobs, which is a bit confusing to me: in GCP they are grouped together, so you would assume it's more or less the same thing, and you do get some Cloud Run Jobs specific metrics. I don't know how this affects "normal" Cloud Run applications. Anyway, they had an open feature request to support this, so they added me to the list of interested users in the hope of getting it prioritized.

jfgreen-liberis commented 4 months ago

Running into this issue also using Cloud Run Jobs and datadog/serverless-init.

Would be great if datadog-init could preserve the status code of the subprocess it wraps.

duncanpharvey commented 4 months ago

Hi all! I wanted to share that we currently have a fix in progress to return an exit code of 1 if there is an error during the application run. I'll share here once we've released a version of serverless-init that does not swallow errors.

https://github.com/DataDog/datadog-agent/pull/27259

Update: instead of always returning exit code 1 on an error, serverless-init will attempt to propagate the application's exit code if one is available.
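To illustrate the idea of propagating the wrapped process's exit code, here is a minimal shell sketch. It is not the serverless-init implementation (that lives in the linked PR); `run_and_propagate` and `flush_telemetry` are hypothetical names used only for this example.

```shell
#!/bin/bash

# Hypothetical stand-in for the flush that happens before the wrapper exits.
flush_telemetry() {
    echo "flushing logs and metrics" >&2
}

# Runs the wrapped command, flushes, then returns the command's own status
# instead of the status of the last cleanup step.
run_and_propagate() {
    local status=0
    "$@" || status=$?
    flush_telemetry
    return "${status}"
}

# Hypothetical usage as the last lines of a wrapper script:
#   run_and_propagate /opt/application/startApp.sh
#   exit $?
```

The key point is that the status is captured before the flush runs, so a successful flush cannot overwrite a failing exit code from the application.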

duncanpharvey commented 4 months ago

serverless-init v1.2.5 is released as of today! With this version moving forward, exit codes from an instrumented application will be propagated by serverless-init. Please feel free to reopen this issue if anyone encounters unexpected behavior related to this feature.

https://hub.docker.com/r/datadog/serverless-init/tags

ajmayes commented 4 months ago

Just tested it, it's working. Thanks Duncan!