dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io

[dagster-k8s] Pods launched via Dagster Pipes report failure if pod is alive longer than 4 hours #21331

Open MattyKuzyk opened 4 months ago

MattyKuzyk commented 4 months ago

Dagster version

1.6.13

What's the issue?

Our asset launches pods via PipesK8sClient.run():

        session = pipes_k8s_client.run(
            context=context,
            base_pod_meta=pod["metadata"],
            base_pod_spec=pod["spec"],
            extras=extras,
        )

        if len(outs) > 1:
            return session.get_results()
        else:
            return session.get_materialize_result()

On the runner side, when the ETL process is finished, we report an asset materialization to Dagster Pipes:

    for key, value in metadata.items():
        key = key.replace("-", "_")
        context.report_asset_materialization(
            asset_key=None if key == "" else key,
            metadata=value,
        )

We also have an asset check called main_container_success that we report back to Dagster.
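
For context, on the runner side the check report looks roughly like this (a minimal sketch, assuming dagster_pipes is available in the pod; run_etl and the pass/fail flag are illustrative):

    from dagster_pipes import open_dagster_pipes

    with open_dagster_pipes() as pipes:
        etl_succeeded = run_etl()  # stand-in for the real ETL work
        pipes.report_asset_check(
            check_name="main_container_success",
            passed=etl_succeeded,
        )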

This works great for us, and we run these workloads many times a day without issue. However, when a run exceeds 4 hours, it consistently fails even if the code exits successfully. Here is a screenshot from a test where a busybox pod simply runs sleep, which we set up after seeing the issue intermittently in production workloads:

[screenshot: run history of the busybox sleep test]

Note that all runs over 4 hours are failures, and all runs under 4 hours are successes.

What did you expect to happen?

We expect the materialization to be reported as a success for pipes pods that run longer than 4 hours.

How to reproduce?

  1. Launch a pod that runs sleep 15000 via PipesK8sClient.run() (see the sketch after this list)
  2. Report an asset materialization and asset check after the 4-hour mark and exit the pod
  3. See that Dagster reports the run as a failure despite it running to completion successfully
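
For reference, a minimal sketch of step 1 (not our production code; the asset and resource names are made up, and in the real setup a Pipes-enabled process in the pod reports the materialization and asset check after the sleep, per step 2):

    from dagster import AssetExecutionContext, Definitions, asset
    from dagster_k8s import PipesK8sClient


    @asset
    def long_sleep_pipes_asset(
        context: AssetExecutionContext, pipes_k8s_client: PipesK8sClient
    ):
        return pipes_k8s_client.run(
            context=context,
            base_pod_spec={
                "containers": [
                    {
                        "name": "main",
                        "image": "busybox",
                        # ~4h10m, comfortably past the 4-hour mark
                        "command": ["sh", "-c", "sleep 15000"],
                    }
                ]
            },
        ).get_materialize_result()


    defs = Definitions(
        assets=[long_sleep_pipes_asset],
        resources={"pipes_k8s_client": PipesK8sClient()},
    )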

Deployment type

Dagster Helm chart

Deployment details

Additional information

Successful run (note the ASSET_CHECK_EVALUATION event that's not present above): [screenshot: event log of a successful run]

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

MattyKuzyk commented 4 months ago

Small follow-up: we printed the PipesMessageHandler object, and unsurprisingly, for our long-running failed jobs we can see that it never received a closed message:

{'_context': <dagster._core.execution.context.compute.OpExecutionContext object at 0x7f5acc04b700>, '_message_reader': <app.extensions.custom_pipe_log_reader.PipesK8sPodLogsMessageReader object at 0x7f5ad5fc59c0>, '_result_queue': <queue.Queue object at 0x7f5acc0583d0>, '_extra_msg_queue': <queue.Queue object at 0x7f5acc0584c0>, '_received_opened_msg': True, '_received_closed_msg': False, '_opened_payload': None}
Message queue length is: 0

Reproducing this with the stock PipesK8sPodLogsMessageReader would obviously be ideal, but I want to stress that it errors out for us because it does not expect multiple containers. I can share the code from our extension if necessary, but it's fairly straightforward.

EDIT: Just to make sure, we tried launching the sleep process again using the default PipesK8sPodLogsMessageReader; the only override we had to make was passing in container="sidecar" here. We still hit the same error described above after 4 hours.

We suspected the threads we were using in our custom log reader code were swallowing an error from core_api.read_namespaced_pod_log, but I still can't see any exception from Kubernetes even with the default log reader.

MattyKuzyk commented 4 months ago

We have learned a lot and have found what we believe to be the crux of the issue. In the run function of the PipesK8sClient, the main thread is blocked by consume_pod_logs, which in turn calls core_api.read_namespaced_pod_log. Once the connection times out due to the kubelet timeout mentioned above, that stream simply ends and unblocks the main thread. No exception is thrown by Kubernetes.

At this point, the main thread continues on to client.wait_for_pod. Dagster never listens to the logs again, so it never receives the asset checks or the close message. The container eventually exits and Dagster shuts the pod down in the finally block.

To mitigate this, we are overriding the PipesK8sClient as well as the PipesK8sPodLogsMessageReader and moving the log reading into its own thread that keeps refreshing the stream until it receives a signal from the main thread that the pod has terminated. Refreshing the log stream is tricky as read_namespaced_pod_log does not offer a good interface for ensuring you don't duplicate or drop logs in between streams. If we duplicate the asset check message for instance, Dagster gets very upset.
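
For anyone else hitting this, here is a rough sketch of the shape of our workaround (not our actual extension; handle_line and the polling interval are placeholders). A background thread keeps re-reading the pod log until the main thread signals that the pod has terminated; to dodge the duplicate/dropped-line problem it naively re-reads the full log each poll and skips lines it has already handled, whereas a real implementation would want since_seconds/timestamps, which is exactly where it gets tricky:

    import threading
    import time

    from kubernetes.client.rest import ApiException


    def tail_pod_logs_until_done(core_api, pod_name, namespace, container,
                                 pod_done, handle_line, poll_interval=5.0):
        """Keep refreshing the pod log until the main thread sets pod_done."""
        lines_handled = 0
        while True:
            # If the pod already terminated, this fetch sees the complete log,
            # so we can exit after processing it.
            final_pass = pod_done.is_set()
            try:
                raw = core_api.read_namespaced_pod_log(
                    pod_name, namespace, container=container
                )
            except ApiException:
                raw = ""  # pod not ready yet, or a transient API error
            lines = raw.splitlines()
            for line in lines[lines_handled:]:
                handle_line(line)  # e.g. feed into the Pipes message handler
            lines_handled = max(lines_handled, len(lines))
            if final_pass:
                return
            time.sleep(poll_interval)


    # Main-thread usage, roughly:
    #   pod_done = threading.Event()
    #   t = threading.Thread(
    #       target=tail_pod_logs_until_done,
    #       args=(core_api, pod_name, namespace, "sidecar", pod_done, handle_line),
    #       daemon=True,
    #   )
    #   t.start()
    #   ...wait for the pod to terminate...
    #   pod_done.set()
    #   t.join()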

jamiedemaria commented 4 months ago

hi @MattyKuzyk sorry for the delayed response here, and thanks for all your work to diagnose the issue. @alangenfeld is this something you would be able to take a look at?

alangenfeld commented 4 months ago

Echoing Jamie, thanks for the detailed report and follow-up here. I did not know about this 4-hour timeout.

If you didn't have the multi-container constraint I would suggest using one of the blob storage message reader/writers, but I suspect tweaking that to merge across multiple containers would be non-trivial.
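
For reference, the single-container version of that suggestion looks roughly like this (a sketch assuming dagster-aws, S3, and a made-up bucket name; other blob stores have analogous reader/writer pairs):

    import boto3
    from dagster_aws.pipes import PipesS3MessageReader
    from dagster_k8s import PipesK8sClient

    # Orchestration side: read Pipes messages from S3 instead of pod logs,
    # so a dropped log stream no longer loses them.
    pipes_k8s_client = PipesK8sClient(
        message_reader=PipesS3MessageReader(
            bucket="my-pipes-messages",  # made-up bucket name
            client=boto3.client("s3"),
        )
    )

    # External side (inside the pod), roughly:
    #   from dagster_pipes import PipesS3MessageWriter, open_dagster_pipes
    #   with open_dagster_pipes(
    #       message_writer=PipesS3MessageWriter(client=boto3.client("s3"))
    #   ) as pipes:
    #       ...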

> Refreshing the log stream is tricky as read_namespaced_pod_log does not offer a good interface for ensuring you don't duplicate or drop logs in between streams. If we duplicate the asset check message for instance, Dagster gets very upset.

Ya agree this is not as easy as I hoped. Looking at https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/CoreV1Api.md#read_namespaced_pod_log I am thinking

MattyKuzyk commented 4 months ago

Sorry for the late reply, I was out last week.

We are working on a custom extension that would resolve this issue; we already have a custom extension for multi-container support. It will probably be in a rough state for a while, but maybe we'll have time to polish it up for an upstream PR sometime this month.