dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Dagster stdout/stderr log capture sometimes prevents the final lines in a step from writing to process output #23336

Open gibsondan opened 1 month ago

gibsondan commented 1 month ago

Dagster version

1.7.15

What's the issue?

Dagster's stdout/stderr capture works by redirecting the stdout/stderr streams to a file on the local filesystem. To ensure that the captured log lines still appear in the "real" stdout/stderr, it also runs a tailer process that tails the local file and prints its contents back to the real stdout/stderr.

Sometimes (particularly in a containerized environment; I found this harder to reproduce locally), if events or log lines come in right at the end of the step, the tailer process is terminated before it has time to print them to the process output, so they never appear there. See: https://github.com/dagster-io/dagster/blob/20e8596e810ded6327052e7c622c39b940241c28/python_modules/dagster/dagster/_core/execution/compute_logs.py#L118-L146
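
For context, here is a minimal, self-contained sketch of the redirect-and-tail pattern described above. This illustrates the general mechanism only, not Dagster's actual compute_logs implementation; the capture path, function names, and cleanup logic are hypothetical.

    import os
    import subprocess
    import sys

    # Hypothetical capture path; Dagster's real implementation manages these files itself.
    CAPTURE_PATH = "/tmp/step_stdout.log"

    def run_step_with_capture(step_fn):
        # Redirect this process's stdout (fd 1) to a local file.
        capture_file = open(CAPTURE_PATH, "w")
        original_stdout_fd = os.dup(1)
        os.dup2(capture_file.fileno(), 1)

        # Tail the capture file back to the "real" stdout so the lines still
        # show up in the container/process output.
        tailer = subprocess.Popen(
            ["tail", "-F", "-n", "+1", CAPTURE_PATH],
            stdout=original_stdout_fd,
        )
        try:
            step_fn()
        finally:
            # Restore stdout and stop the tailer. If the step wrote lines just
            # before this point, the tailer may be terminated before its next
            # read of the file; this is the race described in the issue.
            sys.stdout.flush()
            os.dup2(original_stdout_fd, 1)
            capture_file.close()
            tailer.terminate()
            tailer.wait()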

What did you expect to happen?

All log lines from the step to appear in the process output.

How to reproduce?

Run the examples/deploy_docker example, launch a run, and view the container logs for the run; several log lines (STEP_START, STEP_SUCCESS, and others) are missing because the step in that example finishes almost immediately.

Deployment type

None

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

HynekBlaha commented 1 week ago

Hello @gibsondan, we are affected by this issue. After enabling azureBlobComputeLogManagerConfig on our Dagster Kubernetes deployment, the container logs are incomplete, and our Grafana alerts based on a Loki data source started firing with a NoData reason.

From my point of view, these are the key points:

My observation from a real case: the last line in the container logs had timestamp "2024-09-07T06:15:42.178000Z". The captured file contained additional log lines with timestamps "2024-09-07T06:15:45.798000Z", "2024-09-07T06:15:45.827000Z", and "2024-09-07T06:15:45.904000Z" (almost a 4 s gap after the last line that made it to the container logs, but only about 0.1 s between the first and last missing lines).

If line buffering works as I expect, I would understand if only the last log message at "2024-09-07T06:15:45.904000Z" were missing. Since more than one line is missing, the only plausible root cause is the tail process's polling interval being too long, so it is killed before its next read.

What would be the safest (least invasive) way to fix this? Do you think it would be enough to add an extra sleep(n) before the os.kill? https://github.com/dagster-io/dagster/blob/5a828823b20b9f6544a04afd51ab2de720347f7c/python_modules/dagster/dagster/_core/execution/scripts/watch_orphans.py#L33
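
To make the idea concrete, a rough sketch of the kind of grace period I have in mind is below. It assumes the watcher polls for the parent (step) process and then kills the tail process; the argument handling, structure, and GRACE_SECONDS value are hypothetical and not the actual contents of watch_orphans.py.

    import os
    import signal
    import sys
    import time

    # Hypothetical sketch of the proposed change, not the actual watch_orphans.py.
    # Assumes the script receives the pid of the parent (step) process and the
    # pid of the tail process it should clean up.
    GRACE_SECONDS = 2  # illustrative extra delay before killing the tailer

    def watch(parent_pid: int, tail_pid: int) -> None:
        # Poll until the parent process is gone.
        while True:
            try:
                os.kill(parent_pid, 0)  # signal 0 only checks that the pid exists
            except OSError:
                break
            time.sleep(1)

        # Give the tailer a moment to flush lines that were written to the
        # capture file right before the step finished.
        time.sleep(GRACE_SECONDS)
        try:
            os.kill(tail_pid, signal.SIGTERM)
        except OSError:
            pass  # tailer already exited

    if __name__ == "__main__":
        watch(int(sys.argv[1]), int(sys.argv[2]))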

Another solution might be to remove the watcher process and use the --pid param (see tail(1)):

--pid=PID
       with -f, terminate after process ID, PID dies; 
       can be repeated to watch multiple processes

but this does not work on macOS. So maybe keep macOS on the current flow and add the --pid path only for Linux? This way it could be easily fixed for dagster-k8s.
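
For illustration, a rough sketch of what the Linux-only --pid path could look like is below; the function name and arguments are hypothetical, not Dagster's actual API, and macOS keeps the current watcher-based flow.

    import subprocess
    import sys

    def start_tailer(capture_path: str, parent_pid: int, real_stdout_fd: int) -> subprocess.Popen:
        # Hypothetical sketch of the proposal above, not Dagster's actual code.
        # real_stdout_fd is the saved "real" stdout fd (the step process's own
        # stdout is already redirected to capture_path at this point).
        if sys.platform.startswith("linux"):
            # GNU tail exits on its own once the watched pid dies, so no separate
            # watcher process / os.kill is needed; lines already written to the
            # capture file are expected to be printed before tail exits.
            cmd = ["tail", "-F", "-n", "+1", f"--pid={parent_pid}", capture_path]
        else:
            # macOS/BSD tail has no --pid, so keep the current flow: the watcher
            # process terminates this tailer after the step process exits.
            cmd = ["tail", "-F", "-n", "+1", capture_path]
        return subprocess.Popen(cmd, stdout=real_stdout_fd)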

I would gladly volunteer to implement this fix for Linux if you approve.

Thanks!

gibsondan commented 3 days ago

@HynekBlaha the --pid argument direction sounds promising to me - if you're willing to try it that sounds great!