Open gibsondan opened 1 month ago
Hello @gibsondan, we are affected by this issue. After enabling azureBlobComputeLogManagerConfig on Dagster Kubernetes, the container logs are incomplete, and our Grafana alerts based on a Loki source started firing with a NoData reason.
From my point of view, these are the key points.
My real-world observation:
The last container log line was at 2024-09-07T06:15:42.178000Z. In the file, there were additional log lines with timestamps 2024-09-07T06:15:45.798000Z, 2024-09-07T06:15:45.827000Z, and 2024-09-07T06:15:45.904000Z (almost a 4 s gap from the last container log, but only 0.1 s between the first missing line and the last one). If line buffering works as I expect, I would understand if only the last log message, at 2024-09-07T06:15:45.904000Z, were missing.
Since more than one line is missing, the only plausible root cause is the tail process polling at too long an interval.
What would be the safest (least invasive) way to fix this? Do you think it would be enough to add an extra sleep(n) before the os.kill? https://github.com/dagster-io/dagster/blob/5a828823b20b9f6544a04afd51ab2de720347f7c/python_modules/dagster/dagster/_core/execution/scripts/watch_orphans.py#L33
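A minimal sketch of that idea follows. This is not the actual `watch_orphans.py` code — the function name, loop shape, and `GRACE_PERIOD_SECONDS` are all assumptions for illustration; the real script differs.

```python
import os
import signal
import time

# Hypothetical grace period: give tail time to flush lines written
# just before the parent exited. The value would need tuning.
GRACE_PERIOD_SECONDS = 2


def watch(parent_pid: int, child_pid: int) -> None:
    """Poll until parent_pid dies, then wait a grace period before
    terminating the tail child, so it can echo any remaining lines."""
    while True:
        try:
            os.kill(parent_pid, 0)  # signal 0: existence check only
        except OSError:
            # Parent is gone: sleep first, then terminate the tailer.
            time.sleep(GRACE_PERIOD_SECONDS)
            os.kill(child_pid, signal.SIGTERM)
            return
        time.sleep(1)
```

The downside of any fixed sleep is that it trades a race for latency: it shrinks the window but cannot guarantee the tailer reached EOF before being killed.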
Another solution might be to remove the watcher process and use the --pid parameter (man: tail(1)):

```
--pid=PID
    with -f, terminate after process ID, PID dies;
    can be repeated to watch multiple processes
```

However, this does not work on macOS. So maybe keep macOS on the current flow and add --pid only on Linux? This way it could easily be fixed for dagster-k8s.
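A hedged sketch of what the platform split could look like. The function name and the exact tail flags are illustrative, not Dagster's actual code; GNU tail (Linux) supports --pid, while BSD tail (macOS) does not.

```python
import sys


def build_tail_command(path: str, parent_pid: int) -> list:
    """Build a tail command that mirrors `path` to real stdout.

    On Linux, GNU tail's --pid flag makes tail exit on its own once
    parent_pid dies, so no separate watcher process is needed. On
    other platforms (e.g. macOS), return a plain follow command that
    an external watcher must terminate, as today.
    """
    cmd = ["tail", "-F", "-n", "+1", path]
    if sys.platform.startswith("linux"):
        cmd += ["--pid", str(parent_pid)]
    return cmd
```

Whether GNU tail drains the file fully before exiting after the watched pid dies would still need to be verified against the coreutils version in the Dagster images.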
I would gladly volunteer to implement this fix for Linux, if you approve.
Thanks!
@HynekBlaha the --pid argument direction sounds promising to me - if you're willing to try it that sounds great!
Dagster version
1.7.15
What's the issue?
Dagster's stdout/stderr capture works by redirecting the stdout/stderr streams to a file on the local filesystem. To ensure that the captured log lines still appear in the "real" stdout/stderr, it also runs a tailer process that tails the local file and prints its contents back to the real stdout/stderr.
Sometimes (particularly in a containerized environment; I found this harder to reproduce locally), if events or log lines arrive right at the end of the step, the tailer process does not have time to print them to the process output before it is terminated, so they never appear in the process output. See: https://github.com/dagster-io/dagster/blob/20e8596e810ded6327052e7c622c39b940241c28/python_modules/dagster/dagster/_core/execution/compute_logs.py#L118-L146
What did you expect to happen?
All log lines from the step to be logged
How to reproduce?
Run the examples/deploy_docker example, launch a run, and view the container logs for the run: log lines are missing (STEP_START, STEP_SUCCESS, and several others, because the step in that example finishes right away).
Deployment type
None
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.