OSOceanAcoustics / echodataflow

Orchestrated sonar data processing workflow
https://echodataflow.readthedocs.io/en/latest/
MIT License

Log Contamination Across Pipeline Runs in Dask Cluster Setup #102

Closed by Sohambutala 3 months ago

Sohambutala commented 4 months ago

Description

Currently, we log using Dask streams. Under the hood, workers create log events, and the main program retrieves them once control returns to it.

The problem occurs when we set up a Dask cluster and run pipelines on it. Log events from previous pipeline runs are also added to the current run's logs. This contamination makes it hard to isolate and interpret the logs of an individual pipeline run.

Expected Behavior

Each pipeline run should have its own isolated logs, without contamination from previous runs.

Actual Behavior

Logs from previous pipeline runs are included in the logs of the current pipeline, leading to log contamination.

Steps to Reproduce

Steps to reproduce the behavior:

  1. Set up a Dask cluster.
  2. Run a pipeline that generates logs using Dask streams.
  3. Observe the logs.
  4. Run a second pipeline on the same cluster.
  5. Observe that logs from the first pipeline are also present in the second pipeline's logs.
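The steps above can be modeled without a cluster. The sketch below is a toy stand-in, not Dask or echodataflow code: `EventStore`, `run_pipeline`, and the topic name `"echodataflow"` are hypothetical, and the model assumes that events are buffered per topic somewhere that outlives individual pipeline runs while the cluster stays up.

```python
from collections import defaultdict


class EventStore:
    """Hypothetical stand-in for an event buffer that lives as long as the cluster."""

    def __init__(self):
        self._events = defaultdict(list)

    def log_event(self, topic, msg):
        # Called from "workers": append an event under a topic.
        self._events[topic].append(msg)

    def get_events(self, topic):
        # Called from the "main program": return everything stored for the topic.
        return list(self._events[topic])


def run_pipeline(store, name, topic="echodataflow"):
    # Every run logs to the same fixed topic (assumed here for illustration).
    store.log_event(topic, f"{name}: started")
    store.log_event(topic, f"{name}: finished")
    return store.get_events(topic)


store = EventStore()  # the cluster outlives both runs
first_logs = run_pipeline(store, "pipeline-1")
second_logs = run_pipeline(store, "pipeline-2")
print(second_logs)  # includes the events from pipeline-1 as well
```

Because nothing resets the shared buffer between runs, the second retrieval returns the first run's events too, which matches the contamination described in this issue.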

Possible Solution / Suggestion

Investigate how Dask streams handle log events and make sure that events are segregated per pipeline run. One option is to clear or reset log events at the start of each pipeline run.
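One way to segregate events, as an alternative to clearing them, might be to give every run its own topic. This is a minimal sketch under stated assumptions, not the fix that was adopted: `InMemoryEvents` is a hypothetical stand-in for the shared event buffer, and `RunScopedLog`, the base topic `"echodataflow"`, and the run ID scheme are illustrative names only.

```python
import uuid
from collections import defaultdict


class InMemoryEvents:
    """Hypothetical stand-in for a shared event buffer that survives across runs."""

    def __init__(self):
        self._events = defaultdict(list)

    def log_event(self, topic, msg):
        self._events[topic].append(msg)

    def get_events(self, topic):
        return list(self._events[topic])


class RunScopedLog:
    """Routes all events of one pipeline run through a run-specific topic."""

    def __init__(self, backend, base_topic="echodataflow", run_id=None):
        self.backend = backend
        # A fresh random ID per run keeps topics from colliding (assumed scheme).
        self.run_id = run_id or uuid.uuid4().hex
        self.topic = f"{base_topic}-{self.run_id}"

    def log(self, msg):
        self.backend.log_event(self.topic, msg)

    def collect(self):
        # Only events logged under this run's topic are returned.
        return self.backend.get_events(self.topic)


backend = InMemoryEvents()  # shared by both runs, like a long-lived cluster

run_a = RunScopedLog(backend)
run_a.log("run A: started")

run_b = RunScopedLog(backend)
run_b.log("run B: started")

print(run_b.collect())
```

Old events still accumulate in the shared buffer under this approach; they are simply never read by later runs. Whether stale events can also be cleared at the start of a run depends on what the real event backend exposes, which this sketch does not assume.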