WayScience / CytoSnake

Orchestrating high-dimensional cell morphology data processing pipelines
https://cytosnake.readthedocs.io
Creative Commons Attribution 4.0 International

Logging implementation for CytoPipe #14

Closed · axiomcura closed this 2 years ago

axiomcura commented 2 years ago

Motivation

Issue #9 explains the need for a logging system to keep track of all processes and generated files when executing CytoPipe.

One of the challenges of this task is dealing with parallelized processes, which introduce data-race issues when multiple cores interact with one log file.

Approach

To address this issue, a separate Snakemake rule was developed solely for merging logs. This means that, within the Snakefile, the very last included step has to be merge_logs.smk.

The merge_logs.smk file contains a single rule that collects all of the generated logs explicitly stated within the Snakefile:

# Snakefile

# importing DAGs
include: "rules/preprocessing.smk"
include: "rules/feature_select.smk"

# appending logs
LOG_NAMES = glob_wildcards("logs/{log_name}.log").log_name
include: "rules/merge_logs.smk"

The LOG_NAMES constant contains the names of the logs generated once all preceding rules have run.
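As a quick illustration (not part of the codebase), glob_wildcards matches existing files against a pattern and returns the extracted wildcard values:

# illustration: glob_wildcards pulls the {log_name} values out of
# files that already exist on disk
from snakemake.io import glob_wildcards

# if logs/ contains aggregate_SQ00014610.log and feature_select.log, then
glob_wildcards("logs/{log_name}.log").log_name
# -> ["aggregate_SQ00014610", "feature_select"]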

# merge_logs.smk

rule merge_logs:
    input:
        expand("logs/{logname}.log", logname=LOG_NAMES),
    output:
        "logs/CytoPipe_preprocess.log",
    script:
        "../scripts/merge_logs.py"

Here the expand function generates the list of log paths dictated by the logname wildcard. The logname wildcard is assigned the list of log names that was initialized within the Snakefile under the LOG_NAMES constant.
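For illustration, a minimal sketch of what expand produces (the log names are example values):

# illustration: expand() substitutes every value of the wildcard,
# yielding the full list of expected log paths
from snakemake.io import expand

expand("logs/{logname}.log", logname=["aggregate_SQ00014610", "annotate_SQ00014610"])
# -> ["logs/aggregate_SQ00014610.log", "logs/annotate_SQ00014610.log"]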

Here is a simple diagram showing how the logging procedure described above works:

NOTE: this is not the true architecture of a DAG; it only illustrates the files produced within each DAG and how merge_logs.smk interacts with them.

flowchart LR
%% first DAG
D1(DAG1) --> D1R1(Rule1)
D1R1 --> D1P1(Process) --> D1L1(D1_Rule1_Log)
D1P1 --> D1O1(D1_Rule1_output)
D1 --> D1R2(Rule2) --> D1P2(Process) --> D1L2(D1_Rule2_Log)
D1P2 --> D1O2(D1_Rule2_output)

%% Second DAG
D2(DAG2) --> D2R1(Rule1)
D2R1 --> D2R1P1(Process) --> D2R1O1(D2_Rule1_output)
D2R1P1 --> D2R1L1(D2_Rule1_Log)
D2 --> D2R2(Rule2)
D2R2 --> D2P1(Process) --> D2L1(D2_Rule2_Log1)
D2P1 --> D2O1(D2_Rule2_output1)
D2R2 --> D2P2(Process) --> D2L2(D2_Rule2_Log2)
D2P2 --> D2O2(D2_Rule2_output2)
D2R2 --> D2P3(Process) --> D2L3(D2_Rule2_Log3)
D2P3 --> D2O3(D2_Rule2_output3)

In this example, there are two separate DAGs, and each DAG contains rules that are associated with processes. In DAG2, Rule2 is parallelizable because it is capable of spawning multiple processes.

Each process generates an output file and a log file.

Once all the expected outputs are generated, the last step is to merge all the log files into one file. Below is a design concept of how merge_logs.smk works.

flowchart LR
D1L1(D1_Rule1_Log) --> S(merge_logs.py)
D1L2(D1_Rule2_Log) --> S
D2R1L1(D2_Rule1_Log) --> S
D2L1(D2_Rule2_Log1) --> S
D2L2(D2_Rule2_Log2) --> S
D2L3(D2_Rule2_Log3) --> S
S --> merged_log.log

The diagram above shows that merge_logs.smk collects all of the generated log outputs and uses them as inputs for the merge_logs.py script.

In the merge_logs.py script, the log files are ordered by creation time (earliest to latest). The script then extracts all contents from the log files and loads them into a pandas DataFrame, which provides easy functionality for sorting and editing the log entries of all files.

All of the contents within the DataFrame are then written into a single merged log file (merged_log.log in the diagram; CytoPipe_preprocess.log in the actual rule), and the individual log files are archived into a separate folder within the /logs directory.
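A minimal sketch of how such a merge script could work is shown below. This is an illustration of the design described above, not the actual scripts/merge_logs.py (which would receive its inputs through Snakemake's snakemake.input object):

# merge_logs.py (illustrative sketch only)
import pathlib
import shutil
from datetime import datetime

import pandas as pd

log_dir = pathlib.Path("logs")

# order the individual logs by creation time (earliest to latest),
# skipping any merged log left over from a previous run
log_files = sorted(
    (p for p in log_dir.glob("*.log") if p.name != "CytoPipe_preprocess.log"),
    key=lambda p: p.stat().st_ctime,
)

# load every log entry into a DataFrame for sorting and editing
entries = [
    {"file": log.name, "entry": line}
    for log in log_files
    for line in log.read_text().splitlines()
]
df = pd.DataFrame(entries)

# write all entries into the single merged log
(log_dir / "CytoPipe_preprocess.log").write_text("\n".join(df["entry"]) + "\n")

# archive the individual logs into a timestamped folder,
# e.g. logs/072922-080400-archived_logs
archive_dir = log_dir / f"{datetime.now():%m%d%y-%H%M%S}-archived_logs"
archive_dir.mkdir()
for log in log_files:
    shutil.move(str(log), str(archive_dir / log.name))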

Before merge_logs.py:

aggregate_SQ00014610.log  aggregate_SQ00014616.log  annotate_SQ00014613.log  create_consensus.log       normalized_SQ00014613.log
aggregate_SQ00014611.log  aggregate_SQ00014617.log  annotate_SQ00014614.log  normalized_SQ00014618.log  normalized_SQ00014614.log
aggregate_SQ00014612.log  aggregate_SQ00014618.log  annotate_SQ00014615.log  feature_select.log         normalized_SQ00014615.log
aggregate_SQ00014613.log  annotate_SQ00014610.log   annotate_SQ00014616.log  normalized_SQ00014610.log  normalized_SQ00014616.log
aggregate_SQ00014614.log  annotate_SQ00014611.log   annotate_SQ00014617.log  normalized_SQ00014611.log  normalized_SQ00014617.log
aggregate_SQ00014615.log  annotate_SQ00014612.log   annotate_SQ00014618.log  normalized_SQ00014612.log  

After merge_logs.py:

072922-080400-archived_logs  CytoPipe_preprocess.log

CytoPipe_preprocess.log is the merged log containing the latest execution's log. It remains outside of the archived logs but will be overwritten if another process is executed. If a user wishes to view previous merged logs, they can always refer to the archived logs, which will contain a copy of CytoPipe_preprocess.log.

axiomcura commented 2 years ago

@gwaybio I have attended to all of your comments and suggestions. I have responded to your questions below:

Do you think it is worth printing the configuration files inside the logs as well?

This is an interesting question!

The issue with printing all of the configurations within the log file is that it would be extremely unreadable. We could develop a parser that extracts the contents of the config files, formats them into a readable form, and writes them into the log file.

In my experience, Python's logging package is very strict. I do not know if the logging module allows multi-line logging; that is something to research in the future.
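As a rough sketch of that parser idea (the log_config helper is hypothetical, and YAML configs are assumed), logging each key/value pair on its own line would also sidestep the multi-line question:

# hypothetical sketch of a config-to-log parser
import logging

import yaml

def log_config(config_path: str, logger: logging.Logger) -> None:
    """Load a YAML config and log each key/value pair on its own line."""
    with open(config_path) as stream:
        config = yaml.safe_load(stream)
    logger.info("config: %s", config_path)
    for key, value in config.items():
        logger.info("  %s = %s", key, value)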

An alternative and cheap approach would be to create a folder like used_configs within the archive_log directory and copy the config files into the used_configs directory.
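That cheap approach could look roughly like this (the configs/ path and the .yaml extension are assumptions):

# hypothetical sketch: copy this run's configs next to the archived logs
import pathlib
import shutil

archive_dir = pathlib.Path("logs/072922-080400-archived_logs")  # example name
used_configs = archive_dir / "used_configs"
used_configs.mkdir(parents=True, exist_ok=True)

for config_file in pathlib.Path("configs").glob("*.yaml"):
    shutil.copy(config_file, used_configs / config_file.name)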

With all print statements moved inside logs, are you at all concerned now about an overly silent runtime?

Absolutely. It would be more convenient to see major messages printed in the terminal during runtime. Thankfully, the logging module has the option to print messages to the terminal as well.
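For example, the standard library lets a logger write to a file and the terminal at once by attaching two handlers (the logger name and file path here are illustrative):

# standard-library example: log to a file and the terminal simultaneously
import logging
import sys

logger = logging.getLogger("cytopipe")  # illustrative name
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler("logs/run.log"))
logger.addHandler(logging.StreamHandler(sys.stderr))

logger.info("feature selection complete")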


Overall, issue #9 will remain open. Your comments and suggestions brought up some important points that will be implemented in the future. For now, I will make note of them in #9 and merge the PR.