WayScience / CytoSnake

Orchestrating high-dimensional cell morphology data processing pipelines
https://cytosnake.readthedocs.io
Creative Commons Attribution 4.0 International

Logging implementation for CytoPipe #14

Closed · axiomcura closed this 2 years ago

axiomcura commented 2 years ago

Motivation

Issue #9 explains the need for a logging system to keep track of all processes and generated files when executing CytoPipe.

One of the challenges of this task is dealing with parallelized processes, which introduce data-race issues when multiple cores interact with one log file.

Approach

To address this issue, a separate Snakemake rule was developed solely for merging logs. This means that, within the Snakefile, the very last included step has to be merge_logs.smk.

The merge_logs.smk file contains a single rule that collects all of the generated logs explicitly stated within the Snakefile:

# Snakefile

# importing DAGs
include: "rules/preprocessing.smk"
include: "rules/feature_select.smk"

# appending logs
LOG_NAMES = glob_wildcards("logs/{log_name}.log").log_name
include: "rules/merge_logs.smk"

The LOG_NAMES constant contains the names of the logs generated once all preceding rules have run.
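As a quick illustration (not part of the codebase), glob_wildcards matches existing files against a pattern and returns the extracted wildcard values:

# illustration: glob_wildcards pulls the {log_name} values out of
# files that already exist on disk
from snakemake.io import glob_wildcards

# if logs/ contains aggregate_SQ00014610.log and feature_select.log, then
glob_wildcards("logs/{log_name}.log").log_name
# -> ["aggregate_SQ00014610", "feature_select"]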

# merge_logs.smk

rule merge_logs:
    input:
        expand("logs/{logname}.log", logname=LOG_NAMES),
    output:
        "logs/CytoPipe_preprocess.log",
    script:
        "../scripts/merge_logs.py"

Here the expand function generates the list of log paths dictated by the logname wildcard. The logname wildcard is assigned the list of log names that was initialized within the Snakefile under the LOG_NAMES constant.
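For illustration, a minimal sketch of what expand produces (the log names are example values):

# illustration: expand() substitutes every value of the wildcard,
# yielding the full list of expected log paths
from snakemake.io import expand

expand("logs/{logname}.log", logname=["aggregate_SQ00014610", "annotate_SQ00014610"])
# -> ["logs/aggregate_SQ00014610.log", "logs/annotate_SQ00014610.log"]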

Here is a simple diagram showing how the logging procedure described above works:

NOTE: this is not the true architecture of a DAG; it only illustrates the files produced within each DAG and how merge_logs.smk interacts with them.

flowchart LR
%% first DAG
D1(DAG1) --> D1R1(Rule1)
D1R1 --> D1P1(Process) --> D1L1(D1_Rule1_Log)
D1P1 --> D1O1(D1_Rule1_output)
D1 --> D1R2(Rule2) --> D1P2(Process) --> D1L2(D1_Rule2_Log)
D1P2 --> D1O2(D1_Rule2_output)

%% Second DAG
D2(DAG2) --> D2R1(Rule1)
D2R1 --> D2R1P1(Process) --> D2R1O1(D2_Rule1_output)
D2R1P1 --> D2R1L1(D2_Rule1_Log)
D2 --> D2R2(Rule2)
D2R2 --> D2P1(Process) --> D2L1(D2_Rule2_Log1)
D2P1 --> D2O1(D2_Rule2_output1)
D2R2 --> D2P2(Process) --> D2L2(D2_Rule2_Log2)
D2P2 --> D2O2(D2_Rule2_output2)
D2R2 --> D2P3(Process) --> D2L3(D2_Rule2_Log3)
D2P3 --> D2O3(D2_Rule2_output3)

In this example, there are two separate DAGs, and each DAG contains rules that are associated with processes. In DAG2, Rule2 is parallelizable because it is capable of spawning multiple processes.

Each process generates an output file and a log file.

Once all the expected outputs are generated, the last step is to merge all the log files into one file. Below is a design concept of how merge_logs.smk works.

flowchart LR
D1L1(D1_Rule1_Log) --> S(merge_logs.py)
D1L2(D1_Rule2_Log) --> S
D2R1L1(D2_Rule1_Log) --> S
D2L1(D2_Rule2_Log1) --> S
D2L2(D2_Rule2_Log2) --> S
D2L3(D2_Rule2_Log3) --> S
S --> merged_log.log

The diagram above shows that merge_logs.smk collects all of the generated log outputs and uses them as inputs for the merge_logs.py script.

In the merge_logs.py script, the log files are ordered by creation time (earliest to latest). The script then extracts all contents from the log files and loads them into a pandas DataFrame, which provides easy functionality for sorting and editing the log entries of all files.

All of the contents within the DataFrame are then written into a single merged log file (merged_log.log in the diagram; CytoPipe_preprocess.log in the actual rule), and the individual log files are archived into a separate folder within the /logs directory.
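A minimal sketch of how such a merge script could work is shown below. This is an illustration of the design described above, not the actual scripts/merge_logs.py (which would receive its inputs through Snakemake's snakemake.input object):

# merge_logs.py (illustrative sketch only)
import pathlib
import shutil
from datetime import datetime

import pandas as pd

log_dir = pathlib.Path("logs")

# order the individual logs by creation time (earliest to latest),
# skipping any merged log left over from a previous run
log_files = sorted(
    (p for p in log_dir.glob("*.log") if p.name != "CytoPipe_preprocess.log"),
    key=lambda p: p.stat().st_ctime,
)

# load every log entry into a DataFrame for sorting and editing
entries = [
    {"file": log.name, "entry": line}
    for log in log_files
    for line in log.read_text().splitlines()
]
df = pd.DataFrame(entries)

# write all entries into the single merged log
(log_dir / "CytoPipe_preprocess.log").write_text("\n".join(df["entry"]) + "\n")

# archive the individual logs into a timestamped folder,
# e.g. logs/072922-080400-archived_logs
archive_dir = log_dir / f"{datetime.now():%m%d%y-%H%M%S}-archived_logs"
archive_dir.mkdir()
for log in log_files:
    shutil.move(str(log), str(archive_dir / log.name))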

Before merge_logs.py:

aggregate_SQ00014610.log  aggregate_SQ00014616.log  annotate_SQ00014613.log  create_consensus.log       normalized_SQ00014613.log
aggregate_SQ00014611.log  aggregate_SQ00014617.log  annotate_SQ00014614.log  normalized_SQ00014618.log  normalized_SQ00014614.log
aggregate_SQ00014612.log  aggregate_SQ00014618.log  annotate_SQ00014615.log  feature_select.log         normalized_SQ00014615.log
aggregate_SQ00014613.log  annotate_SQ00014610.log   annotate_SQ00014616.log  normalized_SQ00014610.log  normalized_SQ00014616.log
aggregate_SQ00014614.log  annotate_SQ00014611.log   annotate_SQ00014617.log  normalized_SQ00014611.log  normalized_SQ00014617.log
aggregate_SQ00014615.log  annotate_SQ00014612.log   annotate_SQ00014618.log  normalized_SQ00014612.log  

After merge_logs.py:

072922-080400-archived_logs  CytoPipe_preprocess.log

CytoPipe_preprocess.log is the merged log containing the latest execution's log. It remains outside of the archived logs but will be overwritten if another process is executed. If a user wishes to view previous merged logs, they can always refer to the archived logs, which will contain a copy of CytoPipe_preprocess.log.

axiomcura commented 2 years ago

@gwaybio I have attended to all of your comments and suggestions. I have responded to your questions below:

Do you think it is worth printing the configuration files inside the logs as well?

This is an interesting question!

The issue with printing all of the configurations within the log file is that it would be extremely unreadable. We could develop a parser that extracts the contents of the config files, formats them into a readable form, and writes them into the log file.

In my experience, Python's logging package is very strict. I do not know if the logging module allows multi-line logging; that is something to research in the future.
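As a rough sketch of that parser idea (the log_config helper is hypothetical, and YAML configs are assumed), logging each key/value pair on its own line would also sidestep the multi-line question:

# hypothetical sketch of a config-to-log parser
import logging

import yaml

def log_config(config_path: str, logger: logging.Logger) -> None:
    """Load a YAML config and log each key/value pair on its own line."""
    with open(config_path) as stream:
        config = yaml.safe_load(stream)
    logger.info("config: %s", config_path)
    for key, value in config.items():
        logger.info("  %s = %s", key, value)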

An alternative and cheap approach would be to create a folder like used_configs within the archive_log directory and copy the config files into the used_configs directory.
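That cheap approach could look roughly like this (the configs/ path and the .yaml extension are assumptions):

# hypothetical sketch: copy this run's configs next to the archived logs
import pathlib
import shutil

archive_dir = pathlib.Path("logs/072922-080400-archived_logs")  # example name
used_configs = archive_dir / "used_configs"
used_configs.mkdir(parents=True, exist_ok=True)

for config_file in pathlib.Path("configs").glob("*.yaml"):
    shutil.copy(config_file, used_configs / config_file.name)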

With all print statements moved inside logs, are you at all concerned now about an overly silent runtime?

Absolutely. It would be more convenient to see major messages printed in the terminal during runtime. Thankfully, the logging module has the option to print messages to the terminal as well.
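For example, the standard library lets a logger write to a file and the terminal at once by attaching two handlers (the logger name and file path here are illustrative):

# standard-library example: log to a file and the terminal simultaneously
import logging
import sys

logger = logging.getLogger("cytopipe")  # illustrative name
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler("logs/run.log"))
logger.addHandler(logging.StreamHandler(sys.stderr))

logger.info("feature selection complete")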


Overall, issue #9 will remain open. Your comments and suggestions brought up some important points that will be implemented in the future. For now, I will make note of them in #9 and merge the PR.