Closed: axiomcura closed this 2 years ago
@gwaybio I have addressed all of your comments and suggestions. I have responded to your questions below:
> Do you think it is worth printing the configuration files inside the logs as well?
This is an interesting question!
The issue with printing all the configurations within the log file is that it would be extremely unreadable. We could develop a parser that extracts the contents of the config files, formats them into a readable form, and writes them into the log file.
From my experience, Python's `logging` package is very strict. I do not know whether the `logging` module allows multi-line logging; that is something to research in the future.
An alternative and cheap approach would be to create a folder like `used_configs` within the `archive_log` directory and copy the config files into the `used_configs` directory.
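That cheap approach could be sketched as follows (the directory names follow the comment above, and the YAML extension is an assumption about the config format):

```python
import shutil
from pathlib import Path

def archive_used_configs(config_dir: str, archive_log_dir: str) -> None:
    """Copy the config files used in a run into a used_configs folder
    inside the archive_log directory (names are illustrative)."""
    dest = Path(archive_log_dir) / "used_configs"
    dest.mkdir(parents=True, exist_ok=True)
    for config in Path(config_dir).glob("*.yaml"):  # assumes YAML configs
        # copy2 preserves file metadata, which helps later auditing
        shutil.copy2(config, dest / config.name)
```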
> With all print statements moved inside logs, are you at all concerned now about an overly silent runtime?
Absolutely. It will be more convenient to see major messages printed in the terminal during runtime. Thankfully, the `logging` module has the option to print messages to the terminal as well.
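As a sketch (logger and file names here are illustrative, not CytoPipe's actual configuration), a logger can write to a log file while also echoing messages to the terminal by attaching both a `FileHandler` and a `StreamHandler`:

```python
import logging
import sys

# Illustrative logger setup: file output plus terminal output.
logger = logging.getLogger("cytopipe_demo")  # hypothetical logger name
logger.setLevel(logging.INFO)

# FileHandler keeps the persistent log used for merging later.
file_handler = logging.FileHandler("cytopipe_demo.log", mode="w")
file_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(message)s")
)
logger.addHandler(file_handler)

# StreamHandler mirrors the same messages to the terminal so the
# runtime is not overly silent.
logger.addHandler(logging.StreamHandler(sys.stderr))

logger.info("preprocessing started")
```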
Overall, issue #9 will remain open. Your comments and suggestions brought up some important points that will be implemented in the future. For now, I will make note of them in #9 and merge the PR.
Motivation
Issue #9 explains the need for a logging system in order to keep track of all processes and generated files when executing CytoPipe.
One of the challenges of this task is dealing with parallelized processes, which introduce data race issues when multiple cores interact with one log file.

Approach
To address this issue, a separate Snakemake rule was developed solely for merging logs. This means that, within the `Snakefile`, the very last step has to be `merge_files.smk`.
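As a sketch of that ordering (the other module names here are hypothetical, not CytoPipe's actual ones), the `Snakefile` would include the log-merging module last:

```snakemake
# Hypothetical Snakefile layout; rule module names are illustrative.
include: "rules/preprocessing.smk"
include: "rules/analysis.smk"
include: "rules/merge_files.smk"  # merging logs runs last, after all other rules
```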
The `merge_files.smk` file contains a single rule that acknowledges all of the generated logs that are explicitly stated within the `Snakefile`:

The `LOG_NAMES` constant contains the names of the logs generated after all former processes are conducted. Here, the `expand` function generates a list of log paths dictated by the `logname` wildcard. The `logname` wildcard is assigned the generated list of log names that was initialized within the `Snakefile` under the `LOG_NAMES` constant variable.

Here is a simple image example of how the logging procedure shown above works:
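Concretely, the merge rule described above could look like the following sketch (the log names, paths, and script location are illustrative assumptions, not CytoPipe's actual values):

```snakemake
# Hypothetical sketch of the log-merging rule; names and paths are
# assumptions for illustration only.
LOG_NAMES = ["preprocess", "feature_select", "annotate"]

rule merge_logs:
    input:
        # expand() turns the {logname} wildcard into one concrete
        # path per entry in LOG_NAMES.
        expand("logs/{logname}.log", logname=LOG_NAMES),
    output:
        "logs/merged_log.log",
    script:
        "scripts/merge_logs.py"
```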
NOTE: this is not the true architecture of a DAG; this only illustrates the files that are generated within each DAG and how `merge_files.smk` interacts with them.

In this example, there are two separate DAGs, and each DAG contains rules that are associated with processes. In `DAG2`, `Rule2` is parallelizable due to its capability of spawning multiple processes. Each process event will generate an output file and a log file.

Once all the expected outputs are generated, the last step is to merge all the log files into one file. Below is a design concept of how `merge_logs.smk` works.

The above shows that `merge_logs.smk` acknowledges all the generated log outputs and uses them as inputs for the `merge_logs.py` script.

In the `merge_logs.py` script, the log files are ordered by creation time (earliest to latest). The script then extracts all contents from the log files and converts them into a pandas DataFrame, which provides easy functionality to sort and edit the log entries of all files. All the contents within the DataFrame are written into a single log file, `merged_log.log`, and the individual log files are archived into a separate folder within the `/logs`
directory.

Before `merge_logs.py`:

After `merge_logs.py`:

The `CytoPipe_preprocess.log` file is the merged log that contains the latest execution log. It will remain outside of the archived logs but will be overwritten if another process is executed. If a user wishes to view the previous merged logs, they can always refer to the archived logs, which will contain a copy of `CytoPipe_preprocess.log`.
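The merging step described above can be sketched as follows. This is a simplified stdlib-only version of the idea (the actual `merge_logs.py` builds a pandas DataFrame to sort and edit entries), and the `archived_logs` folder name is an assumption:

```python
import shutil
from pathlib import Path

def merge_logs(log_dir: str, merged_name: str = "merged_log.log") -> Path:
    """Simplified sketch of the merging idea: order the per-process log
    files by creation time, concatenate their contents into one merged
    log, and archive the originals into a subfolder of the log directory."""
    log_path = Path(log_dir)
    # Order individual logs from earliest to latest creation time.
    logs = sorted(log_path.glob("*.log"), key=lambda p: p.stat().st_ctime)

    merged = log_path / merged_name
    archive = log_path / "archived_logs"  # assumed archive folder name
    archive.mkdir(exist_ok=True)

    with open(merged, "w") as out:
        for log in logs:
            out.write(log.read_text())
            # Archive the individual log once it has been merged.
            shutil.move(str(log), str(archive / log.name))
    return merged
```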