cumc / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

Meta info database reimplementation #6

Open gaow opened 3 years ago

gaow commented 3 years ago

Improve DSC meta information database

Please first install DSC from the development repo:

pip install git+https://github.com/cumc/dsc -U

Problem overview

We use this toy benchmark as an example:

dsc first_investigation.dsc

There will then be two folders in the directory where you ran the command:

- dsc_result
- .sos

Inside the .sos folder there are several files:

# Generated by DSC
dsc_result.cfg.pkl
dsc_result.io.meta.pkl
dsc_result.io.pkl
# Generated by SoS as it executes the DSC benchmark
step_signatures.db  
transcript.txt  
workflow_signatures.db

These pkl files generated by DSC have information extracted from the *.dsc script currently being executed.

Inside the dsc_result folder, apart from some folders that contain intermediate results, there are two files:

# Generated by DSC
dsc_result.map.mpk 
dsc_result.db  

dsc_result.map.mpk is meant to preserve information from multiple runs of DSC (@BoPeng: in SoS terminology, a module instance is a "substep"). Every time a DSC command runs, dsc_result.map.mpk should be updated, not re-written. It is a key-value (dictionary) database saved in msgpack format.

dsc_result.db is in pickle format, just with an (arbitrary) db file extension. It contains information about the current DSC run.
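Despite the .db extension, the file should load with plain pickle (a minimal check, assuming the toy benchmark above has been run):

import pickle

with open('dsc_result/dsc_result.db', 'rb') as f:
    db = pickle.load(f)
print(type(db))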

Relationship among these files

Current task

Let's start by reimplementing dsc_result.map.mpk. The goal is to still take dsc_result.cfg.pkl and dsc_result.io.meta.pkl as input, but to efficiently update dsc_result.map.mpk, consolidating it with info from previous runs, and to generate dsc_result.io.pkl for the current run. Two requirements:

  1. It is updated, and not rewritten, at each DSC run (a minimal sketch of this load-merge-dump cycle follows the list)
  2. Databases from different users can easily be merged
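A sketch of requirement 1, assuming the current msgpack key-value layout is kept; the update_map helper and its signature are illustrative, not existing DSC API:

import os
import msgpack

def update_map(path, new_entries):
    # Load the existing map database if present, merge in the new
    # entries, and write everything back; existing keys keep their
    # previously assigned filenames.
    db = {}
    if os.path.exists(path):
        with open(path, 'rb') as f:
            db = msgpack.unpack(f, raw=False)
    db.update(new_entries)
    with open(path, 'wb') as f:
        msgpack.pack(db, f)
    return db

Note this still rewrites the whole file on disk even though the logical content is only updated; part of this issue is finding a backend that avoids exactly that.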

Input data explained

.sos/dsc_result.io.meta.pkl

import pickle
pickle.load(open('.sos/dsc_result.io.meta.pkl','rb'))
{1: {'normal': ['normal', 1], 'mean': ['mean', 1], 'abs_err': ['abs_err', 1]},
 2: {'normal': ['normal', 1], 'mean': ['mean', 1], 'sq_err': ['sq_err', 2]},
 3: {'normal': ['normal', 1],
  'median': ['median', 3],
  'abs_err': ['abs_err', 3]},
 4: {'normal': ['normal', 1],
  'median': ['median', 3],
  'sq_err': ['sq_err', 4]},
 5: {'t': ['t', 5], 'mean': ['mean', 5], 'abs_err': ['abs_err', 5]},
 6: {'t': ['t', 5], 'mean': ['mean', 5], 'sq_err': ['sq_err', 6]},
 7: {'t': ['t', 5], 'median': ['median', 7], 'abs_err': ['abs_err', 7]},
 8: {'t': ['t', 5], 'median': ['median', 7], 'sq_err': ['sq_err', 8]}}
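The structure is not documented here, but from the dump it appears that each top-level key is a pipeline ID, and each value maps a module name to [module, ID of the pipeline where this module instance first occurs], so reused instances point back to an earlier pipeline. A minimal sketch based on that reading:

import pickle

with open('.sos/dsc_result.io.meta.pkl', 'rb') as f:
    meta = pickle.load(f)

# Report module instances that are reused from an earlier pipeline,
# i.e. whose recorded pipeline ID differs from the current one.
for pid, modules in meta.items():
    for name, (module, first_pid) in modules.items():
        if first_pid != pid:
            print(f'pipeline {pid}: {name} reused from pipeline {first_pid}')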

.sos/dsc_result.cfg.pkl

pickle.load(open('.sos/dsc_result.cfg.pkl','rb'))
(('normal', 1),
              {('normal:3fce637f',): {'__pipeline_id__': 1,
                '__pipeline_name__': 'a_normal+a_mean+a_abs_err',
                '__module__': 'normal',
                '__out_vars__': ['data', 'true_mean'],
                'DSC_REPLICATE': 1,
                'n': 100,
                'mu': 0},
               '__input_output___': ([], ['normal:3fce637f']),
               '__ext__': 'rds'})
...
...
 (('abs_err', 1),
              {('abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f',
                'normal:3fce637f',
                'mean:e3f9ad83:normal:3fce637f'): {'__pipeline_id__': 1,
                '__pipeline_name__': 'a_normal+a_mean+a_abs_err',
                '__module__': 'abs_err',
                '__out_vars__': ['error']},
               '__input_output___': (['normal:3fce637f',
                 'mean:e3f9ad83:normal:3fce637f'],
                ['abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f']),
               '__ext__': 'rds'})
...

This file contains information for each module instance. It is where most of the information for updating map.mpk comes from. Take ('normal', 1) for example: the tuple key ('normal:3fce637f',) is the signature of this module instance, and its value records the pipeline ID and name, the module name, the output variables, and the module parameters (DSC_REPLICATE, n and mu); __input_output___ lists the input signatures (empty here, since normal has no upstream module) and the output signature.

Now look at the more complicated ('abs_err', 1). Its key ('abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f', 'normal:3fce637f', 'mean:e3f9ad83:normal:3fce637f') has two components: the first element is the signature of the abs_err instance itself (which embeds the signatures of its upstream modules), and the remaining elements are the signatures of its inputs, the normal and mean instances it depends on. This matches the (inputs, outputs) pair stored under __input_output___.
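A sketch of walking this file, assuming the unpickled object behaves as a mapping from (module, pipeline ID) to the per-instance dict shown above:

import pickle

with open('.sos/dsc_result.cfg.pkl', 'rb') as f:
    cfg = pickle.load(f)

# Print each module instance with its input and output signatures,
# taken from the '__input_output___' entry.
for (module, pid), info in cfg.items():
    inputs, outputs = info['__input_output___']
    print(module, pid, 'inputs:', inputs, 'outputs:', outputs)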

dsc_result/dsc_result.map.mpk

This is the database we'll reimplement.

import msgpack
msgpack.unpack(open('dsc_result/dsc_result.map.mpk','rb'), raw=False)
{'normal:3fce637f': 'normal/normal_1.rds',
 'mean:e3f9ad83:normal:3fce637f': 'mean/normal_1_mean_1.rds',
 'abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f': 'abs_err/normal_1_mean_1_abs_err_1.rds',
 'sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f': 'sq_err/normal_1_mean_1_sq_err_1.rds',
 'median:45c94289:normal:3fce637f': 'median/normal_1_median_1.rds',
 'abs_err:0acdbf79:normal:3fce637f:median:45c94289:normal:3fce637f': 'abs_err/normal_1_median_1_abs_err_1.rds',
 'sq_err:cd547d28:normal:3fce637f:median:45c94289:normal:3fce637f': 'sq_err/normal_1_median_1_sq_err_1.rds',
 't:52a5d4d3': 't/t_1.rds',
 'mean:e3f9ad83:t:52a5d4d3': 'mean/t_1_mean_1.rds',
 'abs_err:0acdbf79:t:52a5d4d3:mean:e3f9ad83:t:52a5d4d3': 'abs_err/t_1_mean_1_abs_err_1.rds',
 'sq_err:cd547d28:t:52a5d4d3:mean:e3f9ad83:t:52a5d4d3': 'sq_err/t_1_mean_1_sq_err_1.rds',
 'median:45c94289:t:52a5d4d3': 'median/t_1_median_1.rds',
 'abs_err:0acdbf79:t:52a5d4d3:median:45c94289:t:52a5d4d3': 'abs_err/t_1_median_1_abs_err_1.rds',
 'sq_err:cd547d28:t:52a5d4d3:median:45c94289:t:52a5d4d3': 'sq_err/t_1_median_1_sq_err_1.rds',
 '__base_ids__': {'normal': {'normal': 1},
  'normal:mean': {'normal': 1, 'mean': 1},
  'normal:mean:abs_err': {'normal': 1, 'mean': 1, 'abs_err': 1},
  'normal:mean:sq_err': {'normal': 1, 'mean': 1, 'sq_err': 1},
  'normal:median': {'normal': 1, 'median': 1},
  'normal:median:abs_err': {'normal': 1, 'median': 1, 'abs_err': 1},
  'normal:median:sq_err': {'normal': 1, 'median': 1, 'sq_err': 1},
  't': {'t': 1},
  't:mean': {'t': 1, 'mean': 1},
  't:mean:abs_err': {'t': 1, 'mean': 1, 'abs_err': 1},
  't:mean:sq_err': {'t': 1, 'mean': 1, 'sq_err': 1},
  't:median': {'t': 1, 'median': 1},
  't:median:abs_err': {'t': 1, 'median': 1, 'abs_err': 1},
  't:median:sq_err': {'t': 1, 'median': 1, 'sq_err': 1}}}

The main content is very simple: in 'mean:e3f9ad83:t:52a5d4d3': 'mean/t_1_mean_1.rds', each key corresponds to one unique file name to be saved on disk. What is difficult is to efficiently figure out what the file name should be. Take 'mean:e3f9ad83:t:52a5d4d3' for example: it is a mean module taking a t module as input, so it should end up in the mean/ folder with a t_??_mean_?? file name, indicating that the pipeline so far executed t followed by mean. But we want to assign the unique ?? numbers such that they have a stable relationship with the module instance. For example, t:52a5d4d3 will always correspond to t_1. The file names are then easy to read, because whenever we see t_1 we know the files all come from the same t module instance.
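One way to obtain such stable numbers (a sketch of the idea only, not DSC's current algorithm) is to key the numbering on the module's own parameter hash, i.e. the second field of the signature (e3f9ad83 in mean:e3f9ad83:t:52a5d4d3, which is shared by both mean instances and hence gives both the number 1):

counters = {}  # module name -> {parameter hash: assigned number}

def instance_number(module, param_hash):
    # The same parameter hash always gets the same number; a new hash
    # gets the next free integer for that module.
    seen = counters.setdefault(module, {})
    if param_hash not in seen:
        seen[param_hash] = len(seen) + 1
    return seen[param_hash]

def filename(chain, ext='rds'):
    # chain: ordered (module, parameter hash) pairs from pipeline start
    stem = '_'.join(f'{m}_{instance_number(m, h)}' for m, h in chain)
    return f'{chain[-1][0]}/{stem}.{ext}'

chain = [('t', '52a5d4d3'), ('mean', 'e3f9ad83')]
print(filename(chain))  # mean/t_1_mean_1.rds

To satisfy the update-not-rewrite requirement, counters would have to be persisted across runs; that is essentially the role __base_ids__ plays today.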

Currently my implementation is very simple: everything is written to a single file with the contents above. When new information needs to be added to it at the next DSC run:

  1. The entire file dsc_result/dsc_result.map.mpk is loaded
  2. Based on the input file dsc_result.cfg.pkl, we check which module instances are already in this map database. If an instance is there, we return the filename it corresponds to; otherwise we have to figure out a unique file name for it. This is currently achieved with a __base_ids__ entry that keeps track of the numbering so far and is extended for new input to generate new file names. I will not explain it in detail, because this design is inefficient and we should do a better job in the new implementation

.sos/dsc_result.io.pkl

pickle.load(open('.sos/dsc_result.io.pkl','rb'))
OrderedDict([('1',
              OrderedDict([('normal',
                            OrderedDict([('input', []),
                                         ('output',
                                          ['dsc_result/normal/normal_1.rds']),
                                         ('depends', [])])),
                           ('mean',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds']),
                                         ('output',
                                          ['dsc_result/mean/normal_1_mean_1.rds']),
                                         ('depends', [('normal', 1)])])),
                           ('abs_err',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds',
                                           'dsc_result/mean/normal_1_mean_1.rds']),
                                         ('output',
                                          ['dsc_result/abs_err/normal_1_mean_1_abs_err_1.rds']),
                                         ('depends',
                                          [('normal', 1), ('mean', 1)])]))])),
             ('2',
              OrderedDict([('normal', ('1', 'normal')),
                           ('mean', ('1', 'mean')),
                           ('sq_err',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds',
                                           'dsc_result/mean/normal_1_mean_1.rds']),
                                         ('output',
                                          ['dsc_result/sq_err/normal_1_mean_1_sq_err_1.rds']),
                                         ('depends',
                                          [('normal', 1), ('mean', 1)])]))])),
...

Here, in pipeline '2', the normal module instance is the same as in pipeline '1', so instead of writing the same information again, the entry ('normal', ('1', 'normal')) is a shortcut pointing back to it.
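A sketch of resolving these shortcuts when reading the file (the resolve helper is hypothetical):

import pickle

with open('.sos/dsc_result.io.pkl', 'rb') as f:
    io = pickle.load(f)

def resolve(pid, name):
    # A tuple entry such as ('1', 'normal') points at another
    # (pipeline, module) entry; follow it until we reach real data.
    entry = io[pid][name]
    while isinstance(entry, tuple):
        entry = io[entry[0]][entry[1]]
    return entry

print(resolve('2', 'normal')['output'])  # dsc_result/normal/normal_1.rds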

.sos/dsc_result.io.pkl thus combines information from cfg.pkl, which has the pipeline and dependency info, with the filenames found in map.mpk, to produce meaningful paths for every module instance.

Proposed new implementation for dsc_result/dsc_result.map.mpk

We should have a database like this:

module | file | depends | parameters
normal:3fce637f | normal/normal_1.rds | None | ...
sq_err:cd547d28:t:52a5d4d3:median:45c94289:t:52a5d4d3 | sq_err/t_1_median_1_sq_err_1.rds | (t:52a5d4d3, median:45c94289:t:52a5d4d3) | ...

where the parameters column saves all parameters from the cfg.pkl file. The file column should contain the unique, human-readable filename, generated efficiently; this is perhaps the most difficult feature to implement.

This database should support efficient row addition and deletion.
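As one candidate backend (an assumption for illustration, not a decided choice), sqlite3 from the Python standard library supports this layout directly; the file name dsc_result.map.db and the JSON-serialized parameters column are illustrative:

import json
import sqlite3

con = sqlite3.connect('dsc_result/dsc_result.map.db')
con.execute('''CREATE TABLE IF NOT EXISTS map (
                 module     TEXT PRIMARY KEY,
                 file       TEXT NOT NULL,
                 depends    TEXT,
                 parameters TEXT)''')

# Adding a row is idempotent: a module instance that is already
# present keeps its existing filename.
con.execute('INSERT OR IGNORE INTO map VALUES (?, ?, ?, ?)',
            ('normal:3fce637f', 'normal/normal_1.rds', None,
             json.dumps({'DSC_REPLICATE': 1, 'n': 100, 'mu': 0})))
con.execute('DELETE FROM map WHERE module = ?', ('normal:3fce637f',))
con.commit()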

gaow commented 3 years ago

This task should be easier than it sounds ... I went to extra lengths above to explain what we have, to make sure all details are covered. But essentially it boils down to

  1. Choosing a reliable, portable database implementation for the dsc_result.map.mpk file: efficient query, addition and deletion; support for complex data types in the parameters column; and easy merging of multiple such databases (a merge sketch follows the list)
  2. With the new database in place, an efficient algorithm to figure out the file names in the file column.
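For the merging requirement in point 1, and assuming the sqlite3 layout sketched in the previous comment, one user's rows can be folded into another's with ATTACH:

import sqlite3

con = sqlite3.connect('dsc_result/dsc_result.map.db')
# 'their_map.db' stands in for a collaborator's copy of the database.
con.execute("ATTACH DATABASE 'their_map.db' AS theirs")
con.execute('INSERT OR IGNORE INTO map SELECT * FROM theirs.map')
con.commit()

This resolves duplicate signatures, but two databases may still have assigned the same human-readable filename to different signatures, so a merge would also need to renumber such collisions; that is part of the algorithm in point 2.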

@junyanj1 I'm assigning you to this task but we can all discuss here.