cumc / dsc

Repo for Dynamic Statistical Comparisons project
https://stephenslab.github.io/dsc-wiki
MIT License

Meta info database reimplementation #6

Open gaow opened 3 years ago

gaow commented 3 years ago

Improve DSC meta information database

Please first install DSC from the development repo:

pip install git+https://github.com/cumc/dsc -U

Problem overview

We use this toy benchmark as an example:

dsc first_investigation.dsc

There will then be two folders in the directory where you ran the command:

- dsc_result
- .sos

Inside the .sos folder there are several files:

# Generated by DSC
dsc_result.cfg.pkl
dsc_result.io.meta.pkl
dsc_result.io.pkl
# Generated by SoS as it executes the DSC benchmark
step_signatures.db  
transcript.txt  
workflow_signatures.db

These pkl files generated by DSC have information extracted from the *.dsc script currently being executed.

Inside the dsc_result folder, apart from some folders that contain intermediate results, there are two files:

# Generated by DSC
dsc_result.map.mpk 
dsc_result.db  

dsc_result.map.mpk is meant to preserve information from multiple runs of DSC (@BoPeng: in SoS terminology, a module instance is a "substep"). Every time a DSC command runs, dsc_result.map.mpk should be updated, not re-written. It is a key-value (dictionary) database saved in msgpack format.

dsc_result.db is in pickle format, just with an (arbitrary) db file extension. It contains information about the current DSC run.
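Despite the .db extension, the file should load with plain pickle (a minimal check, assuming the toy benchmark above has been run):

import pickle

with open('dsc_result/dsc_result.db', 'rb') as f:
    db = pickle.load(f)
print(type(db))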

Relationship among these files

Current task

Let's start by reimplementing dsc_result.map.mpk. The goal is to still take dsc_result.cfg.pkl and dsc_result.io.meta.pkl as input, but to efficiently update dsc_result.map.mpk, consolidating it with info from previous runs, and to generate dsc_result.io.pkl for the current run. Two requirements:

  1. It is updated, and not rewritten, at each DSC run (a minimal sketch of this load-merge-dump cycle follows the list)
  2. Databases from different users can easily be merged
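A sketch of requirement 1, assuming the current msgpack key-value layout is kept; the update_map helper and its signature are illustrative, not existing DSC API:

import os
import msgpack

def update_map(path, new_entries):
    # Load the existing map database if present, merge in the new
    # entries, and write everything back; existing keys keep their
    # previously assigned filenames.
    db = {}
    if os.path.exists(path):
        with open(path, 'rb') as f:
            db = msgpack.unpack(f, raw=False)
    db.update(new_entries)
    with open(path, 'wb') as f:
        msgpack.pack(db, f)
    return db

Note this still rewrites the whole file on disk even though the logical content is only updated; part of this issue is finding a backend that avoids exactly that.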

Input data explained

.sos/dsc_result.io.meta.pkl

import pickle
pickle.load(open('.sos/dsc_result.io.meta.pkl','rb'))
{1: {'normal': ['normal', 1], 'mean': ['mean', 1], 'abs_err': ['abs_err', 1]},
 2: {'normal': ['normal', 1], 'mean': ['mean', 1], 'sq_err': ['sq_err', 2]},
 3: {'normal': ['normal', 1],
  'median': ['median', 3],
  'abs_err': ['abs_err', 3]},
 4: {'normal': ['normal', 1],
  'median': ['median', 3],
  'sq_err': ['sq_err', 4]},
 5: {'t': ['t', 5], 'mean': ['mean', 5], 'abs_err': ['abs_err', 5]},
 6: {'t': ['t', 5], 'mean': ['mean', 5], 'sq_err': ['sq_err', 6]},
 7: {'t': ['t', 5], 'median': ['median', 7], 'abs_err': ['abs_err', 7]},
 8: {'t': ['t', 5], 'median': ['median', 7], 'sq_err': ['sq_err', 8]}}
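The structure is not documented here, but from the dump it appears that each top-level key is a pipeline ID, and each value maps a module name to [module, ID of the pipeline where this module instance first occurs], so reused instances point back to an earlier pipeline. A minimal sketch based on that reading:

import pickle

with open('.sos/dsc_result.io.meta.pkl', 'rb') as f:
    meta = pickle.load(f)

# Report module instances that are reused from an earlier pipeline,
# i.e. whose recorded pipeline ID differs from the current one.
for pid, modules in meta.items():
    for name, (module, first_pid) in modules.items():
        if first_pid != pid:
            print(f'pipeline {pid}: {name} reused from pipeline {first_pid}')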

.sos/dsc_result.cfg.pkl

pickle.load(open('.sos/dsc_result.cfg.pkl','rb'))
(('normal', 1),
              {('normal:3fce637f',): {'__pipeline_id__': 1,
                '__pipeline_name__': 'a_normal+a_mean+a_abs_err',
                '__module__': 'normal',
                '__out_vars__': ['data', 'true_mean'],
                'DSC_REPLICATE': 1,
                'n': 100,
                'mu': 0},
               '__input_output___': ([], ['normal:3fce637f']),
               '__ext__': 'rds'})
...
...
 (('abs_err', 1),
              {('abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f',
                'normal:3fce637f',
                'mean:e3f9ad83:normal:3fce637f'): {'__pipeline_id__': 1,
                '__pipeline_name__': 'a_normal+a_mean+a_abs_err',
                '__module__': 'abs_err',
                '__out_vars__': ['error']},
               '__input_output___': (['normal:3fce637f',
                 'mean:e3f9ad83:normal:3fce637f'],
                ['abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f']),
               '__ext__': 'rds'})
...

This file contains information for each module instance. It is where most of the information for updating map.mpk comes from. Take ('normal', 1) for example: the tuple key ('normal:3fce637f',) is the signature of this module instance, and its value records the pipeline ID and name, the module name, the output variables, and the module parameters (DSC_REPLICATE, n and mu); __input_output___ lists the input signatures (empty here, since normal has no upstream module) and the output signature.

Now look at the more complicated ('abs_err', 1). Its key ('abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f', 'normal:3fce637f', 'mean:e3f9ad83:normal:3fce637f') has two components: the first element is the signature of the abs_err instance itself (which embeds the signatures of its upstream modules), and the remaining elements are the signatures of its inputs, the normal and mean instances it depends on. This matches the (inputs, outputs) pair stored under __input_output___.
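A sketch of walking this file, assuming the unpickled object behaves as a mapping from (module, pipeline ID) to the per-instance dict shown above:

import pickle

with open('.sos/dsc_result.cfg.pkl', 'rb') as f:
    cfg = pickle.load(f)

# Print each module instance with its input and output signatures,
# taken from the '__input_output___' entry.
for (module, pid), info in cfg.items():
    inputs, outputs = info['__input_output___']
    print(module, pid, 'inputs:', inputs, 'outputs:', outputs)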

dsc_result/dsc_result.map.mpk

This is the database we'll reimplement.

import msgpack
msgpack.unpack(open('dsc_result/dsc_result.map.mpk','rb'), raw=False)
{'normal:3fce637f': 'normal/normal_1.rds',
 'mean:e3f9ad83:normal:3fce637f': 'mean/normal_1_mean_1.rds',
 'abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f': 'abs_err/normal_1_mean_1_abs_err_1.rds',
 'sq_err:cd547d28:normal:3fce637f:mean:e3f9ad83:normal:3fce637f': 'sq_err/normal_1_mean_1_sq_err_1.rds',
 'median:45c94289:normal:3fce637f': 'median/normal_1_median_1.rds',
 'abs_err:0acdbf79:normal:3fce637f:median:45c94289:normal:3fce637f': 'abs_err/normal_1_median_1_abs_err_1.rds',
 'sq_err:cd547d28:normal:3fce637f:median:45c94289:normal:3fce637f': 'sq_err/normal_1_median_1_sq_err_1.rds',
 't:52a5d4d3': 't/t_1.rds',
 'mean:e3f9ad83:t:52a5d4d3': 'mean/t_1_mean_1.rds',
 'abs_err:0acdbf79:t:52a5d4d3:mean:e3f9ad83:t:52a5d4d3': 'abs_err/t_1_mean_1_abs_err_1.rds',
 'sq_err:cd547d28:t:52a5d4d3:mean:e3f9ad83:t:52a5d4d3': 'sq_err/t_1_mean_1_sq_err_1.rds',
 'median:45c94289:t:52a5d4d3': 'median/t_1_median_1.rds',
 'abs_err:0acdbf79:t:52a5d4d3:median:45c94289:t:52a5d4d3': 'abs_err/t_1_median_1_abs_err_1.rds',
 'sq_err:cd547d28:t:52a5d4d3:median:45c94289:t:52a5d4d3': 'sq_err/t_1_median_1_sq_err_1.rds',
 '__base_ids__': {'normal': {'normal': 1},
  'normal:mean': {'normal': 1, 'mean': 1},
  'normal:mean:abs_err': {'normal': 1, 'mean': 1, 'abs_err': 1},
  'normal:mean:sq_err': {'normal': 1, 'mean': 1, 'sq_err': 1},
  'normal:median': {'normal': 1, 'median': 1},
  'normal:median:abs_err': {'normal': 1, 'median': 1, 'abs_err': 1},
  'normal:median:sq_err': {'normal': 1, 'median': 1, 'sq_err': 1},
  't': {'t': 1},
  't:mean': {'t': 1, 'mean': 1},
  't:mean:abs_err': {'t': 1, 'mean': 1, 'abs_err': 1},
  't:mean:sq_err': {'t': 1, 'mean': 1, 'sq_err': 1},
  't:median': {'t': 1, 'median': 1},
  't:median:abs_err': {'t': 1, 'median': 1, 'abs_err': 1},
  't:median:sq_err': {'t': 1, 'median': 1, 'sq_err': 1}}}

The main content is very simple: in 'mean:e3f9ad83:t:52a5d4d3': 'mean/t_1_mean_1.rds', each key corresponds to one unique file name to be saved on disk. What is difficult is to efficiently figure out what the file name should be. Take 'mean:e3f9ad83:t:52a5d4d3' for example: it is a mean module taking a t module as input, so it should end up in the mean/ folder with a t_??_mean_?? file name, indicating that the pipeline so far executed t followed by mean. But we want to assign the unique ?? numbers such that they have a stable relationship with the module instance. For example, t:52a5d4d3 will always correspond to t_1. The file names are then easy to read, because whenever we see t_1 we know the files all come from the same t module instance.
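One way to obtain such stable numbers (a sketch of the idea only, not DSC's current algorithm) is to key the numbering on the module's own parameter hash, i.e. the second field of the signature (e3f9ad83 in mean:e3f9ad83:t:52a5d4d3, which is shared by both mean instances and hence gives both the number 1):

counters = {}  # module name -> {parameter hash: assigned number}

def instance_number(module, param_hash):
    # The same parameter hash always gets the same number; a new hash
    # gets the next free integer for that module.
    seen = counters.setdefault(module, {})
    if param_hash not in seen:
        seen[param_hash] = len(seen) + 1
    return seen[param_hash]

def filename(chain, ext='rds'):
    # chain: ordered (module, parameter hash) pairs from pipeline start
    stem = '_'.join(f'{m}_{instance_number(m, h)}' for m, h in chain)
    return f'{chain[-1][0]}/{stem}.{ext}'

chain = [('t', '52a5d4d3'), ('mean', 'e3f9ad83')]
print(filename(chain))  # mean/t_1_mean_1.rds

To satisfy the update-not-rewrite requirement, counters would have to be persisted across runs; that is essentially the role __base_ids__ plays today.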

Currently my implementation is very simple: everything is written to a single file with the contents above. When new information needs to be added to it at the next DSC run:

  1. The entire file dsc_result/dsc_result.map.mpk is loaded
  2. Based on the input file dsc_result.cfg.pkl, we check which module instances are already in this map database. If an instance is there, we return the filename it corresponds to; otherwise we have to figure out a unique file name for it. This is currently achieved with a __base_ids__ entry that keeps track of the numbering so far and is extended for new input to generate new file names. I will not explain it in detail, because this design is inefficient and we should do a better job in the new implementation

.sos/dsc_result.io.pkl

pickle.load(open('.sos/dsc_result.io.pkl','rb'))
OrderedDict([('1',
              OrderedDict([('normal',
                            OrderedDict([('input', []),
                                         ('output',
                                          ['dsc_result/normal/normal_1.rds']),
                                         ('depends', [])])),
                           ('mean',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds']),
                                         ('output',
                                          ['dsc_result/mean/normal_1_mean_1.rds']),
                                         ('depends', [('normal', 1)])])),
                           ('abs_err',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds',
                                           'dsc_result/mean/normal_1_mean_1.rds']),
                                         ('output',
                                          ['dsc_result/abs_err/normal_1_mean_1_abs_err_1.rds']),
                                         ('depends',
                                          [('normal', 1), ('mean', 1)])]))])),
             ('2',
              OrderedDict([('normal', ('1', 'normal')),
                           ('mean', ('1', 'mean')),
                           ('sq_err',
                            OrderedDict([('input',
                                          ['dsc_result/normal/normal_1.rds',
                                           'dsc_result/mean/normal_1_mean_1.rds']),
                                         ('output',
                                          ['dsc_result/sq_err/normal_1_mean_1_sq_err_1.rds']),
                                         ('depends',
                                          [('normal', 1), ('mean', 1)])]))])),
...

Here, in pipeline '2', the normal module instance is the same as in pipeline '1', so instead of writing the same information again, the entry ('normal', ('1', 'normal')) is a shortcut pointing back to it.
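A sketch of resolving these shortcuts when reading the file (the resolve helper is hypothetical):

import pickle

with open('.sos/dsc_result.io.pkl', 'rb') as f:
    io = pickle.load(f)

def resolve(pid, name):
    # A tuple entry such as ('1', 'normal') points at another
    # (pipeline, module) entry; follow it until we reach real data.
    entry = io[pid][name]
    while isinstance(entry, tuple):
        entry = io[entry[0]][entry[1]]
    return entry

print(resolve('2', 'normal')['output'])  # dsc_result/normal/normal_1.rds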

.sos/dsc_result.io.pkl thus combines information from cfg.pkl, which has the pipeline and dependency info, with the filenames found in map.mpk, to produce meaningful paths for every module instance.

Proposed new implementation for dsc_result/dsc_result.map.mpk

We should have a database like this:

module | file | depends | parameters
normal:3fce637f | normal/normal_1.rds | None | ...
sq_err:cd547d28:t:52a5d4d3:median:45c94289:t:52a5d4d3 | sq_err/t_1_median_1_sq_err_1.rds | (t:52a5d4d3, median:45c94289:t:52a5d4d3) | ...

where the parameters column saves all parameters from the cfg.pkl file. The file column should contain the unique, human-readable filename, generated efficiently; this is perhaps the most difficult feature to implement.

This database should support efficient row addition and deletion.
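As one candidate backend (an assumption for illustration, not a decided choice), sqlite3 from the Python standard library supports this layout directly; the file name dsc_result.map.db and the JSON-serialized parameters column are illustrative:

import json
import sqlite3

con = sqlite3.connect('dsc_result/dsc_result.map.db')
con.execute('''CREATE TABLE IF NOT EXISTS map (
                 module     TEXT PRIMARY KEY,
                 file       TEXT NOT NULL,
                 depends    TEXT,
                 parameters TEXT)''')

# Adding a row is idempotent: a module instance that is already
# present keeps its existing filename.
con.execute('INSERT OR IGNORE INTO map VALUES (?, ?, ?, ?)',
            ('normal:3fce637f', 'normal/normal_1.rds', None,
             json.dumps({'DSC_REPLICATE': 1, 'n': 100, 'mu': 0})))
con.execute('DELETE FROM map WHERE module = ?', ('normal:3fce637f',))
con.commit()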

gaow commented 3 years ago

This task should be easier than it sounds ... I went to extra lengths above to explain what we have, to make sure all details are covered. But essentially it boils down to

  1. Choosing a reliable, portable database implementation for the dsc_result.map.mpk file: efficient query, addition and deletion; support for complex data types in the parameters column; and easy merging of multiple such databases (a merge sketch follows the list)
  2. With the new database in place, an efficient algorithm to figure out the file names in the file column.
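For the merging requirement in point 1, and assuming the sqlite3 layout sketched in the previous comment, one user's rows can be folded into another's with ATTACH:

import sqlite3

con = sqlite3.connect('dsc_result/dsc_result.map.db')
# 'their_map.db' stands in for a collaborator's copy of the database.
con.execute("ATTACH DATABASE 'their_map.db' AS theirs")
con.execute('INSERT OR IGNORE INTO map SELECT * FROM theirs.map')
con.commit()

This resolves duplicate signatures, but two databases may still have assigned the same human-readable filename to different signatures, so a merge would also need to renumber such collisions; that is part of the algorithm in point 2.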

@junyanj1 I'm assigning you to this task but we can all discuss here.