# Improve DSC meta information database

**Open** · gaow opened this issue 4 years ago

This task should be easier than it sounds ... I went to extra lengths to explain what we have below, to make sure all details are covered. But essentially it boils down to the `dsc_result.map.mpk` file: efficient query, addition, and deletion; support for complex data types in the `parameter` column; and easy merging of multiple such databases via the `file` column. @junyanj1 I'm assigning you to this task but we can all discuss here.

Please first install DSC from the development repo:

## Problem overview
We use this toy benchmark as an example. There will then be two folders in the directory where you run the command:

Inside the `.sos` folder there are several files. These `pkl` files generated by DSC contain information extracted from the current `*.dsc` script that is being executed.

Inside the `dsc_result` folder, apart from some folders that contain intermediate results, there are two files:

- `dsc_result.map.mpk` is meant to preserve information from multiple runs of DSC (@BoPeng: in SoS terminology, a module instance is a "substep"). Every time a DSC command runs, `dsc_result.map.mpk` should be updated and not re-written. `dsc_result.map.mpk` is a key-value (dictionary) database saved in `msgpack` format.
- `dsc_result.db` is in pickle format, just with an (arbitrary) `db` file extension. It contains information about the current DSC run.
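For orientation, both files can be inspected directly. Below is a minimal sketch, assuming the toy benchmark above has been run in the current directory; it is not part of DSC itself:

```python
# Minimal inspection sketch, not part of DSC. Paths assume the toy benchmark
# above was run in the current directory.
import pickle
import msgpack

with open("dsc_result/dsc_result.map.mpk", "rb") as f:
    map_db = msgpack.unpackb(f.read(), raw=False)   # key-value dictionary

with open("dsc_result/dsc_result.db", "rb") as f:
    db = pickle.load(f)                             # pickle despite the .db extension

print(list(map_db)[:5])   # a few keys from the map database
```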
## Relationship among these files

- `dsc_result.cfg.pkl`, `dsc_result.io.meta.pkl` = generated_from(`dsc_script`)
- `dsc_result.map.mpk`, `dsc_result.io.pkl` = generated_from(`dsc_result.cfg.pkl`, `dsc_result.io.meta.pkl`), via this function
- `dsc_result.db` = generated_from(`dsc_result.cfg.pkl`, `dsc_result.io.meta.pkl`, `dsc_result.map.mpk`), via this class

## Current task
Let's start by reimplementing `dsc_result.map.mpk`. The goal is to still take `dsc_result.cfg.pkl` and `dsc_result.io.meta.pkl` as input, but to efficiently update `dsc_result.map.mpk`, consolidating it with info from previous runs, and to generate `dsc_result.io.pkl` for the current run.
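One way to frame this is as a single update step. The function name and signature below are just my illustration of that input/output contract, not existing DSC code:

```python
# Hypothetical skeleton of the task's contract; not existing DSC code.
def update_map(cfg, io_meta, map_db):
    """
    cfg     -- contents of .sos/dsc_result.cfg.pkl for the current run
    io_meta -- contents of .sos/dsc_result.io.meta.pkl for the current run
    map_db  -- contents of dsc_result/dsc_result.map.mpk accumulated from
               previous runs (an empty mapping on the first run)

    Returns (map_db, io): the map database with previous entries preserved
    and new module instances added, plus the per-run information to be
    written to .sos/dsc_result.io.pkl.
    """
    raise NotImplementedError
```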
## Input data explained

### `.sos/dsc_result.io.meta.pkl`
`'normal': ['normal', 1]` means the `normal` module here is the same `normal` module as used in pipeline `1`: as you can see, the first 4 pipelines share the same `normal` module. This meta information tells us what modules are shared between pipelines.
### `.sos/dsc_result.cfg.pkl`

This file contains information for each module. It is where most of the information for updating `map.mpk` comes from. Take `('normal', 1)` for example:

- `('normal', 1)` means this is the `normal` module in pipeline `1`.
- `'normal:3fce637f'` is its unique ID. This is determined by its input and parameters, and the MD5SUM of the script for it. Any changes to input and parameters will result in a different ID; this is guaranteed. This key is relevant to building the `map.mpk` database.
- The values under `'normal:3fce637f'` are parameter values. Basically, `'normal:3fce637f'` is the hash of these values and the MD5SUM of the script for the module (the script is not included here). These values are relevant to building the `db` database.
- `'__input_output___'`: the first element is the input to this module, the second element is the output of this module.
- `'__ext__'`: extension of the output file from this module. In current DSC, each module outputs one file. For R it is `rds`, for Python it is `pkl` -- these files contain output variables in DSC. For Bash it is `yaml` (a meta file).

Now look at a more complicated example, `('abs_err', 1)`. The key `('abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f', 'normal:3fce637f', 'mean:e3f9ad83:normal:3fce637f')` has two components:

- `abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f` is the module instance ID. It is made up of the module itself and all its dependency modules (its input), because `abs_err` takes input from `normal` and `mean`, and `mean` takes input from `normal`.
- `normal:3fce637f` and `mean:e3f9ad83:normal:3fce637f` are dependencies of `abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f`.
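To make the composition rule explicit, here is a rough illustration using the IDs from the example above; how DSC actually builds these strings may differ in detail:

```python
# Illustration only: compose a module instance ID from the module's own
# "name:hash" ID plus the instance IDs of its dependencies, as described above.
def instance_id(module_id, dependencies=()):
    """Join the module's own ID with its dependencies' instance IDs."""
    return ":".join([module_id, *dependencies])

normal = instance_id("normal:3fce637f")                    # 'normal:3fce637f'
mean = instance_id("mean:e3f9ad83", [normal])              # 'mean:e3f9ad83:normal:3fce637f'
abs_err = instance_id("abs_err:0acdbf79", [normal, mean])
# 'abs_err:0acdbf79:normal:3fce637f:mean:e3f9ad83:normal:3fce637f'
```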
### `dsc_result/dsc_result.map.mpk`
This is the database we'll reimplement. The main content is very simple: `'mean:e3f9ad83:t:52a5d4d3': 'mean/t_1_mean_1.rds'` -- one key corresponding to one unique file name to be saved on disk.

However, what's difficult is to efficiently figure out what the file name should be. Take `'mean:e3f9ad83:t:52a5d4d3'` for example. It is a `mean` module taking a `t` module as input, so it should end up in a `mean/` folder, with a `t_??_mean_??` file name indicating that the pipeline so far has executed `t` followed by `mean`. But we want to assign unique `??` numbers such that they have a relationship with the module. For example, `t:52a5d4d3` will always correspond to `t_1`. So the file names are easy to read: when we see `t_1` we know they are all from the same `t` module.
Currently my implementation is very simple: everything is written to a file with the contents above. When new information needs to be added to it at the next DSC run:

1. `dsc_result/dsc_result.map.mpk` will be loaded.
2. From `dsc_result.cfg.pkl`, we see which modules are not in this map database already. If a module is already there, we return the filename value it corresponds to; otherwise we have to figure out a unique file name for it. This is currently achieved with a `__base_ids__` entry that keeps track of the numbering so far and adds to it for the new input to generate new file names. I'll not explain it in detail, because this design is inefficient and we should do a better job with a new implementation.
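For concreteness, here is a rough sketch of the lookup-or-assign step with per-module counters, so that e.g. `t:52a5d4d3` always maps to `t_1`. This is my reading of the behaviour described above, not the actual DSC code; in the real file, the `__base_ids__` entry is what holds this numbering state.

```python
# Sketch only (not DSC's implementation): return the existing filename for a
# module instance, or assign a new human-readable one. An instance ID such as
# 'mean:e3f9ad83:t:52a5d4d3' is split into "module:hash" chunks; each chunk
# gets a stable per-module number, giving e.g. 'mean/t_1_mean_1.rds'.
def lookup_or_assign(instance_id, ext, map_db, counters):
    if instance_id in map_db:               # seen in a previous run: reuse it
        return map_db[instance_id]
    parts = instance_id.split(":")
    chunks = [":".join(parts[i:i + 2]) for i in range(0, len(parts), 2)]
    pieces = []
    for chunk in chunks:
        module = chunk.split(":")[0]
        numbering = counters.setdefault(module, {})      # role of __base_ids__
        number = numbering.setdefault(chunk, len(numbering) + 1)
        pieces.append(f"{module}_{number}")
    # dependencies come first in the file name, the module itself last,
    # and the file lives under the module's own folder
    filename = f"{chunks[0].split(':')[0]}/{'_'.join(reversed(pieces))}.{ext}"
    map_db[instance_id] = filename
    return filename

# lookup_or_assign('mean:e3f9ad83:t:52a5d4d3', 'rds', {}, {})
# -> 'mean/t_1_mean_1.rds'
```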
### `.sos/dsc_result.io.pkl`

This file records, for each module instance, what its input files are (`input`), what modules these inputs are generated from (`depends`), and what the output from this module is. Here, in pipeline `2`, the `normal` module is the same as the `normal` module in pipeline `'1'`, so instead of writing the same information again, it uses this shortcut to keep the information.

Information in `.sos/dsc_result.io.pkl` combines that from `cfg.pkl`, which has the pipeline and dependency info, then uses the filenames it found in `map.mpk` to get meaningful filenames.
## Proposed new implementation for `dsc_result/dsc_result.map.mpk`

We should have a database where `parameters` saves all parameters from the `cfg.pkl` file, and the `file` column holds the unique, human-readable filename, generated efficiently. This is perhaps the most difficult feature to implement. The database should also support efficient row addition and deletion.
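As one possible concrete shape, here is a sketch under my own assumptions (rows keyed by module instance ID and stored in msgpack, as today); only the `file` and `parameters` columns come from the proposal above:

```python
# Sketch of one possible layout for the proposed database; not a design
# decision, just an illustration of the requirements listed above.
import msgpack

def load_db(path):
    """Load the table, or start empty on the first run."""
    try:
        with open(path, "rb") as f:
            return msgpack.unpackb(f.read(), raw=False)
    except FileNotFoundError:
        return {}

def add_row(db, instance_id, filename, parameters):
    # 'parameters' may be an arbitrary nested structure from cfg.pkl;
    # msgpack serializes nested lists/dicts/numbers/strings as-is.
    db[instance_id] = {"file": filename, "parameters": parameters}

def remove_rows(db, instance_ids):
    for key in instance_ids:
        db.pop(key, None)

def merge(db_a, db_b):
    """Merge two such databases; entries from db_b win on conflicting keys."""
    merged = dict(db_a)
    merged.update(db_b)
    return merged

def save_db(db, path):
    with open(path, "wb") as f:
        f.write(msgpack.packb(db, use_bin_type=True))
```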