SmartDataInnovationLab / git_batch

create a git-remote for running batch-jobs
BSD 3-Clause "New" or "Revised" License

new example: reproducible datascience #9

Open bjuergens opened 6 years ago

bjuergens commented 6 years ago

architecture and requirements

user-stories

reproduce results from a paper

  1. see a paper published by an SDIL associate
  2. connect to SDIL
  3. check out the commit from git (repository & commit-id are linked in the paper)
  4. execute it
  5. the results are exactly the same as the ones in the paper

creating reproducible results

  1. validate input data
    1. --> make sure the hash is correct
  2. run experiment
  3. update hash on data
  4. check in soft-links to new data

this use-case will be automated in some sort of CI script
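a hypothetical sketch of what such a CI script could look like (validate.py, create_work_copy.py and hash_work_copy.py are the scripts described below; run_experiment.py and the commit message are placeholders, not part of this project):

```python
# ci_run.py -- hypothetical CI wrapper for the steps above.
import subprocess
import sys

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the pipeline on the first failing step

run([sys.executable, "validate.py"])          # 1. validate input data (hash check)
run([sys.executable, "create_work_copy.py"])  #    get a writable hardlinked copy
run([sys.executable, "run_experiment.py"])    # 2. run experiment (placeholder name)
run([sys.executable, "hash_work_copy.py"])    # 3. update hash on data
run(["git", "add", "data"])                   # 4. check in soft-links to new data
run(["git", "commit", "-m", "reproducible run"])
```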

converting regular project into reproducible project

  1. create hash on source-dir
  2. move all content from source-dir to target-dir/&lt;hash&gt;/
  3. replace all links in project-dir to source-dir with links to target-dir/&lt;hash&gt;/

this will be one script: create_hashed_data.py (required parameters: source-dir, target-dir and project-dir)

note: the source-dir is a subfolder of project-dir which links to the actual source-dir (softlink or hardlink).
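a minimal sketch of create_hashed_data.py, assuming a plain sha256 over the sorted relative paths and file contents (in practice the hashing is done with dirhash, see the second comment; the simplified folder name here also drops the version and size components shown in the example further down):

```python
# create_hashed_data.py -- sketch only; the real hashing is done by dirhash,
# and the simplified folder name drops version and size
# (real scheme: v1:sha256:<size>:<digest>).
import argparse
import hashlib
import os
import shutil

def hash_dir(path):
    """Deterministically hash relative paths and file contents under path."""
    h = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # fix the traversal order
        for name in sorted(files):
            full = os.path.join(root, name)
            h.update(os.path.relpath(full, path).encode())
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("source_dir")
    p.add_argument("target_dir")
    p.add_argument("project_dir")
    args = p.parse_args()

    digest = hash_dir(args.source_dir)                         # 1. create hash on source-dir
    hashed = os.path.join(args.target_dir, "sha256:" + digest)
    shutil.move(args.source_dir, hashed)                       # 2. move content to target-dir/<hash>/
    link = os.path.join(args.project_dir, "data")              # simplification: one link called "data"
    if os.path.islink(link):
        os.remove(link)                                        # 3. replace the old link ...
    os.symlink(hashed, link)                                   # ... with a link to target-dir/<hash>/
```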

life-cycle

once you have converted the project into a "reproducible project" by running the script create_hashed_data.py on the data directory, you manually add a softlink in your code-directory to the hashed data and check the result into git.

when you start your experiment you run validate.py. It will raise an error if the data folder's hash doesn't match its content. It will also raise an error if any file or subfolder in the data dir is writable.
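a sketch of those two checks, assuming the digest is the last ":"-separated component of the hashed directory's name (as in the folder-structure example below) and reusing hash_dir from the create_hashed_data.py sketch above:

```python
# validate.py -- sketch; "data" is the softlink in the code-directory.
import os
import stat
import sys

from create_hashed_data import hash_dir  # the sketched hash function from above

target = os.path.realpath("data")                        # resolve the softlink
expected = os.path.basename(target).rsplit(":", 1)[-1]   # digest from the dir name

# error if the data folder's hash doesn't match its content
actual = hash_dir(target)
if actual != expected:
    sys.exit("hash mismatch: expected %s, got %s" % (expected, actual))

# error if any file or subfolder in the data dir is writable
for root, dirs, files in os.walk(target):
    for name in dirs + files:
        entry = os.path.join(root, name)
        if os.stat(entry).st_mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH):
            sys.exit("writable entry in data dir: " + entry)

print("data dir is valid and read-only")
```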

then you commit everything into git. (this commit will need to be checked out to reproduce the experiment)

then you run create_work_copy.py, which will create a hardlinked copy of the data-dir and update the softlinks in the code-dir. This allows your experiment to create new files without changing the hash of existing folders.
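a sketch of create_work_copy.py; putting the work copy next to the hashed dir is an assumption. because the existing files are hardlinked (and validate.py has already checked that they are read-only), the experiment can add new files but cannot silently change the hashed originals:

```python
# create_work_copy.py -- sketch; the ".work" location is an assumption.
import os

def hardlink_tree(src, dst):
    """Recreate the directory tree of src under dst, hardlinking every file."""
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        os.makedirs(os.path.join(dst, rel), exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(dst, rel, name))

if __name__ == "__main__":
    hashed = os.path.realpath("data")   # currently points at the hashed dir
    work = hashed + ".work"             # assumption: work copy lives next to it
    hardlink_tree(hashed, work)         # hardlinked copy of the data-dir
    os.remove("data")                   # update the softlink in the code-dir ...
    os.symlink(work, "data")            # ... to point at the writable work copy
```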

then you run your experiment.

then you run hash_work_copy.py. It will compute a hash of your modified data-dir, create a new hash-folder, hard-copy all files of your working copy into this folder, and change the softlinks in the code-directory to point at the new hashed directory.
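a matching sketch of hash_work_copy.py, reusing the helpers from the sketches above (same simplified naming scheme):

```python
# hash_work_copy.py -- sketch; reuses the sketched helpers from above.
import os

from create_hashed_data import hash_dir
from create_work_copy import hardlink_tree

if __name__ == "__main__":
    work = os.path.realpath("data")                      # currently points at the work copy
    digest = hash_dir(work)                              # hash of your modified data-dir
    archive = os.path.dirname(work)
    hashed = os.path.join(archive, "sha256:" + digest)   # new hash-folder
    hardlink_tree(work, hashed)                          # hard-copy the working copy into it
    os.remove("data")                                    # change the softlink in the code-dir ...
    os.symlink(hashed, "data")                           # ... to the new hashed directory
```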

then you commit the new softlinks into git.

scripts

create_hashed_data.py

validate.py

create_work_copy.py

hash_work_copy.py

code folder structure

the code directory has a sub-dir called data, which is a softlink to a hashed data directory (e.g. /smartdata/proj_iris/v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9)
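for illustration, a small snippet that resolves the softlink and splits the name of the hashed directory into its components (the version:algorithm:size:digest layout is taken from the example path above):

```python
# inspect_data_link.py -- illustration only; splits the example naming scheme.
import os

target = os.path.realpath("data")
# e.g. v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9
version, algo, size, digest = os.path.basename(target).split(":")
print("version:", version)   # v1
print("algo:   ", algo)      # sha256
print("size:   ", size)      # 128M
print("digest: ", digest)
```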

data folder structure

variables:

general folder structure:

example folder structure:

bjuergens commented 6 years ago

thanks to https://github.com/SmartDataInnovationLab/dirhash/issues/12

we can now do:

```sh
OLD_DATA_DIR="where data currently is"
ARCHIVE="path to hashed repository"
DATA_DIR_IN_PROJECT="where the softlink to the hashed repository should be created"

spark-submit --master yarn \
  --jars "$HOME/dev/dirhash/target/sparkhacks-0.0.1-SNAPSHOT.jar" \
  "$HOME/dev/dirhash/dirhash.py" "$OLD_DATA_DIR" \
  --add-folder-to-repo "$ARCHIVE" \
  --softlink "$DATA_DIR_IN_PROJECT" 2>/dev/null
```

to convert a project. This takes care of the step "converting regular project into reproducible project"

bjuergens commented 6 years ago

the next step will be "creating reproducible results":