SmartDataInnovationLab / git_batch

create a git-remote for running batch-jobs
BSD 3-Clause "New" or "Revised" License

new example: reproducible datascience #9

Open bjuergens opened 6 years ago

bjuergens commented 6 years ago

architecture and requirements

user-stories

reproduce results from a paper

  1. see a paper published by an SDIL associate
  2. connect to SDIL
  3. check out the commit from git (repository & commit-id are linked in the paper)
  4. execute it
  5. the results are exactly the same as the ones in the paper

creating reproducible results

  1. validate input data
    1. --> make sure the hash is correct
  2. run experiment
  3. update hash on data
  4. check in soft-links to new data

this use-case will be automated in some sort of CI script
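a hypothetical sketch of what such a CI script could look like (validate.py, create_work_copy.py and hash_work_copy.py are the scripts described below; run_experiment.py and the commit message are placeholders, not part of this project):

```python
# ci_run.py -- hypothetical CI wrapper for the steps above.
import subprocess
import sys

def run(cmd):
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # abort the pipeline on the first failing step

run([sys.executable, "validate.py"])          # 1. validate input data (hash check)
run([sys.executable, "create_work_copy.py"])  #    get a writable hardlinked copy
run([sys.executable, "run_experiment.py"])    # 2. run experiment (placeholder name)
run([sys.executable, "hash_work_copy.py"])    # 3. update hash on data
run(["git", "add", "data"])                   # 4. check in soft-links to new data
run(["git", "commit", "-m", "reproducible run"])
```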

converting regular project into reproducible project

  1. create hash on source-dir
  2. move all content from source-dir to target-dir/&lt;hash&gt;/
  3. replace all links in project-dir to source-dir with links to target-dir/&lt;hash&gt;/

this will be one script: create_hashed_data.py (required parameters: source-dir, target-dir and project-dir)

note: the source-dir is a subfolder of project-dir which links to the actual source-dir (softlink or hardlink).
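a minimal sketch of create_hashed_data.py, assuming a plain sha256 over the sorted relative paths and file contents (in practice the hashing is done with dirhash, see the second comment; the simplified folder name here also drops the version and size components shown in the example further down):

```python
# create_hashed_data.py -- sketch only; the real hashing is done by dirhash,
# and the simplified folder name drops version and size
# (real scheme: v1:sha256:<size>:<digest>).
import argparse
import hashlib
import os
import shutil

def hash_dir(path):
    """Deterministically hash relative paths and file contents under path."""
    h = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()  # fix the traversal order
        for name in sorted(files):
            full = os.path.join(root, name)
            h.update(os.path.relpath(full, path).encode())
            with open(full, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
    return h.hexdigest()

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("source_dir")
    p.add_argument("target_dir")
    p.add_argument("project_dir")
    args = p.parse_args()

    digest = hash_dir(args.source_dir)                         # 1. create hash on source-dir
    hashed = os.path.join(args.target_dir, "sha256:" + digest)
    shutil.move(args.source_dir, hashed)                       # 2. move content to target-dir/<hash>/
    link = os.path.join(args.project_dir, "data")              # simplification: one link called "data"
    if os.path.islink(link):
        os.remove(link)                                        # 3. replace the old link ...
    os.symlink(hashed, link)                                   # ... with a link to target-dir/<hash>/
```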

life-cycle

once you have converted the project into a "reproducible project" by running the script create_hashed_data.py on the data directory, you manually add a softlink in your code-directory to the hashed data and check the result into git.

when you start your experiment you run validate.py. It will raise an error if the data folder's hash doesn't match its content. It will also raise an error if any file or subfolder in the data dir is writable.
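a sketch of those two checks, assuming the digest is the last ":"-separated component of the hashed directory's name (as in the folder-structure example below) and reusing hash_dir from the create_hashed_data.py sketch above:

```python
# validate.py -- sketch; "data" is the softlink in the code-directory.
import os
import stat
import sys

from create_hashed_data import hash_dir  # the sketched hash function from above

target = os.path.realpath("data")                        # resolve the softlink
expected = os.path.basename(target).rsplit(":", 1)[-1]   # digest from the dir name

# error if the data folder's hash doesn't match its content
actual = hash_dir(target)
if actual != expected:
    sys.exit("hash mismatch: expected %s, got %s" % (expected, actual))

# error if any file or subfolder in the data dir is writable
for root, dirs, files in os.walk(target):
    for name in dirs + files:
        entry = os.path.join(root, name)
        if os.stat(entry).st_mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH):
            sys.exit("writable entry in data dir: " + entry)

print("data dir is valid and read-only")
```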

then you commit everything into git. (this commit will need to be checked out to reproduce the experiment)

then you run create_work_copy.py, which will create a hardlinked copy of the data-dir and update the softlinks in the code-dir. This allows your experiment to create new files without changing the hash of existing folders.
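a sketch of create_work_copy.py; putting the work copy next to the hashed dir is an assumption. because the existing files are hardlinked (and validate.py has already checked that they are read-only), the experiment can add new files but cannot silently change the hashed originals:

```python
# create_work_copy.py -- sketch; the ".work" location is an assumption.
import os

def hardlink_tree(src, dst):
    """Recreate the directory tree of src under dst, hardlinking every file."""
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        os.makedirs(os.path.join(dst, rel), exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(dst, rel, name))

if __name__ == "__main__":
    hashed = os.path.realpath("data")   # currently points at the hashed dir
    work = hashed + ".work"             # assumption: work copy lives next to it
    hardlink_tree(hashed, work)         # hardlinked copy of the data-dir
    os.remove("data")                   # update the softlink in the code-dir ...
    os.symlink(work, "data")            # ... to point at the writable work copy
```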

then you run your experiment.

then you run hash_work_copy.py. It will compute a hash of your modified data-dir, create a new hash-folder, hard-copy all files of your working copy into this folder, and change the softlinks in the code-directory to point at the new hashed directory.
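a matching sketch of hash_work_copy.py, reusing the helpers from the sketches above (same simplified naming scheme):

```python
# hash_work_copy.py -- sketch; reuses the sketched helpers from above.
import os

from create_hashed_data import hash_dir
from create_work_copy import hardlink_tree

if __name__ == "__main__":
    work = os.path.realpath("data")                      # currently points at the work copy
    digest = hash_dir(work)                              # hash of your modified data-dir
    archive = os.path.dirname(work)
    hashed = os.path.join(archive, "sha256:" + digest)   # new hash-folder
    hardlink_tree(work, hashed)                          # hard-copy the working copy into it
    os.remove("data")                                    # change the softlink in the code-dir ...
    os.symlink(hashed, "data")                           # ... to the new hashed directory
```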

then you commit the new softlinks into git.

scripts

create_hashed_data.py

validate.py

create_work_copy.py

hash_work_copy.py

code folder structure

the code directory has a sub-dir called data, which is a softlink to a hashed data directory (e.g. /smartdata/proj_iris/v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9)
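for illustration, a small snippet that resolves the softlink and splits the name of the hashed directory into its components (the version:algorithm:size:digest layout is taken from the example path above):

```python
# inspect_data_link.py -- illustration only; splits the example naming scheme.
import os

target = os.path.realpath("data")
# e.g. v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9
version, algo, size, digest = os.path.basename(target).split(":")
print("version:", version)   # v1
print("algo:   ", algo)      # sha256
print("size:   ", size)      # 128M
print("digest: ", digest)
```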

data folder structure

variables:

general folder structure:

example folder structure:

bjuergens commented 6 years ago

thanks to https://github.com/SmartDataInnovationLab/dirhash/issues/12

we can now do:

```sh
OLD_DATA_DIR="where data currently is"
ARCHIVE="path to hashed repository"
DATA_DIR_IN_PROJECT="where the softlink to the hashed repository should be created"

spark-submit --master yarn \
  --jars "$HOME/dev/dirhash/target/sparkhacks-0.0.1-SNAPSHOT.jar" \
  "$HOME/dev/dirhash/dirhash.py" "$OLD_DATA_DIR" \
  --add-folder-to-repo "$ARCHIVE" \
  --softlink "$DATA_DIR_IN_PROJECT" 2>/dev/null
```

to convert a project. This takes care of the step "converting regular project into reproducible project"

bjuergens commented 6 years ago

the next step will be "creating reproducible results":