bjuergens opened 6 years ago
thanks to https://github.com/SmartDataInnovationLab/dirhash/issues/12
we can now do:
```sh
OLD_DATA_DIR="where data currently is"
ARCHIVE="path to hashed repository"
DATA_DIR_IN_PROJECT="where the softlink to the hashed repository should be created"

spark-submit --master yarn \
  --jars $HOME/dev/dirhash/target/sparkhacks-0.0.1-SNAPSHOT.jar \
  $HOME/dev/dirhash/dirhash.py $OLD_DATA_DIR \
  --add-folder-to-repo $ARCHIVE \
  --softlink $DATA_DIR_IN_PROJECT 2>/dev/null
```

to convert a project. This takes care of the step "converting a regular project into a reproducible project".
the next step will be "creating reproducible results":
## architecture and requirements

### user-stories

**reproduce results from a paper**

**creating reproducible results**

this use-case will be automated in some sort of CI script.
**converting a regular project into a reproducible project**

this will be one script: `create_hashed_data.py` (required parameters: source-dir, target-dir and project-dir).

note: the source-dir is a subfolder of project-dir which links to the actual source-dir (softlink or hardlink).
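A minimal sketch of what the `create_hashed_data.py` command line could look like. The three parameter names are taken from this issue; the argparse wiring is my assumption, not the actual implementation:

```python
# Hypothetical CLI skeleton for create_hashed_data.py.
# Only the parameter names come from the issue text; the rest is a sketch.
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(
        description="Convert a regular project into a reproducible project."
    )
    # where the original data currently lives
    parser.add_argument("--source-dir", required=True)
    # where the hashed repository should be created
    parser.add_argument("--target-dir", required=True)
    # the project root; source-dir is a (soft- or hardlinked) subfolder of it
    parser.add_argument("--project-dir", required=True)
    return parser.parse_args(argv)
```

With this shape, a call like `create_hashed_data.py --source-dir /old/data --target-dir /smartdata/proj_iris --project-dir ~/project` would carry everything the conversion step needs.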
### life-cycle
once you convert the project into a "reproducible project" by running `create_hashed_data.py` on the data directory, you manually add a softlink in your code directory to the hashed data and check the result into git.

when you start your experiment you run `validate.py`. it will raise an error if the data folder's hash doesn't match its contents, and it will also raise an error if any file or subfolder in the data dir is writable. then you commit everything into git (this commit will need to be checked out to reproduce the experiment).
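The two checks described for `validate.py` could be sketched roughly like this. This is an illustration only: `compute_dir_hash` is a caller-supplied placeholder, not dirhash's real API, and the assumption that the directory name itself carries the expected hash follows the folder-structure examples later in this issue:

```python
# Sketch of the two checks validate.py is described as performing:
# 1) the stored hash must match the directory contents,
# 2) nothing in the data dir may be writable.
import os
import stat


def find_writable(data_dir):
    """Return paths inside data_dir whose mode bits allow writing."""
    writable = []
    for root, dirs, files in os.walk(data_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            mode = os.stat(path).st_mode
            if mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH):
                writable.append(path)
    return writable


def validate(data_dir, compute_dir_hash):
    # assumption: the directory name carries the expected hash,
    # e.g. "v1:sha256:128M:b0efbb..." -- compute_dir_hash is a placeholder
    expected = os.path.basename(os.path.normpath(data_dir))
    actual = compute_dir_hash(data_dir)
    if actual != expected:
        raise ValueError(f"hash mismatch: {expected} != {actual}")
    writable = find_writable(data_dir)
    if writable:
        raise ValueError(f"writable entries in data dir: {writable}")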
then you run `create_work_copy.py`, which will create a hardlinked copy of the data dir and update the softlinks in the code dir. this allows your experiment to create new files without changing the hash of existing folders.

then you run your experiment.
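The hardlinked-copy step can be sketched with `os.link`. This is a sketch under the assumption that the work copy lives in a `work_<timestamp>` directory (the naming comes from the folder-structure section of this issue); the real `create_work_copy.py` may differ:

```python
# Sketch: build a hardlinked copy of the (read-only) hashed data dir so the
# experiment can add new files without touching the hashed originals.
import os
import time


def create_work_copy(hash_dir, base_dir):
    """Hardlink every file of hash_dir into a fresh work_<timestamp> dir."""
    work_dir = os.path.join(base_dir, "work_%d" % int(time.time()))
    for root, dirs, files in os.walk(hash_dir):
        rel = os.path.relpath(root, hash_dir)
        dest_root = os.path.join(work_dir, rel) if rel != "." else work_dir
        os.makedirs(dest_root, exist_ok=True)
        for name in files:
            # hardlink: same inode, so no data is duplicated and the
            # hashed original stays byte-identical to the copy
            os.link(os.path.join(root, name), os.path.join(dest_root, name))
    return work_dir
```

Because hardlinks share the inode, the copy costs almost no disk space; only files the experiment newly creates or replaces occupy new blocks.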
then you run `hash_work_copy.py`. it will compute a hash of your modified data dir, create a new hash folder, hard-copy all files of your working copy into that folder, and change the softlinks in the code directory to point to the new hashed directory. then you commit the new softlinks into git.
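Putting the pieces together, `hash_work_copy.py` could be sketched like this. The `compute_dir_hash` function is a naive stand-in for whatever dirhash actually computes (in particular, the meaning of the `128M` field is copied verbatim from the examples in this issue, not implemented):

```python
# Sketch of hash_work_copy.py: hash the modified work dir, hardlink its
# files into a new <base>/<hash> folder, and repoint the code dir's
# softlink. compute_dir_hash is NOT dirhash's real algorithm.
import hashlib
import os


def compute_dir_hash(path):
    """Naive stand-in: sha256 over sorted relative paths and file contents."""
    h = hashlib.sha256()
    for root, dirs, files in os.walk(path):
        dirs.sort()
        for name in sorted(files):
            full = os.path.join(root, name)
            h.update(os.path.relpath(full, path).encode())
            with open(full, "rb") as f:
                h.update(f.read())
    # "128M" is copied from the issue's examples; its exact meaning
    # (presumably a chunk size) is an assumption here
    return "v1:sha256:128M:" + h.hexdigest()


def hash_work_copy(work_dir, base_dir, code_data_link):
    hash_dir = os.path.join(base_dir, compute_dir_hash(work_dir))
    for root, dirs, files in os.walk(work_dir):
        rel = os.path.relpath(root, work_dir)
        dest = os.path.join(hash_dir, rel) if rel != "." else hash_dir
        os.makedirs(dest, exist_ok=True)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(dest, name))
    # repoint the softlink in the code directory to the new hashed dir
    if os.path.islink(code_data_link):
        os.remove(code_data_link)
    os.symlink(hash_dir, code_data_link)
    return hash_dir
```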
### scripts

- `create_hashed_data.py`
- `validate.py`
- `create_work_copy.py`
- `hash_work_copy.py`
### code folder structure

the code directory has a sub-dir called `data`, which is a softlink to a hashed data directory (e.g. `/smartdata/proj_iris/v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9`).

### data folder structure
variables:

- `<base>`: base folder for the project data, e.g. `/smartdata/proj_iris/` or `/smartdata/ugfam/data/iris`
- `<hash>`: hash value of the entire directory, e.g. `v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9`
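The hash value is a colon-separated identifier, which makes it trivial to take apart. A tiny sketch (the interpretation of the third field, `128M`, as a chunk/block size is my assumption; the other fields follow the examples in this issue):

```python
# Split a dirhash-style identifier into its fields. Treating the third
# field ("128M") as a block size is an assumption, not documented behavior.
def parse_hash_id(value):
    version, algorithm, block_size, digest = value.split(":")
    return {
        "version": version,      # e.g. "v1"
        "algorithm": algorithm,  # e.g. "sha256"
        "block_size": block_size,  # e.g. "128M" (assumed chunk size)
        "digest": digest,        # hex digest of the directory contents
    }
```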
general folder structure:

```
<base>/
    <hash>/
    work_<timestamp>
```
example folder structure:

```
/smartdata/proj_iris/
    v1:sha256:128M:aa669dcefba57e01bd7ff0526a0001d2118f06adc8106d265b5743b0ee90084f/
        iris.csv
    v1:sha256:128M:280de49cd7cd754b71759bc5da30c31a7be3350bcde2548aebab702272ec1c51/
        iris.csv --> ../v1:sha256:128M:b0efbbc43054beee753cd10fab49ea0fe2fabdba420e72d0ba74fe2a0222dbf9/iris.csv (hardlink)
        number.csv
```