I think another solution would be better. I will start on a different approach soon.
Problem
Until now, the data dependencies directory was not tracked by our git repository, so it was common to end up running with outdated data dependencies. This change tracks the data dependencies by checksum, so we can verify that we are using the intended version before running our Nextflow workflow. This PR should resolve issue #53.
Changes
Updated the README with instructions for using DVC.
Added checksum validation using DVC (excluding VEP and HGMD-related files).
Allowed relative paths for ref_dir-related parameters.
Added a --skip_data_checksum option.
Proposed Issue Coverage
[x] Implement checksum verification as a default step in the pipeline
[x] Add a --skip_data_checksum option to bypass the verification
[x] Create a checksum file in the ref_dir if it doesn't exist
[x] Store prepared checksum values for internal and free versions
[x] Verify checksums against the prepared values without recreating them
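
For reviewers who want a picture of what the verification step amounts to, here is a minimal Python sketch of comparing files under ref_dir against prepared checksums. It is not the code in this PR and not how DVC is implemented internally; the function names, the `expected` mapping, and the `skip` argument (a stand-in for --skip_data_checksum) are all hypothetical, and MD5 is assumed as the digest.

```python
from __future__ import annotations

import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_ref_dir(
    ref_dir: Path, expected: dict[str, str], skip: bool = False
) -> list[str]:
    """Compare files under ref_dir with prepared checksums.

    `expected` maps a path relative to ref_dir to its prepared MD5 digest
    (hypothetical layout, chosen for illustration only).
    Returns the relative paths that are missing or whose digest differs;
    an empty list means everything matches.
    """
    if skip:  # hypothetical analogue of --skip_data_checksum
        return []
    mismatches = []
    for rel_path, want in sorted(expected.items()):
        target = ref_dir / rel_path
        # A missing file counts as a mismatch, not as an error.
        if not target.is_file() or md5_of_file(target) != want:
            mismatches.append(rel_path)
    return mismatches
```

The prepared values are only read, never regenerated, which mirrors the last checklist item: a stale or modified file shows up as a mismatch instead of silently producing a new checksum.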
Notes
VEP could not be covered with DVC because of its very large data volume.
We excluded HGMD-related files, so all DVC-tracked files are shared between the internal and public versions.