LiuzLab / AI_MARRVEL

AI-MARRVEL (AIM) is an AI system for rare genetic disorder diagnosis
GNU General Public License v3.0

[WIP] Use DVC to check data dependency directory integrity #70

Open jylee-bcm opened 1 month ago

jylee-bcm commented 1 month ago

I think a different solution would be better. I will start on another approach soon.

Problem

Until now, the data dependencies directory was not tracked by our git repository, so it was common to end up running with outdated data dependencies. This change lets us track the data dependencies by checksum and verify that we are using the intended version before running our Nextflow workflow. This PR should resolve issue #53.

Changes

  • Updated the README with instructions for using DVC
  • Added checksum validation using DVC (excluding VEP and HGMD-related files)
  • Allowed relative paths for ref_dir related parameters
  • Added a --skip_data_checksum option.

Proposed Issue Coverage

  • [x] Implement checksum verification as a default step in the pipeline
  • [x] Add a --skip_data_checksum option to bypass the verification
  • [x] Create a checksum file in the ref_dir if it doesn't exist
  • [x] Store prepared checksum values for internal and free versions
  • [x] Verify checksums against the prepared values without recreating them
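
For reviewers, the verification step in the checklist above can be pictured roughly as follows. This is only a minimal Python sketch of the idea, not the code in this PR (the PR relies on DVC for the actual check). The function names `file_digest` and `verify_ref_dir`, the `skip` argument, and the choice of MD5 as the default algorithm are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, algorithm: str = "md5", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading in chunks to keep memory bounded."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_ref_dir(ref_dir: Path, expected: dict, skip: bool = False) -> list:
    """Compare files under ref_dir with prepared checksum values.

    `expected` maps relative paths to hex digests (hypothetical format).
    Returns the relative paths that are missing or whose digest differs.
    With skip=True, in the spirit of --skip_data_checksum, nothing is checked.
    """
    if skip:
        return []
    problems = []
    for rel_path, want in sorted(expected.items()):
        target = ref_dir / rel_path
        if not target.is_file() or file_digest(target) != want:
            problems.append(rel_path)
    return problems
```

An empty return value means the directory matches the prepared values; a pipeline could refuse to start when the list is non-empty.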

Notes

  • VEP could not be covered by DVC because of its very large data volume.
  • We excluded HGMD-related files, so the DVC-tracked files are all shared between the internal and public versions.
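
To make the exclusions concrete, here is a small Python sketch of selecting which files would be covered by checksums while leaving some subtrees out. It is illustrative only: the directory names in `EXCLUDED_DIRS` and the function `tracked_files` are hypothetical, and the real layout of the data dependencies directory (and how DVC is configured to skip VEP and HGMD) may differ.

```python
from pathlib import Path

# Hypothetical directory names; the actual names under ref_dir may differ.
EXCLUDED_DIRS = frozenset({"vep", "hgmd"})


def tracked_files(ref_dir: Path, excluded=EXCLUDED_DIRS) -> list:
    """List files under ref_dir as POSIX-style relative paths,
    skipping any file that sits inside an excluded directory."""
    result = []
    for path in sorted(ref_dir.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(ref_dir)
        # Check only the parent directory components, not the file name itself.
        if any(part.lower() in excluded for part in rel.parts[:-1]):
            continue
        result.append(rel.as_posix())
    return result
```

Because the excluded subtrees never enter the list, the same set of tracked files applies to both the internal and the public data directories.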