Public-Health-Bioinformatics / cpo-pipeline

An analysis pipeline for the purpose of investigating Carbapenemase-Producing Organisms.
MIT License

Verify integrity and identity of databases #2

Open dfornika opened 5 years ago

dfornika commented 5 years ago

The pipeline has several external data dependencies (databases for kraken2, mash sketches, etc). There should be a way to verify if those databases are in an expected state, or if there have been changes to them. For example, two 'standard' kraken2 databases that are built on different dates may have different contents due to the ever-changing contents of RefSeq.

We may be able to use some sort of pre-computed checksum to verify the database integrity. We may not want to verify on every pipeline run, because calculating the hashes can be slow. Maybe provide a separate 'database verification' script that could be run periodically, or run once before a set of pipeline runs is submitted.
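As a rough illustration of the checksum idea, here is a minimal sketch of what a standalone verification script might look like. It is not part of the pipeline: the function names (`sha256_of_file`, `build_manifest`, `verify_database`), the JSON manifest layout, and the choice of SHA-256 are all assumptions made for the example. The manifest is assumed to be stored outside the database directory so that it is not hashed as part of the database.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(db_dir):
    """Map each file path (relative to db_dir) to its SHA-256 digest.

    Hypothetical helper: run once when a database is known to be good,
    then save the result as JSON alongside the pipeline configuration.
    """
    db_dir = Path(db_dir)
    return {
        path.relative_to(db_dir).as_posix(): sha256_of_file(path)
        for path in sorted(db_dir.rglob("*"))
        if path.is_file()
    }


def verify_database(db_dir, manifest_path):
    """Compare a database directory against a previously stored manifest.

    Returns a dict with sorted lists of 'missing', 'changed' and
    'unexpected' files. All three are empty when the database matches.
    """
    expected = json.loads(Path(manifest_path).read_text())
    actual = build_manifest(db_dir)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "changed": sorted(
            name
            for name in expected.keys() & actual.keys()
            if expected[name] != actual[name]
        ),
        "unexpected": sorted(set(actual) - set(expected)),
    }
```

Because every file is read in full, a script like this would be run periodically or once before a batch of runs is submitted, not inside each pipeline run.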

ddooley commented 5 years ago

Take a peek at Kive (http://cfe-lab.github.io/Kive/) and ask Don Kirkby about what they did in the hashing department. Kive had great foresight in hashing all inputs and using that to be able to stop/continue jobs and know which parts had to be rerun. Not sure if that included reference databases, but I wouldn't be surprised if so. They may have some quick hashing tips.