bede / hostile

Precise host read removal
MIT License

Discard partially downloaded indexes #20

Closed bede closed 8 months ago

bede commented 1 year ago

Currently, if a genome/index download is abandoned, Hostile may think the file is present and correct, leading to errors. Could download to a temp location and move into $XDG_DATA_DIR, or download and then rename, etc.

mbhall88 commented 1 year ago

You could possibly do a sha256 check on the existing file? That way you can be certain whether it's what you expect. I have some code for doing this in tbpore that you can use if you want to go that route:

https://github.com/mbhall88/tbpore/blob/1225472de54a2bd6c034b41f4540a1f539473822/tbpore/utils.py#L72-L87
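
For illustration, a minimal sketch of such a check using only the standard library. This is not the linked tbpore code; the function names `sha256_of_file` and `is_expected_file` are hypothetical, and the expected digest is assumed to come from wherever the checksums end up being stored.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large indexes never sit fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_expected_file(path: Path, expected_sha256: str) -> bool:
    """True only if the file exists and its digest matches the expected one."""
    if not path.is_file():
        return False
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

A truncated download would produce a different digest, so `is_expected_file` would return `False` and the file could be downloaded again.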

bede commented 1 year ago

Certainly an option. I guess I would either hardcode the checksums or put them into a manifest of some kind to check post-download. Using a newer database with an older version of Hostile could then cause checksum mismatches, so I'm tempted to just ensure that the download is completed for now.

bede commented 1 year ago

I've mitigated this in 0.1.0 by downloading indexes to a temporary file before moving (minimap2) or extracting (Bowtie2) into the destination XDG data directory. That way if the download is interrupted, the aligner won't try to use a corrupted ref.
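
Roughly, the download-then-move pattern looks like the sketch below. It is not the code in Hostile itself: `download_atomically` is a hypothetical name, it uses plain `urllib` for the transfer, and it only covers the move case, not extracting an archive.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_atomically(url: str, dest: Path) -> Path:
    """Download url so that dest only ever appears once it is complete."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file next to dest so the final rename stays on
    # one filesystem, where os.replace swaps the file in as a single step.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, tmp)
        os.replace(tmp_name, dest)
    except BaseException:
        # Interrupted or failed: drop the partial file, leave dest untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest
```

Because the partial data only ever lives under the temporary name, an abandoned download never leaves a file at the path the aligner looks for.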

If I implement checksum validation, I guess the obvious way to avoid hardcoding would be to put checksum files with the same filename prefix as the ref/index in object storage? Any thoughts @mbhall88?

mbhall88 commented 1 year ago

Yeah I think the "standard" method is putting the checksum in the storage location (e.g.). I think, as you say, this avoids hardcoding the checksums.
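
As a rough sketch of how that could work on the client side, assuming the checksum file has already been downloaded next to the index, is named `<index filename>.sha256`, and uses the `sha256sum` output format (`<hex digest>  <filename>`). The naming convention and both function names are assumptions for illustration only.

```python
import hashlib
from pathlib import Path


def read_sidecar_checksum(sidecar: Path) -> str:
    """Return the hex digest from the first line of a sha256sum-style file."""
    first_line = sidecar.read_text().splitlines()[0]
    # The digest is the first whitespace-separated field on the line.
    return first_line.split()[0].lower()


def verify_against_sidecar(path: Path) -> bool:
    """Compare a file's SHA-256 with the digest stored in `<path>.sha256`."""
    sidecar = path.with_name(path.name + ".sha256")  # assumed naming scheme
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest() == read_sidecar_checksum(sidecar)
```

Since the digest travels with the data in the storage location, replacing an index there only requires replacing its checksum file too, with no change to the client.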

bede commented 8 months ago

Thanks @mbhall88, I've added checksum verification in https://github.com/bede/hostile/commit/1e81debf73c4279f9682de08c1edf0791b15d47f for release in v1.