bihealth / cubi-tk

CUBI Tooling for SODAR, VarFish et al.
MIT License

Should `irods check` validate the stored data or against the md5 file #185

Open xiamaz opened 1 year ago

xiamaz commented 1 year ago

Currently all check commands for irods work against the separately stored md5 file. This is similar to what is being done by the sodar server commands. After moving a landing zone, there should be no additional need to manually validate these files.

These commands duplicate logic already contained in irods, as validation of replica checksums against the stored data is already part of irods itself.

Unless there are sodar-independent workflows which require manual validation of uploaded md5 files, I would propose replacing the checks in cubi-tk with native irods checksum checks.

This affects irods/check, sea-snap/check_irods and snappy.

xiamaz commented 1 year ago

@ericblanc20 @holtgrewe Input would be much appreciated

ericblanc20 commented 1 year ago

I am not sure I understand what you propose to do. I may be mistaken, but I understand that:

In functional analysis projects, it is often valuable to be able to verify that the local analysis files (on the cluster) are identical to those stored on SODAR, especially when the analysis report has been re-run.

xiamaz commented 1 year ago

Thanks. The issue is that the checksum for each file is currently stored twice: in a separate md5 file with the same name, and in the irods metadata itself.

Given your use cases, at no point should the md5 file in irods be necessary, as it should always be better to let irods compute and store the checksum for us. E.g. irods check should just perform https://github.com/irods/python-irodsclient#computing-and-retrieving-checksums, and pipeline-specific checks should compare the checksum obtained from the irods metadata against a locally computed checksum.
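A rough sketch of what such a pipeline-specific check could look like, not the actual cubi-tk code. The function names are made up for illustration. It assumes the checksum registered in irods is either a bare MD5 hex digest or a `sha2:`-prefixed base64 SHA-256 digest, depending on how the server is configured. `fetch_irods_checksum` relies on the `checksum` attribute and `chksum()` method described in the python-irodsclient README linked above, and expects an already connected session.

```python
import base64
import hashlib


def local_checksum(path, algorithm="md5", chunk_size=1 << 20):
    """Return the raw digest bytes of a local file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.digest()


def matches_irods_checksum(path, irods_checksum):
    """Compare a local file against a checksum string taken from irods.

    Assumption: the string is either a bare MD5 hex digest or a
    "sha2:"-prefixed base64 SHA-256 digest.
    """
    if irods_checksum.startswith("sha2:"):
        expected = base64.b64decode(irods_checksum[len("sha2:"):])
        return local_checksum(path, "sha256") == expected
    return local_checksum(path, "md5") == bytes.fromhex(irods_checksum)


def fetch_irods_checksum(session, logical_path):
    """Hypothetical helper: read the registered checksum of a data object.

    `session` is assumed to be a connected python-irodsclient iRODSSession.
    If no checksum is registered yet, ask the server to compute one.
    """
    data_object = session.data_objects.get(logical_path)
    return data_object.checksum or data_object.chksum()
```

With something like this, a check would call `matches_irods_checksum(local_path, fetch_irods_checksum(session, irods_path))` and never need to download or parse the separate md5 file.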

sellth commented 1 year ago

This is an interesting point and maybe @mikkonie can chime in on this once he's back from vacation. Why do we actually move the .md5 files into the main iRODS storage? They are only needed for landing zone validation and could be discarded afterwards as the hashsums are also stored in the iRODS metadata.

Edit: I guess there is some use in having them readily available for another check after downloading data from SODAR (especially when not using iRODS tools, i.e. via Davrods), but this then raises the question of why they're not shown in the "List files" web view.