Closed goetzk closed 5 years ago
I'm definitely okay with preferring sha256, especially if it's already being used by eprints.
Am I understanding correctly that a stock installation of eprints will generate both md5 and sha256 hashes for all uploaded files?
I would add that we should aim to support multiple hash algorithms from the start. In my experience, this should not greatly complicate development.
@photomedia - thoughts?
I took a look at a MySQL dump from a development instance of EPrints here at Concordia, and it looks like all uploaded files and EPrints-generated derivatives have an MD5 digest stored in the documents table in the MySQL database. This particular instance of EPrints was not generating any SHA-* hashes, only MD5s.
I tried to look in the EPrints source code where the generate_md5 and/or generate_sha functions are called, but only found the function definitions as pointed out by @goetzk above, I'm not terribly familiar with EPrints and don't know Perl, so it's very possible I'm missing something here. Is it possible to determine which function is being called by default when new files are uploaded?
Our EPrints instance generates MD5 only too. This seems to be the default for generating technical metadata about files in EPrints because -- although sha256 is supported by EPrints -- MD5 is specified in a number of configuration files, inc.
perl_lib/EPrints/System.pm perl_lib/EPrints/Apache/Storage.pm perl_lib/EPrints/DataObj/File.pm
There are other .pm and .pl files within EPrints that seem to reference the MD5 digest but I can't interpret why! But certainly the aforementioned would need to be modified and tested. It would be interesting to know (perhaps via the Eprints Tech listserve...?) whether anyone has modified their configuration to support sha256...
Since EPrints uses MD5 by default, the majority of repository instances will have MD5 hashes in their databases. Therefore, I suggest that we keep our spec to MD5 at this point. Supporting sha256 is a nice feature that would be useful for new repository instances once EPrints starts adding these to the database, but for now, the majority content in repositories likely only has MD5 hashes, so let's support these first.
@geo-mac I edited your comment to include the clarification from the comment/correction. I agree, let's reach out on the tech list to ask if anyone has modified their config to support sha256. Overall, I agree that it would be nice to include sha256 as an option for the export, but I would place this as a feature to implement later on if it is unlikely to be useful for anyone currently running eprints.
For anyone watching this discussion in the future: @photomedia contacted the tech list in early January - http://mailman.ecs.soton.ac.uk/pipermail/eprints-tech/2019-January/007648.html .
The only reply on list was mine, observing the same hashes (md5) and behaviour (inconsistency in their usage). http://mailman.ecs.soton.ac.uk/pipermail/eprints-tech/2019-January/007650.html
Based on the above discussion(s) I think we should remain with md5 as the specified format. It may be appropriate to add a note "Future revisions of this specification will add optional Sha support" or similar.
OK, so it is decided, we are going with MD5. Thanks @goetzk for the mysql table in your response on the list, it looks like we have inconsistent hashes for files in EPrints, so the issue of missing hashes remains open, https://github.com/eprintsug/EPrintsArchivematica/issues/5 , but I'm closing this one.
Hi, As a related point to #5 ; I would like to propose the spec supports Sha256 instead of md5 as its default checksum.
As a future feature it should probably support md5, sha1 and sha256 (archivematica's three options) but I don't see it being a pressing issue from the start as sha256 and md5 are those supported by eprints OOTB.
From eprints commit be42eb7f4dd9ebc184ff8f7fb0282d2f8e778f21