eprintsug / EPrintsArchivematica

Digital Preservation through EPrints-Archivematica Integration - An EPrints export plugin to Archivematica
6 stars 1 forks source link

Use Sha instead of md5 #6

Closed goetzk closed 5 years ago

goetzk commented 5 years ago

Hi, As a related point to #5 ; I would like to propose the spec supports Sha256 instead of md5 as its default checksum.

As a future feature it should probably support md5, sha1 and sha256 (archivematica's three options) but I don't see it being a pressing issue from the start as sha256 and md5 are those supported by eprints OOTB.

From eprints commit be42eb7f4dd9ebc184ff8f7fb0282d2f8e778f21

=item $md5 = $stored->generate_md5

Calculates and returns the MD5 for this file.

=cut

sub generate_md5
{
        my( $self ) = @_;

        my $md5 = Digest::MD5->new;

        $self->get_file(sub {
                $md5->add( $_[0] )
        });

        return $md5->hexdigest;
}

sub update_md5
{
        my( $self ) = @_;

        my $md5 = $self->generate_md5;

        $self->set_value( "hash", $md5 );
        $self->set_value( "hash_type", "MD5" );
}

=item $digest = $file->generate_sha( [ ALGORITHM ] )

Generate a SHA for this file, see L<Digest::SHA> for a list of supported algorithms. Defaults to "256" (SHA-256).

Returns the hex-encoded digest.

=cut

sub generate_sha
{
        my( $self, $alg ) = @_;

        $alg ||= "256";

        my $sha = Digest::SHA->new( $alg );

        $self->get_file(sub {
                $sha->add( $_[0] )
        });

        return $sha->hexdigest;
}

sub update_sha
{
        my( $self, $alg ) = @_;

        $alg ||= "256";

        my $digest = $self->generate_sha( $alg );

        $self->set_value( "hash", $digest );
        $self->set_value( "hash_type", "SHA-$alg" );
}
tw4l commented 5 years ago

I'm definitely okay with preferring sha256, especially if it's already being used by eprints.

Am I understanding correctly that a stock installation of eprints will generate both md5 and sha256 hashes for all uploaded files?

tw4l commented 5 years ago

I would add that we should aim to support multiple hash algorithms from the start. In my experience, this should not greatly complicate development.

@photomedia - thoughts?

tw4l commented 5 years ago

I took a look at a MySQL dump from a development instance of EPrints here at Concordia, and it looks like all uploaded files and EPrints-generated derivatives have an MD5 digest stored in the documents table in the MySQL database. This particular instance of EPrints was not generating any SHA-* hashes, only MD5s.

I tried to look in the EPrints source code where the generate_md5 and/or generate_sha functions are called, but only found the function definitions as pointed out by @goetzk above, I'm not terribly familiar with EPrints and don't know Perl, so it's very possible I'm missing something here. Is it possible to determine which function is being called by default when new files are uploaded?

geo-mac commented 5 years ago

Our EPrints instance generates MD5 only too. This seems to be the default for generating technical metadata about files in EPrints because -- although sha256 is supported by EPrints -- MD5 is specified in a number of configuration files, inc.

perl_lib/EPrints/System.pm perl_lib/EPrints/Apache/Storage.pm perl_lib/EPrints/DataObj/File.pm

There are other .pm and .pl files within EPrints that seem to reference the MD5 digest but I can't interpret why! But certainly the aforementioned would need to be modified and tested. It would be interesting to know (perhaps via the Eprints Tech listserve...?) whether anyone has modified their configuration to support sha256...

photomedia commented 5 years ago

Since EPrints uses MD5 by default, the majority of repository instances will have MD5 hashes in their databases. Therefore, I suggest that we keep our spec to MD5 at this point. Supporting sha256 is a nice feature that would be useful for new repository instances once EPrints starts adding these to the database, but for now, the majority content in repositories likely only has MD5 hashes, so let's support these first.

photomedia commented 5 years ago

@geo-mac I edited your comment to include the clarification from the comment/correction. I agree, let's reach out on the tech list to ask if anyone has modified their config to support sha256. Overall, I agree that it would be nice to include sha256 as an option for the export, but I would place this as a feature to implement later on if it is unlikely to be useful for anyone currently running eprints.

goetzk commented 5 years ago

For anyone watching this discussion in the future: @photomedia contacted the tech list in early January - http://mailman.ecs.soton.ac.uk/pipermail/eprints-tech/2019-January/007648.html .

The only reply on list was mine, observing the same hashes (md5) and behaviour (inconsistency in their usage). http://mailman.ecs.soton.ac.uk/pipermail/eprints-tech/2019-January/007650.html

goetzk commented 5 years ago

Based on the above discussion(s) I think we should remain with md5 as the specified format. It may be appropriate to add a note "Future revisions of this specification will add optional Sha support" or similar.

photomedia commented 5 years ago

OK, so it is decided, we are going with MD5. Thanks @goetzk for the mysql table in your response on the list, it looks like we have inconsistent hashes for files in EPrints, so the issue of missing hashes remains open, https://github.com/eprintsug/EPrintsArchivematica/issues/5 , but I'm closing this one.