DataONEorg / rdataone

R package for reading and writing data at DataONE data repositories
http://doi.org/10.5063/F1M61H5X

Create packages with uniform checksum #261

Closed: gothub closed this issue 3 years ago

gothub commented 3 years ago

Quick synopsis: all DataPackage members must use the same checksum algorithm in order to facilitate serialization to the BagIt format.

The details: In order to serialize a DataPackage to BagIt format, all package members must have checksums computed with the same algorithm. BagIt does not mandate a specific checksum algorithm, but each payload manifest must list every file in the bag together with its checksum, and all entries in a manifest must use the same algorithm.
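To make the manifest requirement concrete, here is a minimal sketch of writing a BagIt payload manifest with a single algorithm. The helper name `write_payload_manifest` and the example bag layout are made up for illustration, and the `digest` package is assumed to be installed. This is not how rdataone serializes packages.

```r
library(digest)  # assumed available; digest() computes file checksums

# Hypothetical helper: write a payload manifest for the files under
# <bag_dir>/data, hashing every file with the same algorithm.
write_payload_manifest <- function(bag_dir, algorithm = "sha256") {
  payload <- list.files(file.path(bag_dir, "data"), recursive = TRUE)
  checksums <- vapply(
    payload,
    function(f) digest(file.path(bag_dir, "data", f), algo = algorithm, file = TRUE),
    character(1)
  )
  # One line per payload file: "<checksum>  <path relative to the bag root>"
  lines <- sprintf("%s  data/%s", checksums, payload)
  manifest <- file.path(bag_dir, sprintf("manifest-%s.txt", algorithm))
  writeLines(lines, manifest)
  invisible(manifest)
}

# Usage: build a tiny bag in a temporary directory
bag <- file.path(tempdir(), "example-bag")
dir.create(file.path(bag, "data"), recursive = TRUE, showWarnings = FALSE)
writeLines("a,b\n1,2", file.path(bag, "data", "table.csv"))
writeLines("hello", file.path(bag, "data", "readme.txt"))
manifest_path <- write_payload_manifest(bag)
manifest_lines <- readLines(manifest_path)
```

Because the manifest file name encodes the algorithm, a package whose members carry a mix of MD5, SHA-1 and SHA-256 checksums cannot be written to one manifest without rehashing some members.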

In order to facilitate this serialization, a DataPackage must have all package members store their member checksum using the same checksum algorithm.

All workflows for creating a DataPackage must support a consistent checksum:

  1. Create a package from local files
  2. Create a package by downloading from DataONE
     a. an entire package can be downloaded via getDataPackage()
     b. individual members can be downloaded via getDataObject() and added to a DataPackage
     c. either a. or b. can specify lazy-loading, such that the sysmeta is downloaded but not the data bytes
  3. Create a package using a combination of 1. and 2.

One way to support all these workflows is to add a checksumAlgorithm parameter to the download methods whose proposed signatures are listed below, with the default value being SHA-256. When DataObjects are newly created, this algorithm will be used. When DataObjects are created from objects downloaded from DataONE and the sysmeta of the existing object records a different checksum algorithm, the checksum will be recalculated with the requested algorithm and stored in the sysmeta of the DataObject. If the DataObject was lazy-loaded, a request is sent to DataONE to calculate the desired checksum for the pid, and the returned value is stored in that DataObject's sysmeta.
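The recalculation rule for downloaded objects can be sketched roughly as follows. This is an illustration only: `unify_checksum` is a hypothetical helper, the system metadata is modelled as a plain list rather than the real SystemMetadata class, and the `digest` package is assumed to be installed. The lazy-loaded case (asking DataONE for the checksum) is not covered because it needs a network call.

```r
library(digest)  # assumed available

# Hypothetical stand-in for a member's system metadata: a plain list with
# `checksum` and `checksumAlgorithm` fields.
unify_checksum <- function(sysmeta, data_path, algorithm = "SHA-256") {
  if (identical(toupper(sysmeta$checksumAlgorithm), toupper(algorithm))) {
    return(sysmeta)  # already uses the requested algorithm; nothing to do
  }
  # Map DataONE-style algorithm names to the names digest() expects.
  digest_algo <- switch(toupper(algorithm),
    "SHA-256" = "sha256",
    "SHA-1"   = "sha1",
    "MD5"     = "md5",
    stop("unsupported checksum algorithm: ", algorithm)
  )
  # Recompute from the local bytes and record the new algorithm.
  sysmeta$checksum <- digest(data_path, algo = digest_algo, file = TRUE)
  sysmeta$checksumAlgorithm <- algorithm
  sysmeta
}

# Usage: a member whose downloaded sysmeta recorded an MD5 checksum
data_file <- tempfile(fileext = ".csv")
writeLines("a,b\n1,2", data_file)
old_sysmeta <- list(
  checksum = digest(data_file, algo = "md5", file = TRUE),
  checksumAlgorithm = "MD5"
)
new_sysmeta <- unify_checksum(old_sysmeta, data_file)
```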

Here are the proposed signatures for the modified methods:

setMethod("getDataObject", "D1Client", function(x, identifier, lazyLoad=FALSE, limit="1MB", quiet=TRUE,
                                                checksumAlgorithm="SHA-256")
setMethod("getDataPackage", "D1Client", function(x, identifier, lazyLoad=FALSE, limit="1MB", quiet=TRUE,
                                                 checksumAlgorithm="SHA-256")

ThomasThelen commented 3 years ago

Edit: Relative to Metacat's implementation, I think that creating data packages whose system metadata uses a uniform checksum algorithm is in general a good idea. I don't think that designing the BagIt stuff to expect or need uniform checksums is a good idea. Even when this change is made, we're still left with n existing data packages whose system metadata uses a mix of checksum algorithms (which means we'll have to write code to support them anyway).

What we get from a package with uniform checksumming is the performance boost of not having to checksum the files (if we think that's a good idea).

My plan (discussed later in the V2 call) is to checksum every file as it's leaving Metacat by creating the checksum on the fly (see DigestInputStream). Checksumming the outgoing bytes is fairly cheap, keeps the code less complex, gives a 'more accurate' checksum (it would reflect any bit rot since the time of submission), allows for flexibility in the bag checksum algorithm, and gives us the ability to (if we want/can) perform a final validation of the streamed checksums against what exists in the system metadata.

gothub commented 3 years ago

@ThomasThelen the changes to rdataone mentioned here involve using the DataONE MNRead.get() call.

Aren't the changes that you are making to Metacat related to the MNPackage.getPackage() call? If this is true, then the changes mentioned here aren't affected by your changes.

gothub commented 3 years ago

After testing this a bit more, I realized that the default case should be to NOT recalculate the checksum of each package member as it is downloaded from DataONE. In the initial implementation, the default for getDataPackage() was to recalculate checksums as "SHA-256" (the new dataone package default) for any member not already using this checksum algorithm.

The new default is to not re-calculate the checksums when using getDataPackage() or getDataObject(), as this could cause long processing delays for packages with many members. If using one or both of these functions, users will have to specify a checksum algorithm to use if they wish to have the checksum recalculated and stored in the sysmeta of package members.
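The opt-in behaviour can be illustrated with a small sketch. The helper `members_to_recalculate` is hypothetical, as are the identifiers. Using NA as the "do not recalculate" default is an assumption made for this example, not something taken from the package source.

```r
# Hypothetical helper: report which members would need their checksum
# recalculated. `existing` is a named character vector mapping a member
# identifier to the algorithm recorded in its sysmeta. With requested = NA
# (the assumed opt-out default), nothing is recalculated.
members_to_recalculate <- function(existing, requested = NA_character_) {
  if (is.na(requested)) {
    return(character(0))
  }
  names(existing)[toupper(existing) != toupper(requested)]
}

existing <- c(
  "urn:uuid:metadata" = "SHA-256",
  "urn:uuid:table"    = "MD5",
  "urn:uuid:script"   = "SHA-1"
)

# Default: sysmeta is left untouched, so no member is rehashed.
default_result <- members_to_recalculate(existing)

# Opt in: only members not already using SHA-256 are rehashed.
sha256_result <- members_to_recalculate(existing, "SHA-256")
```

With this shape, the cost of rehashing is only paid by callers who ask for a specific algorithm, and members that already match are skipped.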

The use case for this functionality, as described above, is to create a package in which all members use the same checksum algorithm, whether that is the original algorithm or a different one, to allow serialization to BagIt. Note that updates to BagIt serialization will be added in the next dataone release, so this recalculation functionality might not be exercised by users until then (but it's ready now).

Note: I have not found a way using httr to have the checksum calculated as bytes are streamed to the client. If anyone knows of a way to do this, please post here.