NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

On create() calls, dynamic checksum calculation is incorrect for larger files #1492

Closed csjx closed 3 years ago

csjx commented 3 years ago

Jasmine has been cloning data from the LTER repository to the Arctic Data Center repository, and has run into an issue where only some files in a package are cloned correctly. The error in Metacat is:

metacat 20210312-15:08:29: [ERROR]: D1NodeService.writeStreamToFile - the check sum calculated from the saved local file is DA39A3EE5E6B4B0D3255BFEF95601890AFD80709. But it doesn't match the value from the system metadata efb14793e275d3af56835a3752579913c73a0f8b for the object urn:uuid:da61eb2f-72d1-4205-a7f9-a8e5cc9b0d4d [edu.ucsb.nceas.metacat.dataone.D1NodeService:writeStreamToFile:1809]

In this example, the complete LTER package displays in DataONE Search correctly. However, the same package cloned to the test ADC repository only includes four of the packaged files.

Note that the files that succeeded in the create() call are all in the KB size range, whereas the ones that failed are in the MB range. The example from the error above is the BLE_LTER_circulation_burst_2019_2020.nc NetCDF spatial data file.

When manually downloading the file and calculating the checksum, it matches the checksum stated in the system metadata:

$ curl -o "urn:uuid:da61eb2f-72d1-4205-a7f9-a8e5cc9b0d4d.nc" "https://pasta.lternet.edu/package/data/eml/knb-lter-ble/7/3/ade38aa0bf2fef0c3617036dfb9835e7"
$ shasum -a 1 "urn:uuid:da61eb2f-72d1-4205-a7f9-a8e5cc9b0d4d.nc"
efb14793e275d3af56835a3752579913c73a0f8b  urn:uuid:da61eb2f-72d1-4205-a7f9-a8e5cc9b0d4d.nc

So I think the new performance code in release 2.13.0, which dynamically calculates checksums as we write objects to disk, is failing on files in the MB range or above. We likely need a unit test with larger content to try to reproduce the issue.
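A minimal sketch of what such a test could exercise, assuming the dynamic checksum approach wraps the file output stream in a java.security.DigestOutputStream (the actual D1NodeService.writeStreamToFile code may differ; the class and variable names below are illustrative only): stream a multi-MB random payload to a temp file while updating a digest on the fly, then compare against a checksum computed directly over the source bytes.

```java
// Sketch only: verify that a checksum calculated while streaming a large
// object to disk matches one computed directly over the source bytes.
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.util.Random;

public class LargeObjectChecksumSketch {

    public static void main(String[] args) throws Exception {
        // Generate ~20 MB of pseudo-random content, well above the KB range
        // where the create() calls were reported to succeed.
        byte[] content = new byte[20 * 1024 * 1024];
        new Random(42L).nextBytes(content);

        // Expected checksum, computed directly over the source bytes.
        String expected = hex(MessageDigest.getInstance("SHA-1").digest(content));

        // Write the object to disk while updating a digest on the fly,
        // mimicking the "calculate the checksum as we write" approach.
        Path target = Files.createTempFile("checksum-sketch", ".bin");
        MessageDigest streaming = MessageDigest.getInstance("SHA-1");
        try (InputStream in = new ByteArrayInputStream(content);
             OutputStream out = new DigestOutputStream(Files.newOutputStream(target), streaming)) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        String actual = hex(streaming.digest());

        System.out.println("expected: " + expected);
        System.out.println("actual:   " + actual);
        System.out.println("match:    " + expected.equals(actual));
        Files.deleteIfExists(target);
    }

    private static String hex(byte[] digest) {
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }
}
```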

taojing2002 commented 3 years ago

Hi Chris. I used the curl command below to successfully upload the 15 MB data file BLE_LTER_circulation_wave.csv to both my local Metacat instance and test.arcticdata.io.

curl  -H "Authorization: Bearer ${token}" -F "sysmeta=@sysmeta-data.xml" -F "object=@BLE_LTER_circulation_wave.csv"  -F "pid=data.62.1" -X POST "https://test.arcticdata.io/metacat/d1/mn/v2/object" 

So I am not sure if it is really a bug or not. I will try a bigger file as well.

taojing2002 commented 3 years ago

Now, when I tested an 884 MB file, I did run into the issue.

taojing2002 commented 3 years ago

Now I can't reproduce the issue. Jasmine and I tested against test.arcticdata.io, and I found that the file size is always zero in the Metacat temp folder when the objects have the wrong checksum.
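For reference, the checksum in the original error, DA39A3EE5E6B4B0D3255BFEF95601890AFD80709, is the SHA-1 of empty input, which is consistent with a zero-length temp file. A minimal check:

```java
// Quick check: the SHA-1 of zero bytes matches the checksum reported in the
// original error, consistent with the zero-length temp files observed here.
import java.security.MessageDigest;

public class EmptySha1Check {
    public static void main(String[] args) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1").digest(new byte[0]);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        // Prints da39a3ee5e6b4b0d3255bfef95601890afd80709
        System.out.println(sb);
    }
}
```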

taojing2002 commented 3 years ago

This could be an R client issue. Jasmine found that one of the R functions has a limit on the file size, which could be the culprit. She is trying to fix it now.

taojing2002 commented 3 years ago

It works for her now after she fixed the file size limit. So Metacat doesn't have the issue; the R client did.