gbif / portal-feedback

User feedback for the GBIF API, website and published data. You can ask questions here. 🗨❓

Checksum for occurrence snapshot ZIP downloads #5172

Open thompsonmj opened 9 months ago

thompsonmj commented 9 months ago

After downloading a dataset from a monthly snapshot, it would be helpful to have a server-provided checksum to verify the downloaded ZIP file against.

For instance, we are currently working with this download: https://doi.org/10.15468/dl.xw682s

Additional context: testing our downloaded copy of the archive with 7-Zip reports errors:

$ 7z t 0003602-240130105604617.zip

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,40 CPUs Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (50654),ASM,AES-NI)

Scanning the drive for archives:
1 file, 943196850376 bytes (879 GiB)

Testing archive: 0003602-240130105604617.zip

ERRORS:
Headers Error
Unconfirmed start of archive

WARNINGS:
There are data after the end of archive

--
Path = 0003602-240130105604617.zip
Type = zip
ERRORS:
Headers Error
Unconfirmed start of archive
WARNINGS:
There are data after the end of archive
Physical Size = 565762150683
Tail Size = 377434699693

ERROR: CRC Failed : occurrence.txt

Sub items Errors: 1

Archives with Errors: 1

Warnings: 1

Open Errors: 1

Sub items Errors: 1

thompsonmj commented 8 months ago

Update: I downloaded the same file again, this time with aria2 rather than wget (which should only affect speed), and the archive now passes the integrity test.

7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,64 bits,40 CPUs Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz (50654),ASM,AES-NI)

Scanning the drive for archives:
1 file, 938529016008 bytes (875 GiB)

Testing archive: /fs/scratch/PAS2136/gbif/data//2024-02-01/0003602-240130105604617.zip
--
Path = /fs/scratch/PAS2136/gbif/data//2024-02-01/0003602-240130105604617.zip
Type = zip
Physical Size = 938529016008
64-bit = +

Everything is Ok

Files: 61871
Size:       4846181236102
Compressed: 938529016008
ZIP file integrity test result: 0

FWIW, here is the MD5 for the second download attempt:

$ cat 0003602-240130105604617.zip_checksum.txt
16d5db9526b807050b799917c9336eaf
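
For anyone who wants to compare their own copy against the MD5 above, here is a minimal sketch of how that check could look in Python. It is not a GBIF tool: the function names `file_md5` and `matches_reference` are hypothetical, and MD5 is used only because that is the digest reported in this thread. The file is read in chunks so an archive of this size never has to fit in memory.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Stream the file so a very large archive is never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, expected_md5, expected_size=None):
    """Compare a local file against a reference digest and, optionally, a byte count."""
    if expected_size is not None and Path(path).stat().st_size != expected_size:
        # A size mismatch already shows the copy differs; skip the slow hash.
        return False
    return file_md5(path) == expected_md5.strip().lower()
```

With the values from this thread, the call would be `matches_reference("0003602-240130105604617.zip", "16d5db9526b807050b799917c9336eaf", 938529016008)`. The optional size check is cheap and would likely have flagged the first download immediately, since that copy was 943196850376 bytes rather than 938529016008.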