ESPRI-Mod / synda

ESGF Downloader (this is a deprecated repository, the tool has now moved to https://github.com/ESGF/esgf-download)
https://espri-mod.github.io/synda/

Synda reports lots of failed checksums #192

Closed Zeitsperre closed 3 years ago

Zeitsperre commented 3 years ago

I've been trying to download a fairly extensive selection of CORDEX data, and I am noticing that on average more than 90% of downloads are failing due to a mismatched checksum. The data is being downloaded into a ZFS-formatted disk with on-the-fly compression, but I don't imagine that would have an impact on the SHA sums. The internet connection speed is more than adequate.

These numbers seem oddly high, and I am wondering whether there are any options I should look into. I'm not comfortable relaxing the checksum verification, as this data is being used in production. Is there an issue with the reported file checksums for CORDEX data on ESGF?

An example output:

2021-10-08 16:40:05,042 INFO SDDMDEFA-102 Transfer failed (sdget_status=0,sdget_error_msg=,error_msg='File corruption detected: local checksum doesn't match remote checksum',file_id=22398,status=error,local_path=/{me}/{folders}/synda/downloads/cordex/output/NAM-22/OURANOS/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/CRCM5/v1/day/tas/v20181107/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc,url=http://cordexesg.dmi.dk/thredds/fileServer/cordex_general/CORDEX/output/NAM-22/OURANOS/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/OURANOS-CRCM5/v1/day/tas/v20181107/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc)
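When most transfers fail, it can help to check whether the failures cluster on one data node. The following is a minimal sketch, assuming every line of transfer.log has the same shape as the example above (an SDDMDEFA event code followed by a parenthesised field list ending in url=...). The function name `failures_by_host` and the shortened sample lines (including the host data.example.org) are hypothetical and not part of synda.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Captures the event code (e.g. SDDMDEFA-102) and the trailing url=... field.
# Assumes url is the last field before the closing parenthesis, as in the
# example log line.
LINE_RE = re.compile(r"(SDDMDEFA-\d+) .*url=(\S+)\)\s*$")


def failures_by_host(lines):
    """Count 'Transfer failed' (SDDMDEFA-102) events per data-node host."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group(1) == "SDDMDEFA-102":
            counts[urlparse(match.group(2)).hostname] += 1
    return counts


# Shortened, hypothetical log lines in the same shape as the real output.
sample = [
    "2021-10-08 16:40:05,042 INFO SDDMDEFA-102 Transfer failed (file_id=1,status=error,url=http://cordexesg.dmi.dk/thredds/fileServer/a.nc)",
    "2021-10-08 16:41:10,120 INFO SDDMDEFA-102 Transfer failed (file_id=2,status=error,url=http://cordexesg.dmi.dk/thredds/fileServer/b.nc)",
    "2021-10-08 16:42:30,311 INFO SDDMDEFA-101 Transfer done (file_id=3,status=done,url=http://cordexesg.dmi.dk/thredds/fileServer/c.nc)",
    "2021-10-08 16:43:02,977 INFO SDDMDEFA-102 Transfer failed (file_id=4,status=error,url=http://data.example.org/thredds/fileServer/d.nc)",
]

for host, count in failures_by_host(sample).most_common():
    print(f"{host}: {count} failed transfers")
```

To run it on a real log, pass an open file object instead of `sample`, for example `failures_by_host(open("transfer.log"))`.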

And a queue readout:

status      count  size
done           89  31.6 GB
error        1134  419.8 GB
running         8  2.8 GB
waiting     48433  8.8 TB

Thanks again!
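The "more than 90%" figure can be checked directly against the queue readout above. This is a small sketch of the arithmetic, assuming that only the done and error rows count as attempted transfers (running and waiting files have no outcome yet).

```python
# Counts copied from the queue readout above.
queue = {"done": 89, "error": 1134, "running": 8, "waiting": 48433}

# Only transfers that have finished, successfully or not, count as attempts.
attempted = queue["done"] + queue["error"]
failure_rate = queue["error"] / attempted

print(f"{queue['error']} of {attempted} attempted transfers failed ({failure_rate:.1%})")
```

With these counts the failure rate is about 92.7% of attempted transfers.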

pjournou-ipsl commented 3 years ago

My knowledge of CORDEX data is limited, so I cannot answer your question yet. I need to investigate. First, I want to reproduce your use case, but at the moment I am running into a certificate failure. I will send you the results of my analysis as soon as possible. Best regards, Patrice

pjournou-ipsl commented 3 years ago

The test I have just run is the following:

USE CASE with synda version 3.35

synda install http://cordexesg.dmi.dk/thredds/fileServer/cordex_general/CORDEX/output/NAM-22/OURANOS/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/OURANOS-CRCM5/v1/day/tas/v20181107/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc
synda daemon start

RESULTS

1 / LOG INFORMATION (transfer.log)

2021-10-26 13:52:41,097 INFO SDDMDEFA-101 Transfer done (file_id=1,status=done,local_path=/synda/data/cordex/output/NAM-22/OURANOS/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/CRCM5/v1/day/tas/v20181107/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc,url=http://cordexesg.dmi.dk/thredds/fileServer/cordex_general/CORDEX/output/NAM-22/OURANOS/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/OURANOS-CRCM5/v1/day/tas/v20181107/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc)

2 / DB EXTRACTED INFORMATION AFTER DOWNLOAD

{
    'file_id': 1,
    'url': 'http://cordexesg.dmi.dk/thredds/fileServer/cordex_general/CORDEX/output/NAM-22/OURANOS/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/OURANOS-CRCM5/v1/day/tas/v20181107/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc',
    'file_functional_id': 'cordex.output.NAM-22.OURANOS.MPI-M-MPI-ESM-LR.rcp85.r1i1p1.CRCM5.v1.day.tas.v20181107.tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc',
    'filename': 'tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc',
    'local_path': 'cordex/output/NAM-22/OURANOS/MPI-M-MPI-ESM-LR/rcp85/r1i1p1/CRCM5/v1/day/tas/v20181107/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc',
    'data_node': 'cordexesg.dmi.dk',
    'checksum': '190fe842ff88999c2f0ceeb8fd5ed6d25e74ce5cc7d2c08a71d721252d451beb',
    'checksum_type': 'sha256',
    'duration': 11.576509,
    'size': 358575377,
    'rate': 30974396.253654707,
    'start_date': '2021-10-26 13:52:26.753712',
    'end_date': '2021-10-26 13:52:38.330221',
    'crea_date': '2021-10-26 13:52:23.120995',
    'status': 'done',
    'error_msg': '',
    'sdget_status': '0',
    'sdget_error_msg': '',
    'priority': 1000,
    'tracking_id': None,
    'model': None,
    'project': 'CORDEX',
    'variable': 'tas',
    'last_access_date': None,
    'dataset_id': 1,
    'insertion_group_id': 1,
    'timestamp': '2018-10-23T19:32:05Z',
}

The test result is OK

Can you confirm whether your use case gives you the same result today?

Could you also specify your synda version and the selection file you are using?

Best regards, Patrice.

pjournou-ipsl commented 3 years ago

I've made 11 more tests.

async_http_timeout = 600 seconds (set in the sdt.conf file)

RESULTS

First test (detailed in my previous comment above)
start_date = 2021-10-26 13:52:26.753712
duration = 11.576509

The 11 other tests

2021-10-26 15:08:22.471051 <= start_date <= 2021-10-26 15:37:06.183191
duration = 104.191526, 68.204497, 106.795432, 139.868149, 132.941996, 108.818363, 147.907556, 62.711311, 16.955667, 13.103844, 12.051813

ANALYSIS

We can see that the server response time is not stable: over the 11 additional tests, the duration ranges from 12.051813 seconds to 147.907556 seconds. The waiting time before the download actually starts can be significant (we assume that the effective download duration is around 11 s, as in the first test).
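The spread in the figures above can be summarised with a few lines of Python. This is only a sketch: it assumes all 12 tests downloaded the same file, so that the size from the DB record of the first test (358575377 bytes) applies to every run; the helper `rate_mb_per_s` is a hypothetical name.

```python
from statistics import mean

# 'size' field from the DB record of the first test; assumed identical
# for the 11 other runs.
FILE_SIZE_BYTES = 358_575_377

# Durations in seconds of the 11 additional tests, as listed above.
durations_s = [
    104.191526, 68.204497, 106.795432, 139.868149, 132.941996, 108.818363,
    147.907556, 62.711311, 16.955667, 13.103844, 12.051813,
]


def rate_mb_per_s(duration_s):
    """Effective end-to-end throughput in MB/s for one transfer."""
    return FILE_SIZE_BYTES / duration_s / 1e6


fastest, slowest = min(durations_s), max(durations_s)
print(f"min = {fastest:.1f} s ({rate_mb_per_s(fastest):.1f} MB/s)")
print(f"max = {slowest:.1f} s ({rate_mb_per_s(slowest):.1f} MB/s)")
print(f"mean = {mean(durations_s):.1f} s, slowest/fastest = {slowest / fastest:.1f}x")
```

The throughput computed this way includes any waiting time before the server starts sending data, which is why it drops so sharply for the slow runs.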

The server behavior may explain your results...

I am going to look into the synda error message to see whether it can be linked more clearly to the underlying problem in cases where the checksum itself is not the cause.

I hope this analysis may help you.

Patrice

pjournou-ipsl commented 3 years ago

On the synda side, the checksum is computed only when the size of the downloaded file matches the expected size.
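The same two-step check (size first, then checksum) can be reproduced by hand on a file that synda has flagged. The sketch below is not synda's own code; `verify_download` is a hypothetical helper built only on the standard library, and it assumes the expected size and SHA-256 value are taken from the ESGF metadata, such as the 'size' and 'checksum' fields of the DB record shown earlier in this thread.

```python
import hashlib
import os


def verify_download(path, expected_size, expected_sha256, chunk_size=1 << 20):
    """Return 'ok', 'size mismatch', or 'checksum mismatch' for a local file."""
    # Step 1: compare sizes; the file is hashed only when they agree.
    if os.path.getsize(path) != expected_size:
        return "size mismatch"

    # Step 2: hash the file in chunks so large NetCDF files are not
    # loaded into memory all at once.
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)

    if digest.hexdigest() == expected_sha256.lower():
        return "ok"
    return "checksum mismatch"


# Example call with the values from the DB record above; the local path is
# a placeholder to be replaced with the real download location.
# verify_download(
#     "path/to/tas_NAM-22_MPI-M-MPI-ESM-LR_rcp85_r1i1p1_OURANOS-CRCM5_v1_day_20860101-20901231.nc",
#     358575377,
#     "190fe842ff88999c2f0ceeb8fd5ed6d25e74ce5cc7d2c08a71d721252d451beb",
# )
```

Running this on a copy of the same file stored on a different filesystem would show whether the mismatch follows the file or the disk.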

The team would therefore like to add a new item to the analysis, regarding your statement: "The data is being downloaded into a ZFS-formatted disk with on-the-fly compression, but I don't imagine that would have an impact on the SHA sums."

The team suspects that the ZFS-formatted disk may explain the errors encountered during the checksum verification step.