ESGF / esgf-download

ESGF data transfer and replication tool
https://esgf.github.io/esgf-download/
BSD 3-Clause "New" or "Revised" License

Queries with matching files break insertion into database #5

Closed. AtefBN closed this issue 1 year ago

AtefBN commented 1 year ago

Steps to replicate:

1. Run an initial download query.
2. Run a second download query with matching files.

Result:

IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: 
file.file_id
[SQL: INSERT INTO file (file_id, dataset_id, master_id, url, version, 
filename, local_path, data_node, checksum, checksum_type, size, status, sha) 
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]

(Background on this error at: https://sqlalche.me/e/20/gkpj)
See 
/gpfscmip/gpfsdata/esgf/esgpull2/log/esgpull-update-2023-04-21_09-58-03.log 
for error log.
AtefBN commented 1 year ago

Further details. Initial selection:

    distrib:       True                                                     
    experiment_id: piControl                                                
    frequency:     mon                                                      
    mip_era:       CMIP6                                                    
    source_id:     CESM2, CMCC-ESM2, CanESM5-CanOE, GISS-E2-1-G, UKESM1-0-LL
    variable_id:   dfe, mlotstmax, no3, phydiat, psl, si, tos, vo
    variant_label: r1i1p1f1           

Second selection:

    distrib:       True                                                     
    experiment_id: piControl                                                
    frequency:     mon                                                      
    mip_era:       CMIP6                                                    
    source_id:     CESM2, CMCC-ESM2, CanESM5-CanOE, GISS-E2-1-G, UKESM1-0-LL
    variable_id:   dfe, mlotstmax, no3, phydiat, psl, si, tos, vo           
AtefBN commented 1 year ago

When running esgpull update <tag of the second selection>, the user is prompted with a download confirmation message. After confirming, the insertion fails:

846 new files (1.4 TiB).
Send to download queue? [y/n] (y): y
IntegrityError: (sqlite3.IntegrityError) UNIQUE constraint failed: 
file.file_id
[SQL: INSERT INTO file (file_id, dataset_id, master_id, url, version, 
filename, local_path, data_node, checksum, checksum_type, size, status, sha) 
VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)]

(Background on this error at: https://sqlalche.me/e/20/gkpj)
See 
/gpfscmip/gpfsdata/esgf/esgpull2/log/esgpull-update-2023-04-21_10-25-03.log 
for error log.
Aborted!
svenrdz commented 1 year ago

For the record, this issue was traced to non-matching checksum fields on two replicas of the same file; it has nothing to do with multiple queries pointing to the same files. This case of miscomputed checksums is currently unhandled and produces the error reported here.
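
As an illustration of the failure mode only, here is a minimal sketch using Python's standard sqlite3 module. It assumes a simplified `file` table with a UNIQUE `file_id` column and assumes that two replicas with different checksums end up being inserted as separate rows. The table layout, the helper `insert_replicas`, and the sample values are hypothetical and are not taken from the esgpull code base.

```python
import sqlite3


def insert_replicas(replicas):
    """Insert rows into an in-memory table; return the error message, or None."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical, simplified schema: only file_id is constrained to be unique.
    conn.execute(
        "CREATE TABLE file (file_id TEXT UNIQUE, data_node TEXT, checksum TEXT)"
    )
    try:
        with conn:  # commits on success, rolls back on error
            conn.executemany(
                "INSERT INTO file (file_id, data_node, checksum) VALUES (?, ?, ?)",
                replicas,
            )
    except sqlite3.IntegrityError as exc:
        return str(exc)
    finally:
        conn.close()
    return None


# Two replicas of the same file whose checksum fields disagree (made-up values).
replicas = [
    ("example.file.nc", "node-a.example.org", "aaa"),
    ("example.file.nc", "node-b.example.org", "bbb"),
]
error = insert_replicas(replicas)
print(error)
```

Inserting only one of the two replicas succeeds; the error appears only when both rows reach the table with the same `file_id`.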

One solution might be to ignore all replicas of that file, since downloading either one will probably raise an error when the file's checksum is computed.
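
A rough sketch of that idea, not the fix that was actually shipped: group the candidate replicas by `file_id` and drop every replica of any file whose replicas disagree on the checksum. The function name `drop_conflicting_replicas` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict


def drop_conflicting_replicas(replicas):
    """Return (kept, conflicting_ids), dropping files with inconsistent checksums."""
    checksums = defaultdict(set)
    for replica in replicas:
        checksums[replica["file_id"]].add(replica["checksum"])
    # A file is conflicting when its replicas report more than one checksum.
    conflicting = {fid for fid, sums in checksums.items() if len(sums) > 1}
    kept = [r for r in replicas if r["file_id"] not in conflicting]
    return kept, conflicting


# Made-up sample data: file "a" has consistent replicas, file "b" does not.
candidates = [
    {"file_id": "a", "data_node": "node-1", "checksum": "111"},
    {"file_id": "a", "data_node": "node-2", "checksum": "111"},
    {"file_id": "b", "data_node": "node-1", "checksum": "222"},
    {"file_id": "b", "data_node": "node-2", "checksum": "333"},
]
kept, conflicting = drop_conflicting_replicas(candidates)
print(sorted(conflicting))
```

Returning the set of conflicting ids alongside the kept replicas would let a caller report the skipped files rather than dropping them silently.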

svenrdz commented 1 year ago

@AtefBN This is fixed in the latest version, 0.6.0, which is live on the IPSL conda channel.