ESGF / esgf-download

ESGF data transfer and replication tool
https://esgf.github.io/esgf-download/
BSD 3-Clause "New" or "Revised" License

error on import-synda for existing db #1

Open AtefBN opened 1 year ago

AtefBN commented 1 year ago

    (esgpull) -bash-4.2$ esgpull self import-synda /gpfscmip/gpfsdata/esgf/synda-cmn/db/CMIP5/sdt.db
    Found 810229 files to import, proceed? [y/n]: y
    Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
    AttributeError: 'NoneType' object has no attribute 'upper'
    See /gpfscmip/gpfsdata/esgf/esgpull1/log/esgpull-import_synda-2023-04-11_08-46-39.log for error log.
    Aborted!
    Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--
    (esgpull) -bash-4.2$ cat /gpfscmip/gpfsdata/esgf/esgpull1/log/esgpull-import_synda-2023-04-11_08-46-39.log
    [2023-04-11 10:46:46] DEBUG root
    Locals:
    {
        'self': SyndaFile(
            file_id=297,
            url='http://aims3.llnl.gov/thredds/fileServer/cmip5_css01_data/cmip5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
            file_functional_id='cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
            filename='areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
            local_path='CMIP5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
            data_node='aims3.llnl.gov',
            checksum=None,
            checksum_type=None,
            duration=None,
            size=42760,
            rate=None,
            start_date=None,
            end_date=None,
            crea_date='2020-11-03 14:44:18.596992',
            status='done',
            error_msg=None,
            sdget_status=None,
            sdget_error_msg=None,
            priority=1000,
            tracking_id='7181939e-4b39-4eaf-a4be-85eae5b5a9e9',
            model='FGOALS-g2',
            project='CMIP5',
            variable='areacella',
            last_access_date=None,
            dataset_id=97,
            insertion_group_id=1,
            timestamp='2013-03-12T17:25:11Z'
        ),
        'file_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
        'dataset_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.v20130314',
        'dataset_master': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0',
        'version': 'v20130314',
        'master_id': 'cmip5.output1.LASG-CESS.FGOALS-g2.lgm.fx.atmos.fx.r0i0p0.areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
        'url': 'https://aims3.llnl.gov/thredds/fileServer/cmip5_css01_data/cmip5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella/areacella_fx_FGOALS-g2_lgm_r0i0p0.nc',
        'local_path': 'CMIP5/output1/LASG-CESS/FGOALS-g2/lgm/fx/atmos/fx/r0i0p0/v20130314/areacella'
    }
    Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% -:--:--

    [2023-04-11 10:46:46] ERROR root

    Traceback (most recent call last):
      File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/tui.py", line 154, in logging
        yield
      File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/cli/self.py", line 235, in import_synda
        nb_imported = esg.import_synda(url=path, track=True, ask=True)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/esgpull.py", line 227, in import_synda
        file = synda_file.to_file()
               ^^^^^^^^^^^^^^^^^^^^
      File "/gpfscmip/gpfsdata/esgf/miniconda/envs/esgpull/lib/python3.11/site-packages/esgpull/models/synda_file.py", line 64, in to_file
        checksum_type=self.checksum_type.upper(),
                      ^^^^^^^^^^^^^^^^^^^^^^^^
    AttributeError: 'NoneType' object has no attribute 'upper'

svenrdz commented 1 year ago

It seems that this file (filename = areacella_fx_FGOALS-g2_lgm_r0i0p0.nc) has neither a checksum nor a checksum_type attribute in the synda CMIP5 database you are importing, and both are currently required by esgpull.

The query run by esgpull on the ESGF search API confirms that the 2 attributes are also missing from the file's metadata: https://esgf-node.ipsl.upmc.fr/esg-search/search?type=File&offset=0&limit=1&format=application%2Fsolr%2Bjson&fields=%2A&query=title%3Aareacella_fx_FGOALS-g2_lgm_r0i0p0.nc&distrib=true&latest=true&retracted=false

I am guessing synda used the same query to fill its database at the time this file was added. Now if I increase the limit parameter for this query (numFound tells us 4 replicas exist in this case), checksum and checksum_type do exist in the next 3 replicas' metadata.
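As a rough sketch of the lookup described above, the snippet below rebuilds the same search URL with a configurable limit and picks the first replica whose metadata carries both fields. The helper names (`build_search_url`, `first_checksum`) are hypothetical and not part of esgpull, and the assumption that field values may come back either as strings or as single-element lists is mine, not something confirmed in this thread.

```python
from typing import Optional
from urllib.parse import urlencode

# Search endpoint taken from the URL quoted in the comment above.
SEARCH_ENDPOINT = "https://esgf-node.ipsl.upmc.fr/esg-search/search"


def build_search_url(filename: str, limit: int = 10) -> str:
    """Build the File search URL quoted above, with a configurable limit."""
    params = {
        "type": "File",
        "offset": 0,
        "limit": limit,
        "format": "application/solr+json",
        "fields": "*",
        "query": f"title:{filename}",
        "distrib": "true",
        "latest": "true",
        "retracted": "false",
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def _scalar(value):
    """Unwrap a single-element list (assumed possible for multi-valued fields)."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def first_checksum(docs: list) -> Optional[tuple]:
    """Return (checksum, CHECKSUM_TYPE) from the first replica that has both."""
    for doc in docs:
        checksum = _scalar(doc.get("checksum"))
        checksum_type = _scalar(doc.get("checksum_type"))
        if checksum and checksum_type:
            return checksum, checksum_type.upper()
    return None  # no replica carries the missing metadata
```

With `limit=1` the builder reproduces the URL quoted above; the response documents would still have to be fetched and decoded separately before being passed to `first_checksum`.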

Knowing this, 2 things could be done during import to handle missing information:

The first solution might look more complete, but it could seriously slow down the import procedure and does not guarantee that the missing info will be filled, while the second solution is easy to set up but will definitely introduce divergence between the filesystem and the database.

meteorologist15 commented 7 months ago

I also encountered this error. My solution was simply to skip the files that were missing metadata bits, and for me, since there weren't a lot of files missing metadata, this was an acceptable loss. I simply added a try/except block with a little extra information to bypass the error halting the program and add the information to the log. I may submit a pull request soon with my proposed code changes:

In esgpull.py

        nb_imported = 0
        for start in iter_idx_range:
            stop = min(len(synda_ids), start + size)
            ids = synda_ids[start:stop]
            synda_files = synda.scalars(sql.synda_file.with_ids(*ids))
            files: list[File] = []
            for synda_file in synda_files:
                try:
                    file = synda_file.to_file()
                except AttributeError as e:
                    # to_file() calls checksum_type.upper(), which raises
                    # AttributeError when the synda row has no checksum_type.
                    logger.warning(e)
                    warn_msg = f"Skipping {synda_file.filename} due to missing database metadata. Continuing to the next file"
                    print(warn_msg)
                    logger.warning(warn_msg)
                    continue

                if file.sha not in shas:
                    file.queries.append(self.legacy_query)
                    files.append(file)
                    synda_shas.add(file.sha)
            if files:
                nb_imported += len(files)
                self.db.add(*files)
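
A possible variation on this workaround, sketched with stand-in types rather than esgpull's real classes: instead of catching `AttributeError` (which would also hide unrelated bugs raised inside `to_file()`), the two fields can be checked explicitly and the skipped filenames collected for a summary at the end. `SyndaRecord` and `split_importable` are hypothetical names used for illustration only.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class SyndaRecord:
    """Stand-in for a synda file row; this is not esgpull's SyndaFile."""

    filename: str
    checksum: Optional[str]
    checksum_type: Optional[str]


def split_importable(records):
    """Separate records that carry checksum metadata from those that do not."""
    importable = []
    skipped = []
    for record in records:
        if record.checksum is None or record.checksum_type is None:
            # Same outcome as the try/except above, but only for this one cause.
            logger.warning("Skipping %s: missing checksum metadata", record.filename)
            skipped.append(record.filename)
            continue
        importable.append(record)
    return importable, skipped


records = [
    SyndaRecord("areacella_fx_FGOALS-g2_lgm_r0i0p0.nc", None, None),
    SyndaRecord("example_with_checksum.nc", "abc123", "sha256"),  # made-up row
]
importable, skipped = split_importable(records)
print(f"{len(importable)} importable, {len(skipped)} skipped")
```

Keeping the list of skipped filenames makes it easier to see afterwards which files on disk are not tracked in the database.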