ESPRI-Mod / synda

ESGF Downloader (this is a deprecated repository, the tool has now moved to https://github.com/ESGF/esgf-download)
https://espri-mod.github.io/synda/
21 stars 11 forks source link

'timestamp' not found when installing selection file #152

Closed mathause closed 4 years ago

mathause commented 4 years ago

When trying to install a selection file synda fails with a key error. This started some days ago and does not happen for all queries. One example that fails

synda install -s selection/cmip6_fx.txt

where selection/cmip6_fx.txt file looks as follows:

project="CMIP6"
source_id="all"
experiment_id="historical"
variant_label="all"
variable_id="areacella"
frequency="fx"

This fails with the following traceback:

*** Error occured at 2020-08-07 14:20:00.065239 ***

Traceback (most recent call last):
  File "/lhome/l_mathause/sdt/bin/synda", line 174, in <module>
    status=sdtiaction.actions[args.subcommand](args)
  File "/lhome/l_mathause/sdt/lib/sd/sdtiaction.py", line 352, in install
    (status,newly_installed_files_count)=sdinstall.run(args)
  File "/lhome/l_mathause/sdt/lib/sd/sdinstall.py", line 53, in run
    return _install(metadata,interactive,args.timestamp_right_boundary)
  File "/lhome/l_mathause/sdt/lib/sd/sdinstall.py", line 107, in _install
    sdenqueue.run(metadata,timestamp_right_boundary)
  File "/lhome/l_mathause/sdt/lib/sd/sdenqueue.py", line 75, in run
    metadata=sdpipelineprocessing.run_pipeline(metadata,po)
  File "/lhome/l_mathause/sdt/lib/sd/sdpipelineprocessing.py", line 54, in run_pipeline
    chunk=f(chunk,*args,**kwargs)
  File "/lhome/l_mathause/sdt/lib/sd/sdenqueue.py", line 131, in add_files
    add_file(File(**f))
  File "/lhome/l_mathause/sdt/lib/sd/sdenqueue.py", line 141, in add_file
    sdfiledao.add_file(f,commit=False)
  File "/lhome/l_mathause/sdt/lib/sd/sdfiledao.py", line 29, in add_file
    return sdsqlutils.insert(file,keys_to_insert,commit,conn)
  File "/lhome/l_mathause/sdt/lib/sd/sdsqlutils.py", line 93, in insert
    d=get_dict(instance,columns_subset)
  File "/lhome/l_mathause/sdt/lib/sd/sdsqlutils.py", line 88, in get_dict
    d[k]=instance.__dict__[k]
KeyError: 'timestamp'

The discovery.log does not give any hints for missing timestamps:

2020-08-07 14:19:49,598 INFO SYNDINST-006 Process 'selection/cmip6_fx.txt'
2020-08-07 14:19:49,875 INFO SDSEARCH-580 Retrieve metadata from remote service
2020-08-07 14:19:50,039 INFO SYNDARUN-001 Process query 0
2020-08-07 14:19:51,094 INFO SYDPROXY-100 Search-API call completed (returned-files-count=778,match-count=778,url=http://esgf-data.dkrz.de/esg-search/search?variable_id=areacella&fields=*&project=CMIP6&frequency=fx&experiment_id=historical&distrib=true&type=File&format=application%2Fsolr%2Bjson&limit=9000&offset=0).
2020-08-07 14:19:52,495 INFO SDSEARCH-584 Metadata successfully retrieved (778 files)
2020-08-07 14:19:52,495 INFO SDSEARCH-590 Metadata processing begin
2020-08-07 14:19:52,495 INFO SDPIPELI-004 Start main pipeline
2020-08-07 14:19:53,154 INFO SDPIPELI-006 Main pipeline completed
2020-08-07 14:19:53,154 INFO SDPIPELI-002 Start shrink processing
2020-08-07 14:19:53,618 INFO SDSHRINK-008 Start uniq filter..
2020-08-07 14:19:53,627 INFO SSHRINKU-002 Remove duplicate and replicate..
2020-08-07 14:19:53,627 INFO SYNDRMDR-001 Build 'seen' table..
2020-08-07 14:19:54,078 INFO SYNDRMDR-002 Perform duplicate and replicate suppression..
2020-08-07 14:19:54,517 INFO SDSHRINK-010 uniq filter completed
2020-08-07 14:19:54,519 INFO SDPIPELI-002 Shrink processing completed
2020-08-07 14:19:54,520 INFO SDSEARCH-594 Metadata processing end
2020-08-07 14:19:54,521 INFO SDSEARCH-620 Retrieve timestamps begins
2020-08-07 14:19:54,521 INFO SYNDABTI-301 Submit timestamp queries..
2020-08-07 14:19:54,707 INFO SYNDARUN-001 Process query 0
2020-08-07 14:19:55,305 INFO SYDPROXY-100 Search-API call completed (returned-files-count=780,match-count=780,url=http://esgf-data.dkrz.de/esg-search/search?variable_id=areacella&fields=instance_id,timestamp,_timestamp,type,size&project=CMIP6&frequency=fx&experiment_id=historical&distrib=true&type=Dataset&format=application%2Fsolr%2Bjson&limit=9000&offset=0).
2020-08-07 14:19:56,083 INFO SYNDABTI-304 Transform timestamp data struct..
2020-08-07 14:19:56,097 INFO SYNDABTI-100 325 datasets with timestamp retrieved
2020-08-07 14:19:56,097 INFO SYNDABTI-306 Set missing timestamp..
2020-08-07 14:19:56,538 INFO SDSEARCH-634 Retrieve timestamps ends
2020-08-07 14:19:56,579 INFO SYNDINST-001 'keep new status' process begins
2020-08-07 14:19:57,020 INFO SYNDINST-024 'keep new status' process ends
2020-08-07 14:19:59,058 INFO SYNDINST-002 Store metadata in database..
2020-08-07 14:19:59,446 INFO SDENQUEU-102 Add insertion_group_id..
2020-08-07 14:19:59,864 INFO SDENQUEU-103 Insert files and datasets..
2020-08-07 14:20:00,063 INFO SDENQUEU-003 Create transfer (local_path=/lhome/l_mathause/download/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/fx/areacella/gn/v20200218/areacella_fx_TaiESM1_historical_r1i1p1f1_gn.nc,url=http://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/fx/areacella/gn/v20200218/areacella_fx_TaiESM1_historical_r1i1p1f1_gn.nc)
2020-08-07 14:20:00,063 INFO SDENQUEU-002 create dataset (dataset_path=CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/fx/areacella/gn/v20200218)
2020-08-07 14:20:00,064 INFO SDENQUEU-003 Create transfer (local_path=/lhome/l_mathause/download/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/fx/areacella/gn/v20200624/areacella_fx_TaiESM1_historical_r1i1p1f1_gn.nc,url=http://esgf-data3.diasjp.net/thredds/fileServer/esg_dataroot/CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/fx/areacella/gn/v20200624/areacella_fx_TaiESM1_historical_r1i1p1f1_gn.nc)
2020-08-07 14:20:00,064 INFO SDENQUEU-002 create dataset (dataset_path=CMIP6/CMIP/AS-RCEC/TaiESM1/historical/r1i1p1f1/fx/areacella/gn/v20200624)
2020-08-07 14:20:00,064 INFO SDENQUEU-003 Create transfer (local_path=/lhome/l_mathause/download/CMIP6/CMIP/AWI/AWI-ESM-1-1-LR/historical/r1i1p1f1/fx/areacella/gn/v20200212/areacella_fx_AWI-ESM-1-1-LR_historical_r1i1p1f1_gn.nc,url=https://esgf-data1.llnl.gov/thredds/fileServer/css03_data/CMIP6/CMIP/AWI/AWI-ESM-1-1-LR/historical/r1i1p1f1/fx/areacella/gn/v20200212/areacella_fx_AWI-ESM-1-1-LR_historical_r1i1p1f1_gn.nc)
2020-08-07 14:20:00,064 INFO SDENQUEU-002 create dataset (dataset_path=CMIP6/CMIP/AWI/AWI-ESM-1-1-LR/historical/r1i1p1f1/fx/areacella/gn/v20200212)

I don't really have an idea what could cause this? Is it a parallel processing problem?


A potential workaround is to replace

d[k]=instance.__dict__[k]

with

if key == "timestamp":
    d[k] = instance.__dict__.get(k, "1970-01-01T00:00:00.000Z")
else:
    d[k]=instance.__dict__[k]

Is it essential that the timestamp is correct?

@beyerle

painter1 commented 4 years ago

I'm the one who put 'timestamp' in the list of keys to be saved, so I will look into this.

Another solution, also in sdsqlutils.insert.get_dict(), would be something like

  if k in instance.__dict__:
     d[k] = instance.__dict__[k]

But a more fundamental problem is that the file has no timestamp at this point. I will try to reproduce what you saw.

beyerle commented 4 years ago

I see the problem of missing timestamp since July 29. It would be nice, if you can reproduce it. For me it looks like a server-side issue.

painter1 commented 4 years ago

@mathause and @beyerle , what index node do you use?

painter1 commented 4 years ago

There's no need to tell me what index node you use; they are probably all the same. I could reproduce the problem. Synda uses this search url, which is correct: https://esgf-node.llnl.gov/esg-search/search?variable_id=areacella&fields=*&project=CMIP6&frequency=fx&experiment_id=historical&type=File&format=application%2Fsolr%2Bjson If you just visit this url and look hard, you can find the problem - look at the first of the three entries in the JSON list for AWI.

The problem is that the index node returns bad information on some files. It almost surely originates with whoever published it. We could work around a missing time stamp, but that is just a symptom. There is other questionable metadata there. A work-around to the timestamp problem would just postpone the error to when it is harder to catch.

BTW, we don't really need the 'timestamp' attribute. It's almost completely useless because it doesn't have a standard definition. This is the first time that I've seen a good use for it - it helped us catch a publishing error!

Synda does need to be fixed. Errors should not be mysterious, they should be easy to find and understand. The right thing to do is to catch the exception, log it, and move on to the next file. The problem file will never get into the database, let alone be downloaded, until someone asks the publisher to re-publish the file. But if the installation failure is properly logged, it will be possible to do this.

I will comment later once I have a fix ready.

painter1 commented 4 years ago

I just submitted a pull request, "Two bug fixes to prevent crashes", #153. It fixes this issue and another, new problem which can cause a Synda crash. I think that this is more urgent than my earlier pull requests.

zklaus commented 4 years ago

@painter1, we also encountered this problem. One thing that stands out is that all problematic files reside on esgf-data1.llnl.gov as can be verified with a command like the one below. Using a different search node yields the same result. What's more, the same files on other replicas seem to work. So maybe there is a server problem going on after all?

wget -O- 'https://esgf-node.llnl.gov/esg-search/search?variable_id=areacella&fields=*&project=CMIP6&frequency=fx&experiment_id=historical&type=File&format=application%2Fsolr%2Bjson&limit=10000' | jq '.response.docs|map(select(has("timestamp")|not))|map(.data_node)|unique'
AtefBN commented 4 years ago

Hello all, been away for a while I apologize. @painter1 I did not see your PR and I had to deploy a hotfix for this specific issue in v3.11 which is currently live on conda. I think it's in the same spirit of the PR. I'll be marking this issue as closed for now, feel free to reopen if it pops back up.

painter1 commented 4 years ago

@zklaus, I don't see what is wrong. For me that command line just returns

[ "esgf-data1.llnl.gov" ]

and visiting the url in a web browser, I don't see anything unusual. Could you specifically report what your experience is?

painter1 commented 4 years ago

@AtefBN , note that my PR also addresses another "bug" which was newly created by a change in ESGF software. This change is deployed at LLNL and will probably be propagated throughout ESGF, eventually. And the change addressing this issue does not overlap with yours, and is generally useful. So please take a look at this pull request.

zklaus commented 4 years ago

@zklaus, I don't see what is wrong. For me that command line just returns

[ "esgf-data1.llnl.gov" ]

and visiting the url in a web browser, I don't see anything unusual. Could you specifically report what your experience is?

What you say is exactly right and what this tells us is that this problem only occurs with the data node esgf-data1.llnl.gov.

I should have broken the command down, let me do this now. The wget part executes exactly the search that you used before with the addition of limit=10000 so that we get many more results than the default 10.

The resulting json is then filtered by the jq command. The filter expression is a series of sub-expressions that works like this:

.response.docs|  # extract the docs list that is part of the response field of the json
map(select(has("timestamp")|not))|  # select all entries that have no "timestamp" field. remove the "|not" part to invert selection
map(.data_node)|  # print the "data_node" field from those results
unique  # remove duplicates

If we remove |not from this, we get all results that do have a timestamp field. This produces the following list:

[
  "aims3.llnl.gov",
  "crd-esgf-drc.ec.gc.ca",
  "dist.nmlab.snu.ac.kr",
  "esg-dn1.nsc.liu.se",
  "esg1.umr-cnrm.fr",
  "esgf-data.ucar.edu",
  "esgf-data1.llnl.gov",
  "esgf-data2.diasjp.net",
  "esgf-data3.diasjp.net",
  "esgf-node2.cmcc.it",
  "esgf.nci.org.au",
  "esgf.rcec.sinica.edu.tw",
  "esgf3.dkrz.de",
  "noresg.nird.sigma2.no",
  "vesg.ipsl.upmc.fr"
]

So we see that files from all nodes have timestamps except for some files served by LLNL. Note that the data node in question appears in both lists, so the problem might be quite subtle.

If we restrict the search further to the AWI ESM by appending &source_id=AWI-ESM-1-1-LR (convenience link), we get only the three replicas of one file.

The one replica of that file that resides at LLNL is lacking the timestamp field, the other two replicas do have the timestamp field.

Notably, the original upload at DKRZ (identified by replica: false) is functional.

Thus, it seems something has gone wrong in the process of replication to the node esgf-data1.llnl.gov at LLNL.

AtefBN commented 4 years ago

@painter1 Yes I will shortly get to ingest the PRs on this repo, hopefully by start of next week. I will subsequently cut a release with the changes for everyone to use.

painter1 commented 4 years ago

@zklaus , thank you for that detailed information. This AWI file is the problem file from before. I will investigate and should understand the problem next week.

zklaus commented 4 years ago

@painter1, thanks! I did deliberately choose the search and file that you mentioned before as an example. However, in our efforts to maintain our local partial replica of esgf data for analysis, we have encountered this problem with other files as well, always only with the LLNL node. If you need more examples, let me know and I will peruse the logs.

painter1 commented 4 years ago

@zklaus, I just want you to know that your issue with timestamp not being available has not been forgotten. It's a publisher issue though, not a Synda issue. So Sasha Ames intends to look into it.

zklaus commented 4 years ago

@painter1 thanks for keeping me posted! Let me know if you need more input or testing. Should I be tracking this in some other ways?

painter1 commented 4 years ago

I'm not aware of anything formal, e.g. nothing on an issue tracker.