PanDAWMS / dkb

Data Knowledge Base for HENP experiments
0 stars 2 forks source link

(ERROR) (ProcessorStage) Failed to process dataset #427

Open witxka opened 4 years ago

witxka commented 4 years ago

root@aiatlas171:/var/log/dkb/data4es-hourly.log

(95) 2020-10-07 10:24:52 (DEBUG) (ProcessorStage) Traceback (most recent call last):
(95) (==)   File "/home/dkb/dkb-test.git/Utils/Dataflow/data4es/run/../095_datasetInfoAMI/amiDatasets.py", line 167, in process
(95) (==)     amiPhysValues(data)
(95) (==)   File "/home/dkb/dkb-test.git/Utils/Dataflow/data4es/run/../095_datasetInfoAMI/amiDatasets.py", line 207, in amiPhysValues
(95) (==)     raise Exception("Unexpected AMI response: %s" % json_str)
(95) (==) Exception: Unexpected AMI response: {u'AMIMessage': [{u'commandArgs': [{u'args': [{u'@argName': u'logicalDatasetName', u'@argValue': u'mc16_13TeV.451853.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_98_42.simul.HITS.e8247_e5984_s3126'}, {u'@argName': u'scope', u'@argValue': u'mc16_13TeV'}]}], u'executionTime': [{u'$': u'0.018 '}], u'time': [{u'$': u'2020-10-07  at 10:24:52 AM CEST'}], u'connection': [{u'@amiLogin': u'dkb', u'@type': u'router', u'$': u'successful'}], u'command': [{u'$': u'AMIGetPhysicsParamsForDataset'}], u'dumpCommand': [{u'param': [{u'@name': u'command', u'$': u'AMIGetPhysicsParamsForDataset'}, {u'@name': u'logicalDatasetName', u'$': u'mc16_13TeV.451853.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_98_42.simul.HITS.e8247_e5984_s3126'}, {u'@name': u'scope', u'$': u'mc16_13TeV'}]}], u'error': [{u'$': u"You don't have rights to execute this command."}]}]}
(95) 2020-10-07 10:24:52 (ERROR) (ProcessorStage) Failed to process dataset 'mc16_13TeV.451878.MadGraphPythia8EvtGen_A14NNPDF23LO_HAHMggfZdZd4l_560_230.simul.HITS.e8248_e5984_s3126_tid22774608_00'
mgolosova commented 4 years ago

Related code: https://github.com/PanDAWMS/dkb/blob/master/Utils/Dataflow/data4es-nested/095_datasetInfoAMI/amiDatasets.py#L198-L201

This error does not interrupt the whole process, right? And since it's rare, it must be not about something we can fix on our side.

I suggest that we do about the following about this:

  1. Make it look a bit more accurate (not as a traceback):

    • [ ] in case of "unexpected response" (try to) extract 'error' response field and output it's value instead of a full traceback.
  2. Pause the process and retry a bit (30-60 seconds) later. Seems like in this case the error appeared in a short interval of time (2020-10-07 10:24:41-- 2020-10-07 10:24:52), so I believe it was due to restart of some service at the AMI server side. BTW, if we get the AMI server response, the pyAMI client doesn't try to query another instance, while in this case it might be our resque :(

  3. If the retry failed -- (properly) skip this message processing and go on:

    • [ ] mark the message as 'incomplete'.

The only problem here is that if the issue wasn't somehow fixed in a couple of minutes, and we're waiting for 30 seconds for every message passing through this stage -- the whole process will take very long time. Is it OK, or do we need some more elaborate scenario, like "if it fails N times in a row, stop retrying; just query AMI once for each new message and skip it if the problem's still here"?