Open witxka opened 4 years ago
Related code: https://github.com/PanDAWMS/dkb/blob/master/Utils/Dataflow/data4es-nested/095_datasetInfoAMI/amiDatasets.py#L198-L201
This error does not interrupt the whole process, right? And since it's rare, it is probably not something we can fix on our side.
I suggest we do the following about it:
1. Make it look a bit more accurate (not as a traceback): check for the 'error' response field and output its value instead of a full traceback.
2. Pause the process and retry a bit (30-60 seconds) later. It seems that in this case the errors appeared within a short interval (2020-10-07 10:24:41 to 2020-10-07 10:24:52), so I believe they were caused by a restart of some service on the AMI server side. BTW, if we get a response from the AMI server, the pyAMI client doesn't try to query another instance, although in this case that might have been our rescue :(
3. If the retry fails, (properly) skip processing of this message and go on.
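The suggestions above could be sketched roughly as follows. This is only an illustration, not the actual code in amiDatasets.py: `query` is a hypothetical callable standing in for the AMI request, and the assumption that a failed response is a dict with an 'error' key is mine and should be checked against what pyAMI really returns.

```python
import sys
import time


def short_error(response):
    """Return the value of the 'error' field, or None if there is none.

    Assumption: a failed response is a dict that carries an 'error' key.
    """
    if isinstance(response, dict):
        return response.get('error')
    return None


def query_with_retry(query, message, pause=30, sleep=time.sleep):
    """Run query(message); on error wait `pause` seconds and retry once.

    `query` is a hypothetical stand-in for the AMI request.
    Returns the response, or None if the message should be skipped.
    """
    for attempt in (1, 2):
        response = query(message)
        error = short_error(response)
        if error is None:
            return response
        # Log a one-line message instead of a full traceback.
        sys.stderr.write("(WARN) AMI error (attempt %d): %s\n"
                         % (attempt, error))
        if attempt == 1:
            sleep(pause)
    sys.stderr.write("(WARN) Skipping message: AMI is still failing.\n")
    return None
```

The `sleep` parameter exists only so the pause can be replaced in tests; in the stage itself the default `time.sleep` would be used.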
The only problem here is that if the issue isn't fixed within a couple of minutes and we wait 30 seconds for every message passing through this stage, the whole process will take a very long time. Is that OK, or do we need a more elaborate scenario, like "if it fails N times in a row, stop retrying; just query AMI once for each new message and skip it if the problem's still there"?
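One possible shape for the "N times in a row" scenario is sketched below. The class name `RetryPolicy`, its parameters, and the `query` callable are all hypothetical, and catching a bare `Exception` is a placeholder for whatever exception the pyAMI client actually raises.

```python
import time


class RetryPolicy(object):
    """Stop pausing and retrying once `max_failures` messages failed in a row."""

    def __init__(self, max_failures=3, pause=30, sleep=time.sleep):
        self.max_failures = max_failures
        self.pause = pause
        self.sleep = sleep
        self.failures_in_a_row = 0

    def call(self, query, message):
        """Return query(message), or None if the message must be skipped.

        `query` is a hypothetical callable that raises on AMI failure.
        """
        # Always one attempt; a second one only while retrying is enabled.
        attempts = 2 if self.failures_in_a_row < self.max_failures else 1
        for attempt in range(attempts):
            try:
                result = query(message)
            except Exception:  # placeholder for the real pyAMI exception
                if attempt + 1 < attempts:
                    self.sleep(self.pause)
                continue
            # Success: AMI is back, so re-enable retries.
            self.failures_in_a_row = 0
            return result
        self.failures_in_a_row += 1
        return None
```

With this approach the worst case costs `max_failures` pauses in total; after that every message is queried once and skipped on failure until AMI answers again, at which point the counter resets.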
root@aiatlas171:/var/log/dkb/data4es-hourly.log