Closed: sjoshi-jpl closed this issue 9 months ago
Self-note: check whether scroll_id is being updated on every request, as it should be.
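For reference, a minimal sketch of correct scroll handling with opensearch-py (the host, index name, and query are illustrative, not the actual sweepers code):

```python
# Sketch of scroll pagination with opensearch-py. OpenSearch may return a
# new _scroll_id with each page, so the id must be refreshed from every
# response rather than reused from the initial search.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"])  # illustrative host

response = client.search(
    index="registry",  # illustrative index name
    scroll="10m",      # keep the scroll context alive between pages
    body={"query": {"match_all": {}}, "size": 1000},
)
scroll_id = response["_scroll_id"]

total = 0
while response["hits"]["hits"]:
    total += len(response["hits"]["hits"])
    response = client.scroll(scroll_id=scroll_id, scroll="10m")
    scroll_id = response["_scroll_id"]  # refresh the id on every request

client.clear_scroll(scroll_id=scroll_id)  # release the server-side context
print(f"scrolled {total} documents")
```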
The ATM-PROD failure appears to be on the OpenSearch side - see the shard failure below:
{
    'shard': -1,
    'index': None,
    'reason': {
        'type': 'illegal_state_exception',
        'reason': 'node [z1YHDfWLR7i3rPm615sG5A] is not available'
    }
}
@tloubrieu-jpl @sjoshi-jpl I'm not sure where to take this one, as it appears to be a failure of one of the shards associated with that node. Any ideas?
GEO-PROD is a bit weirder - it's showing neither shard failures nor errors. I'll introduce some improved logging; hopefully that yields new, useful information.
@sjoshi-jpl to look at this more closely and see whether the shard failure is happening regularly.
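One way to check whether the failing shard/node persists across runs is via the standard OpenSearch cat and cluster-health APIs; a hedged sketch (the host is illustrative):

```python
# Sketch: inspecting shard allocation and cluster health to see whether the
# unavailable node/shard recurs between sweeper runs.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=["https://localhost:9200"])  # illustrative host

# Per-shard state, including which node each shard is allocated to.
for shard in client.cat.shards(format="json"):
    print(shard["index"], shard["shard"], shard["state"], shard.get("node"))

# Shard-level health rollup for the whole cluster.
print(client.cluster.health(level="shards")["status"])
```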
Logging improvements introduced in #72 and #73:
https://github.com/NASA-PDS/registry-sweepers/commit/572ac4169bf473ed816d0a5c988009c3516ccc8d
https://github.com/NASA-PDS/registry-sweepers/pull/73/commits/81cb14af7e1882483a980a18086592376a873e9c
Timeout thresholds have also been extended, though it won't be clear for a little while whether they're relevant to the timeouts in question.
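For context, client-side timeout thresholds in opensearch-py can be extended at client construction or per request; the values below are illustrative, not the thresholds actually chosen in the linked commits:

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=["https://localhost:9200"],  # illustrative host
    timeout=60,             # per-request timeout in seconds, up from the library default
    max_retries=3,          # retry transient failures
    retry_on_timeout=True,  # retry rather than fail on a read timeout
)

# A higher timeout can also be set for a single expensive call.
client.search(
    index="registry",  # illustrative index name
    body={"query": {"match_all": {}}},
    request_timeout=120,
)
```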
@sjoshi-jpl @jordanpadams I'm fairly certain this is resolved by #77
Closing on that basis. @sjoshi-jpl would you please remove any exclusion rules in the error log escalation lambda? We can re-open if this issue reappears.
@alexdunnjpl done. This error has been removed from the sweepers Lambda exceptions list, so if it occurs again we should know.
💡 Description
The following error has been occurring for the ATM and GEO registry-sweeper tasks, triggering multiple notifications during every Lambda run. Please take a look:
GEO-PROD
ATM-PROD
(error logs attached in the original issue)