NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Registry-Sweepers Error: contained no hits when hits were expected #69

Closed: sjoshi-jpl closed this issue 9 months ago

sjoshi-jpl commented 10 months ago

💡 Description

The following error has been occurring for the ATM and GEO registry-sweeper tasks and has triggered multiple notifications during every Lambda run. Please take a look:

GEO-PROD

Error found in log group '/ecs/pds-geo-prod-registry-sweeper-task':

Timestamp (UTC): 2023-09-03 13:46:48.295000
Log Stream: ecs/pds-geo-prod-registry-sweeper-container/8f0a15e6c18d40b59214433c6abbe05f
Error Message: 2023-09-03 13:46:48,295::pds.registrysweepers.utils.db::ERROR::Response for query 346d70 contained no hits when hits were expected.  Returned data is incomplete.  Response was: {'_scroll_id': 'FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxZBSjMxUGlTSFNaS3c1WUd0b0pvZmtBAAAAAAAAPJgWekZYaWNTdFRRX3k3X3NFYWNSRGZlZxZibE9CX0F6SVRyR1R0RHlYSHBNb2F3AAAAAAAAOVEWdW5nTDhrYTJRSmVkdnQzWnAxaDNWZxZBSjMxUGlTSFNaS3c1WUd0b0pvZmtBAAAAAAAAPJcWekZYaWNTdFRRX3k3X3NFYWNSRGZlZw==', 'took': 2, 'timed_out': False, 'terminated_early': False, '_shards': {'total': 3, 'successful': 3, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 3460390, 'relation': 'eq'}, 'max_score': 1.0, 'hits': []}}

ATM-PROD

Error found in log group '/ecs/pds-atm-prod-registry-sweeper-task':

Timestamp (UTC): 2023-09-03 13:29:26.877000
Log Stream: ecs/pds-atm-prod-registry-sweeper-container/5ab74f8d4b4d45a6b19cda6cfb953956
Error Message: 2023-09-03 13:29:26,877::pds.registrysweepers.utils.db::ERROR::Response for query 945e60 contained no hits when hits were expected.  Returned data is incomplete.  Response was: {'_scroll_id': 'FGluY2x1ZGVfY29udGV4dF91dWlkDnF1ZXJ5VGhlbkZldGNoAxZ3UktlUnNsdVEzU3g2eGx6SVpYaFhRAAAAAAAAACMWdUltaGl1SG9UTU9rSzNWYmo3QTRpdxZ3UktlUnNsdVEzU3g2eGx6SVpYaFhRAAAAAAAAACQWdUltaGl1SG9UTU9rSzNWYmo3QTRpdxZWc180MnowYVFwR2hhb25hd0VncFpRAAAAAAAAAAwWejFZSERmV0xSN2kzclBtNjE1c0c1QQ==', 'took': 1, 'timed_out': False, 'terminated_early': False, '_shards': {'total': 3, 'successful': 2, 'skipped': 0, 'failed': 1, 'failures': [{'shard': -1, 'index': None, 'reason': {'type': 'illegal_state_exception', 'reason': 'node [z1YHDfWLR7i3rPm615sG5A] is not available'}}]}, 'hits': {'total': {'value': 649002, 'relation': 'eq'}, 'max_score': 1.0, 'hits': []}}

alexdunnjpl commented 10 months ago

Self-note - check whether or not scroll_id is being updated on every request, as it should be
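For illustration, a minimal sketch of the behaviour that note describes: each scroll continuation uses the `_scroll_id` from the most recent response rather than the one from the initial search. This is not the registry-sweepers code. `iter_scroll_hits`, its `search` and `scroll` callables, and `FakeScrollBackend` are hypothetical names made up for this example. The fake backend rotates its id on every page so that a stale id fails loudly; a real cluster may well return the same id on consecutive pages.

```python
from typing import Callable, Dict, Iterator, List


def iter_scroll_hits(
    search: Callable[[], Dict],
    scroll: Callable[[str], Dict],
) -> Iterator[Dict]:
    """Yield every hit from a scrolled query.

    `search` and `scroll` are hypothetical callables standing in for the
    client's initial-search and scroll-continuation requests.
    """
    response = search()
    while True:
        hits = response["hits"]["hits"]
        if not hits:
            # An empty page ends the scroll in this sketch.
            break
        yield from hits
        # Always continue from the id in the most recent response.
        scroll_id = response["_scroll_id"]
        response = scroll(scroll_id)


class FakeScrollBackend:
    """In-memory stand-in (assumption) that rotates the scroll id per page."""

    def __init__(self, docs: List[Dict], page_size: int) -> None:
        self.docs = docs
        self.page_size = page_size
        self.offset = 0
        self.current_id = ""
        self.ids_received: List[str] = []

    def _page(self) -> Dict:
        hits = self.docs[self.offset:self.offset + self.page_size]
        self.offset += len(hits)
        self.current_id = f"scroll-{self.offset}"
        return {
            "_scroll_id": self.current_id,
            "hits": {"total": {"value": len(self.docs)}, "hits": hits},
        }

    def search(self) -> Dict:
        return self._page()

    def scroll(self, scroll_id: str) -> Dict:
        # Reject anything but the latest id, so a stale id is caught at once.
        if scroll_id != self.current_id:
            raise ValueError(f"stale scroll id: {scroll_id}")
        self.ids_received.append(scroll_id)
        return self._page()


backend = FakeScrollBackend([{"_id": str(i)} for i in range(5)], page_size=2)
collected = list(iter_scroll_hits(backend.search, backend.scroll))
```

If the loop reused the id from the first response, the second continuation against this fake backend would raise `ValueError`, which is the kind of mistake the self-note is checking for.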

alexdunnjpl commented 10 months ago

The ATM-PROD failure appears to be on the OpenSearch side - see the shard failure reported in the response:

{
  'shard': -1,
  'index': None,
  'reason': {
    'type': 'illegal_state_exception',
    'reason': 'node [z1YHDfWLR7i3rPm615sG5A] is not available'
  }
}

@tloubrieu-jpl @sjoshi-jpl I'm not sure where to take this one, as it appears to be a failure of one of the shards associated with that node. Any ideas?

GEO-PROD is a bit weirder - it's not showing any shard failures, nor any errors. I'll introduce some improved logging; hopefully that yields useful new information.
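The difference between the two cases above can be sketched as a small check over the response body. This is only an illustration, not the logging added to the sweepers. `diagnose_empty_page` and its `hits_seen_so_far` parameter are hypothetical, the two dictionaries are abridged from the logged responses (scroll ids omitted), and the value `0` passed for `hits_seen_so_far` is a placeholder because the logs do not say how many hits had already been retrieved.

```python
from typing import Dict, List


def diagnose_empty_page(response: Dict, hits_seen_so_far: int) -> List[str]:
    """Return human-readable reasons why a scroll page looks incomplete."""
    problems: List[str] = []

    # Shard-level failures are reported under `_shards`, as in the ATM-PROD log.
    shards = response.get("_shards", {})
    if shards.get("failed", 0) > 0:
        for failure in shards.get("failures", []):
            reason = failure.get("reason", {})
            problems.append(
                f"shard failure: {reason.get('type')}: {reason.get('reason')}"
            )

    if response.get("timed_out"):
        problems.append("query timed out")

    # An empty page is only suspicious if fewer hits than `total` were seen.
    total = response["hits"]["total"]["value"]
    if not response["hits"]["hits"] and hits_seen_so_far < total:
        problems.append(
            f"empty page after {hits_seen_so_far} of {total} expected hits"
        )
    return problems


# Abridged from the ATM-PROD log entry.
atm_response = {
    "timed_out": False,
    "_shards": {
        "total": 3, "successful": 2, "skipped": 0, "failed": 1,
        "failures": [{
            "shard": -1,
            "index": None,
            "reason": {
                "type": "illegal_state_exception",
                "reason": "node [z1YHDfWLR7i3rPm615sG5A] is not available",
            },
        }],
    },
    "hits": {"total": {"value": 649002, "relation": "eq"}, "hits": []},
}

# Abridged from the GEO-PROD log entry.
geo_response = {
    "timed_out": False,
    "_shards": {"total": 3, "successful": 3, "skipped": 0, "failed": 0},
    "hits": {"total": {"value": 3460390, "relation": "eq"}, "hits": []},
}

# hits_seen_so_far=0 is a placeholder; the real count is not in the logs.
atm_problems = diagnose_empty_page(atm_response, hits_seen_so_far=0)
geo_problems = diagnose_empty_page(geo_response, hits_seen_so_far=0)
```

With these inputs the ATM-PROD response yields both a shard-failure entry and an empty-page entry. The GEO-PROD response yields only the empty-page entry, which is why that case needs additional logging to explain.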

jordanpadams commented 10 months ago

@sjoshi-jpl to look at this closer to see if the shard failure is happening regularly.

alexdunnjpl commented 9 months ago

Logging improvements were introduced in #72 and #73: https://github.com/NASA-PDS/registry-sweepers/commit/572ac4169bf473ed816d0a5c988009c3516ccc8d https://github.com/NASA-PDS/registry-sweepers/pull/73/commits/81cb14af7e1882483a980a18086592376a873e9c

Timeout thresholds have also been extended, though it won't be clear for a little while whether or not they're relevant to the timeouts in question.

alexdunnjpl commented 9 months ago

@sjoshi-jpl @jordanpadams I'm fairly certain this is resolved by #77.

Closing on that basis. @sjoshi-jpl would you please remove any exclusion rules in the error log escalation lambda? We can re-open if this issue reappears.

sjoshi-jpl commented 9 months ago

@alexdunnjpl done. This error message has been removed from the sweepers lambda exceptions list, so if it occurs again we should know.