NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Remediate findings from Provenance Script Testing #55

Closed sjoshi-jpl closed 9 months ago

sjoshi-jpl commented 11 months ago

While testing the new registry-sweepers, I am noticing that for all the nodes (domains), when the provenance script reaches the point where it's trying to write files to the db, it's taking up a significant amount of FreeStorageSpace for that specific node cluster. Ex: When running the IMG provenance task (1 vCPU, 8 GB RAM, which doesn't seem enough as it runs for longer than an hour), it brought FreeStorageSpace down from 43 GB to 2 GB. The storage space returns to normal once the task completes.

Per discussion with @jordanpadams @tloubrieu-jpl @alexdunnjpl, this is expected behavior for heavy writes. Following are my remediation suggestions.

  1. Increase the CloudWatch evaluation period for all alarms to 5 mins instead of 1 min (did this already after noticing alerts today). This should give the provenance task some additional time to complete without throwing alerts, but it won't help in all cases because some nodes run much longer than others (ex: GEO, IMG).
  2. Increase the volume size of the OpenSearch nodes for which the provenance task is significantly impacting the FreeStorageSpace (EN, GEO, IMG, RMS, SBNPSI).
  3. For nodes that are heavily used, we can increase the provenance task vCPU / memory.
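A minimal sketch of suggestion 1, assuming the alarm is a standard CloudWatch alarm on the `FreeStorageSpace` metric in the `AWS/ES` namespace, and that "5 mins" is implemented as five consecutive 1-minute datapoints (a single 300-second period would be the other reading). The alarm name, domain name, account ID, and threshold are hypothetical placeholders, not values from this deployment. The function only builds the keyword arguments that boto3's `put_metric_alarm` accepts; it makes no AWS call.

```python
def free_storage_alarm_params(domain_name, client_id, threshold_mib,
                              period_seconds=60, evaluation_periods=5):
    """Build keyword arguments for CloudWatch put_metric_alarm (no AWS call is made)."""
    return {
        "AlarmName": f"{domain_name}-free-storage-low",  # hypothetical naming scheme
        "Namespace": "AWS/ES",
        "MetricName": "FreeStorageSpace",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},
        ],
        "Statistic": "Minimum",
        "Period": period_seconds,                 # length of one datapoint, in seconds
        "EvaluationPeriods": evaluation_periods,  # datapoints considered per evaluation
        "DatapointsToAlarm": evaluation_periods,  # all of them must breach to alarm
        "Threshold": float(threshold_mib),        # assumed unit: MiB; verify for your metric
        "ComparisonOperator": "LessThanOrEqualToThreshold",
    }


# Hypothetical domain and account ID, with a 5 GiB threshold expressed in MiB.
params = free_storage_alarm_params("example-img-domain", "123456789012", 5 * 1024)
window_seconds = params["Period"] * params["EvaluationPeriods"]
print(window_seconds)
```

With credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. A longer window only delays the alert; it does not help if free space stays low for the whole run, which matches the caveat about GEO and IMG.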
sjoshi-jpl commented 11 months ago

EN - Increase EBS volume to 60 GB
ATM - No change
IMG - Increase EBS volume to 60 GB, increase task size to 2 vCPU, 12 GB RAM
RMS - Increase task size to 1 vCPU, 8 GB
GEO - Increase task size to 2 vCPU, 12 GB
NAIF - No change
PPI - No change
PSA - No change
SBNPSI - No change (for now; although it threw an alert, increasing the evaluation period should help here)
SBNUMD - No change

sjoshi-jpl commented 11 months ago

Opened DSIO #4280 for increasing EBS volume size (IMG and EN OpenSearch nodes)

sjoshi-jpl commented 10 months ago

@tloubrieu-jpl @jordanpadams after weighing all available options, it looks like our best bet here is to increase the volume size from 100 to 120 per node. Approval received from Jordan; I will work with the SA team.

tloubrieu-jpl commented 10 months ago

That sounds good, thanks @sjoshi-jpl

sjoshi-jpl commented 10 months ago

DSIO-4306 created with the SA team. Once completed, I'll need to revise each registry-sweeper task definition to write to its own log group.
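
A rough sketch of what per-node log groups could look like, assuming the sweeper tasks are ECS task definitions using the `awslogs` log driver (the thread does not state which driver or launch type is in use). The log group naming scheme, region, and stream prefix are hypothetical; the node list is the one from the sizing comment above.

```python
def log_configuration(node, region="us-west-2"):
    """Return a container logConfiguration block with a dedicated log group per node."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": f"/registry-sweepers/{node.lower()}",  # hypothetical naming
            "awslogs-region": region,                               # assumed region
            "awslogs-stream-prefix": "provenance",                  # hypothetical prefix
        },
    }


nodes = ["EN", "ATM", "IMG", "RMS", "GEO", "NAIF", "PPI", "PSA", "SBNPSI", "SBNUMD"]
log_groups = {node: log_configuration(node)["options"]["awslogs-group"] for node in nodes}
print(log_groups["IMG"])
```

Each returned block would go under the `logConfiguration` key of that node's container definition, so that one node's sweeper output never lands in another node's log group.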

sjoshi-jpl commented 10 months ago

All tasks completed. We have individual log groups for each node.