NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

Improve Fault Tolerance of Harvest for Forbidden Access error and Timeout #125

Closed by jordanpadams 10 months ago

jordanpadams commented 1 year ago

💡 Description

See resolution of #124

Examples of two different timeouts:

  1. [ERROR] 10,000 milliseconds timeout on connection
  2. [ERROR] Read timed out

From GEO:

We recently added a new PDART data bundle to our MEX archives. The bundle has over 1.5 million files. Harvest timed out last night, and it just timed out again in the last couple of minutes. Here is a screen capture: the top part of the screen is the previous attempt overnight; the next is my run attempt from minutes ago. (screenshot attached)

From ATM: harvest-3.7.6 with the -O option to overwrite the previous version. First run:

[ERROR] method [GET], host [https://search-atm-prod-mkvgzojag2ta65bnotqdpopzju.us-west-2.es.amazonaws.com:443], URI [/registry/_search], status line [HTTP/1.1 429 Too Many Requests] 429 Too Many Requests /registry/_search

Second run:


2023-09-06 14:36:12,768 [ERROR] Read timed out
2023-09-06 14:36:51,590 [ERROR] 429 Too Many Requests /registry/_bulk
2023-09-06 14:39:21,175 [ERROR] Read timed out
2023-09-06 14:39:50,633 [ERROR] Read timed out
2023-09-06 14:40:35,683 [ERROR] Read timed out
2023-09-06 14:40:44,339 [ERROR] Read timed out
2023-09-06 14:41:21,252 [ERROR] Read timed out
2023-09-06 14:45:33,578 [ERROR] Read timed out
2023-09-06 14:56:04,527 [ERROR] Read timed out
2023-09-06 15:00:18,015 [ERROR] request body is required

[INFO] Wrote 43134 product(s)
[SUMMARY] Summary:
[SUMMARY] Skipped files: 0
[SUMMARY] Loaded files: 43134
[SUMMARY] Product_Browse: 11514
[SUMMARY] Product_Bundle: 1
[SUMMARY] Product_Collection: 9
[SUMMARY] Product_Document: 8
[SUMMARY] Product_Observational: 31602
[SUMMARY] Failed files: 10
[SUMMARY] Package ID: d32d4b8d-b306-4a03-b2e8-b5b203c7a30e



[harvest.log](https://github.com/NASA-PDS/harvest/files/12551487/harvest.log)
alexdunnjpl commented 1 year ago

Following on from the outcome of #124:

@jordanpadams was the increase in disk allocation this morning (approximately) the result of auto-tuning, or a manual action?

tloubrieu-jpl commented 1 year ago

@alexdunnjpl will investigate the sweeper to see if it is causing instability.

@sjoshi-jpl configured slow logs to watch for long-running requests to OpenSearch.

tloubrieu-jpl commented 1 year ago

The documents on ATM and GEO are bigger than usual, which makes the fixed-size chunks (e.g. 10,000 documents per page) too large.

The solution is to reduce the page size used by repairkit.

ATM works well with smaller pages.

But there is a remaining issue on GEO, possibly related to disk usage. The issue happens when we write the version of repairkit that ran on the documents; it is likely caused by the re-indexation of those documents.
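
For illustration, a minimal sketch of the page-size idea, assuming a generic bulk consumer: split the document set into smaller fixed-size pages so that each request sent to OpenSearch stays small. The class name, callback, and page size here are illustrative, not the actual repairkit configuration.

```java
import java.util.List;
import java.util.function.Consumer;

public class PagedProcessor {

    /**
     * Splits a large document set into fixed-size pages and hands each page
     * to the supplied processor (e.g. a bulk update call). Smaller pages mean
     * smaller requests, which are less likely to time out or hit 429s.
     */
    public static <T> void processInPages(List<T> documents, int pageSize,
                                          Consumer<List<T>> processPage) {
        for (int start = 0; start < documents.size(); start += pageSize) {
            int end = Math.min(start + pageSize, documents.size());
            processPage.accept(documents.subList(start, end));
        }
    }
}
```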

tloubrieu-jpl commented 1 year ago

ATM is now available and stable.

GEO only had one timeout.

PSA still has issues.

alexdunnjpl commented 1 year ago

Suggested approach: use harvest unit tests to mock non-200 responses to exercise the retry policy once it is implemented.
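
A minimal JUnit 5 sketch of that approach, with the HTTP client mocked by a lambda that returns a scripted sequence of status codes. The `RegistryClient` interface and `sendWithRetry` helper are hypothetical stand-ins, not the actual harvest or registry-common classes.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;

import java.util.Iterator;
import java.util.List;
import org.junit.jupiter.api.Test;

class RetryPolicyTest {

    /** Hypothetical stand-in for the registry client; returns an HTTP status code. */
    interface RegistryClient {
        int send(String request);
    }

    /** Hypothetical retry wrapper: retries while the status is 429 or 5xx. */
    static int sendWithRetry(RegistryClient client, String request, int maxAttempts) {
        int status = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            status = client.send(request);
            if (status != 429 && status < 500) {
                return status;
            }
        }
        return status;
    }

    @Test
    void retriesUntilSuccessAfter429Responses() {
        // Mocked response sequence: two 429 "Too Many Requests", then a 200.
        Iterator<Integer> responses = List.of(429, 429, 200).iterator();
        RegistryClient mock = request -> responses.next();

        assertEquals(200, sendWithRetry(mock, "/registry/_bulk", 5));
    }
}
```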

tloubrieu-jpl commented 12 months ago

@alexdunnjpl will implement retry behavior in harvest to resolve this ticket.
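
One possible shape for such retry behavior is exponential backoff on transient IO failures (e.g. the "Read timed out" errors above). This is only a sketch under that assumption; the class and method names are illustrative, not the actual registry-common implementation.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryingExecutor {

    /**
     * Runs the given action, retrying on IO errors (e.g. read timeouts)
     * with exponential backoff (1s, 2s, 4s, ...), up to maxAttempts tries.
     */
    public static <T> T callWithRetry(Callable<T> action, int maxAttempts) throws Exception {
        long backoffMillis = 1000;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, propagate the original error
                }
                Thread.sleep(backoffMillis);
                backoffMillis *= 2;
            }
        }
    }
}
```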

tloubrieu-jpl commented 12 months ago

The size of the instances has been upgraded by the SAs. Tests are still needed.

jordanpadams commented 11 months ago

Status: @alexdunnjpl is investigating these issues.

jordanpadams commented 11 months ago

Status: @sjoshi-jpl has a ticket for the SAs to increase EBS volumes to 200 GB each for the nodes that are having issues.

jordanpadams commented 11 months ago

Status: Implementation ongoing

alexdunnjpl commented 11 months ago

Status: a rudimentary fix has been applied to registry-common; select users have been instructed to retry their previously-failing jobs with the new harvest snapshot, per @jordanpadams.

Awaiting feedback before continuing

jordanpadams commented 10 months ago

Resolving per https://github.com/NASA-PDS/registry-common/pull/42 and https://github.com/NASA-PDS/registry-common/pull/43. Will re-open if the issue is identified again.