NASA-PDS / harvest

Standalone Harvest client application providing the functionality for capturing and indexing product metadata into the PDS Registry system (https://github.com/nasa-pds/registry).
https://nasa-pds.github.io/registry

Update harvest to support batches with data volumes larger than AOSS allowable limit #207

Open tloubrieu-jpl opened 2 days ago

tloubrieu-jpl commented 2 days ago

💡 Description

See the error that Irma Trejo from the ATM node encountered:

```
2024-10-29 13:38:48,617 [ERROR] LIDVID = urn:nasa:pds:juno_uvs:data_raw:uvs_eng_660055443_2020336_p30ra4_v01::1.0, Message = [parent] Data too large, data for [indices:data/write/bulk[s]] would be [2052453826/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [2050652808/1.9gb], new bytes reserved: [1801018/1.7mb], usages [request=835584/816kb, fielddata=0/0b, in_flight_requests=27620046/26.3mb]
```

This ticket should be converted into a requirement once we better understand what this enhancement exactly means.

⚔️ Parent Epic / Related Tickets

No response

alexdunnjpl commented 2 days ago

@tloubrieu-jpl @al-niessner Unless I'm wildly underestimating the potential size of a registry document, I'm guessing that harvest uses a static document count (like 50k or something) when forming bulk update requests, and is encountering a block of very large documents (probably ones with per-pixel attributes, as I've seen in GEO before).

If so, the fix is likely to send bulk updates when either the estimated request buffer size reaches some byte threshold or the document count reaches its threshold, whichever occurs first.

This is the implementation used by the sweepers to achieve that purpose.
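For illustration, here is a minimal sketch of that dual-threshold flush in Java. The class and method names are hypothetical and the threshold values are assumptions drawn from the discussion; this is not harvest's current code or the sweepers implementation.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: buffer bulk-request items and flush on whichever
// threshold is crossed first -- document count or estimated byte size.
public class SizeAwareBulkBuffer {
    private static final int MAX_DOCS = 50_000;              // assumed count threshold
    private static final long MAX_BYTES = 30L * 1024 * 1024; // ~30MB, per the tuning below

    private final List<String> items = new ArrayList<>();    // NDJSON lines for the _bulk body
    private long bufferedBytes = 0;

    public void add(String ndjsonItem) {
        // Add first, then check: a single oversized doc is still sent alone.
        items.add(ndjsonItem);
        bufferedBytes += ndjsonItem.getBytes(StandardCharsets.UTF_8).length;
        if (items.size() >= MAX_DOCS || bufferedBytes >= MAX_BYTES) {
            flush();
        }
    }

    public void flush() {
        if (items.isEmpty()) return;
        sendBulkRequest(String.join("\n", items) + "\n"); // _bulk bodies are newline-delimited
        items.clear();
        bufferedBytes = 0;
    }

    private void sendBulkRequest(String body) {
        // Placeholder: POST the body to the OpenSearch _bulk endpoint.
    }
}
```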

al-niessner commented 2 days ago

@alexdunnjpl @tloubrieu-jpl

There are two things that can happen. One, there are too many items in the bulk request. Two, an item in the bulk request is too big. I am not looking at harvest right now, but my crappy memory is telling me it only counts the number of objects going into the bulk request; it does not try to fit it by byte size.

Here is a question: if the byte-size limit for bulk requests is x and the object being injected is y, where y > x, how do we want to resolve it? Does it mean that y is simply too big for AOSS?
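For reference, byte-size fitting would mean estimating what each item contributes to the newline-delimited _bulk body before adding it to the buffer. A rough sketch, assuming Jackson is available; the class name and method signature are hypothetical:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;

public final class BulkSizeEstimator {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Estimate the bytes one document contributes to a _bulk body:
    // the action line, the source line, and the two trailing newlines.
    public static int estimate(String lidvid, Map<String, Object> fields) throws Exception {
        byte[] action = MAPPER.writeValueAsBytes(Map.of("index", Map.of("_id", lidvid)));
        byte[] source = MAPPER.writeValueAsBytes(fields);
        return action.length + source.length + 2;
    }
}
```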

alexdunnjpl commented 2 days ago

@al-niessner It's less of a limit and more of a threshold: when I tested, I found that the optimal request size was 20-30MB (in terms of diminishing returns on throughput at larger sizes).

So my recommendation would be to have it add docs to the buffer piecewise until the buffer size exceeds 30MB, then flush and repeat.

So in the case of a doc > 30MB, it would be written at whatever size it is, since each doc is added to the buffer before the total buffer size is checked against the threshold.

If there is truly a case where a single document is >1.8GB, I'd suggest that trying to stuff it into a catalog database is the wrong task to attempt in the first place.
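On the y > x question above, one option is a guard that reports a single item that could never fit in one bulk request rather than submitting it. A hedged sketch; the limit constant is an assumption taken from the circuit-breaker error in the issue description:

```java
import java.nio.charset.StandardCharsets;

public final class BulkItemGuard {
    // Assumed ceiling, from the ~1.8GB limit in the error message above.
    private static final long HARD_ITEM_LIMIT_BYTES = 1_800_000_000L;

    // Returns true if a single serialized item can never fit in one bulk
    // request and should be reported instead of submitted.
    public static boolean exceedsHardLimit(String ndjsonItem) {
        return ndjsonItem.getBytes(StandardCharsets.UTF_8).length > HARD_ITEM_LIMIT_BYTES;
    }
}
```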