NASA-PDS / registry-sweepers

Scripts that run regularly on the registry database, to clean and consolidate information
Apache License 2.0

Deploy repairkit sweeper to delta and prod #61

Closed jordanpadams closed 9 months ago

jordanpadams commented 10 months ago

💡 Description

Refs https://github.com/NASA-PDS/registry-api/issues/349

sjoshi-jpl commented 10 months ago

@alexdunnjpl are we just deploying a new registry-sweeper image to prod? Or are there other steps that need to be completed for this task?

alexdunnjpl commented 10 months ago

@sjoshi-jpl yeah, just a standard ad-hoc redeployment, then checking to make sure it executes successfully in prod

I'll push the image now

alexdunnjpl commented 10 months ago

Image is pushed, @sjoshi-jpl to confirm that tasks successfully execute.

@sjoshi-jpl do we already have a deployment targeting delta OpenSearch, or just prod?

sjoshi-jpl commented 10 months ago

@alexdunnjpl @tloubrieu-jpl after running the tasks multiple times for each domain, here are the findings:

  1. The ATM and GEO nodes are timing out with 504 Gateway Timeout errors (even after multiple tries).
  2. The IMG node took 1 hr 54 mins to complete.
  3. The SBNPSI and RMS nodes have been running for over 2 hours; neither is getting past the repairkit step.
  4. All other nodes are completing within the 1-hour window without errors.

@alexdunnjpl right now we're not running anything against the delta cluster, but we could create a task definition with the newly pushed image to test there. Does this answer your question?

tloubrieu-jpl commented 10 months ago

Some of the nodes (IMG, PSI, RMS) take too long to process.

sjoshi-jpl commented 10 months ago

Update -

  1. ATM / GEO are still returning 504 errors. ATM has an issue with a missing ScrollId.
  2. IMG is running for close to 2 hours.
  3. All other tasks completed in under 1 hour.

jordanpadams commented 10 months ago

@nutjob4life can you chat with @sjoshi-jpl and try to help debug the 504 issues he is seeing on those 2 registries?

nutjob4life commented 10 months ago

@jordanpadams will do. @sjoshi-jpl, I'll hit you up on Slack

nutjob4life commented 10 months ago

FYI, met with @sjoshi-jpl to debug and brainstorm what's going on here. We decided to use the previous image with ATM and GEO (although those images were untagged in ECR, they thankfully had unique URIs, and the AWS task definition service lets you specify an image by URI) and manually launched the sweepers for those two nodes.

And they worked fine. So the issue seems to be related to the RepairKit additions. I'm going to be reviewing those commits with a closer eye.

alexdunnjpl commented 10 months ago

@nutjob4life I'm like... >80% sure that the issue would be resolved by streaming updates through the bulk write call and letting the write function handle flushing the writes rather than making one bulk write call per doc update.

That, however, requires an update to the interface of that function - it really should take an iterable of update objects/dicts, to allow one to throw a lazy/generator expression at it. Minor changes to the other two sweepers will be necessary to reflect such a change, which is why I didn't just do it as a quick addendum to #54

Happy to take that on if that's easier as I'm waiting on comms for my other high-priority ticket.
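A minimal sketch of the interface change described above, assuming a hypothetical `flush` callable standing in for the actual bulk-write request (names and the batch size are illustrative, not the project's API):

```python
from typing import Callable, Dict, Iterable, List


def write_updated_docs(updates: Iterable[Dict],
                       flush: Callable[[List[Dict]], None],
                       batch_size: int = 500) -> int:
    """Stream updates into bulk batches rather than one bulk call per doc.

    Accepting any iterable lets callers pass a generator expression, so the
    full update set never has to be materialized in memory.  Returns the
    total number of updates consumed.
    """
    buffer: List[Dict] = []
    total = 0
    for update in updates:
        buffer.append(update)
        total += 1
        if len(buffer) >= batch_size:
            flush(buffer)   # one bulk request per batch_size updates
            buffer = []
    if buffer:
        flush(buffer)       # flush the final partial batch
    return total
```

With this shape, the repairkit sweeper can hand a lazy generator of per-doc fixes straight to the writer, and the other two sweepers only need a small change to pass iterables instead of issuing their own writes.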

sjoshi-jpl commented 10 months ago

@nutjob4life @alexdunnjpl Since last week, the PSA node has been throwing CPU/memory alerts and consuming most of the compute allocated to the task. I increased it from 1 vCPU / 4 GB to 2 vCPU / 16 GB, but memory utilization is still over 95%.

tloubrieu-jpl commented 10 months ago

@nutjob4life tried the OpenSearch Python bulk API without success.
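For reference, the bulk API in question is presumably `opensearchpy.helpers.bulk`, which consumes an iterable of action dicts. A hedged sketch of building partial-update actions for it (the index name and fields are illustrative):

```python
from typing import Dict, Iterable, Tuple


def update_actions(index: str,
                   updates: Iterable[Tuple[str, Dict]]) -> Iterable[Dict]:
    """Yield bulk 'update' actions from (doc_id, partial_doc) pairs.

    The resulting generator can be handed to opensearchpy.helpers.bulk()
    without materializing all actions in memory, e.g.:

        helpers.bulk(client, update_actions("registry", updates))
    """
    for doc_id, partial in updates:
        yield {
            "_op_type": "update",  # partial update, not a full reindex
            "_index": index,
            "_id": doc_id,
            "doc": partial,
        }
```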

sjoshi-jpl commented 10 months ago

Per conversation with team yesterday, @alexdunnjpl @nutjob4life will be implementing bulk update changes after which we will need to re-test all nodes to ensure the issues with ATM, GEO and PSA are resolved.

sjoshi-jpl commented 10 months ago

Update:

After testing the bulk-update changes, the ATM node is completing successfully.

PSA: still needs 4 vCPU / 30 GB RAM to complete.
GEO: running for longer than 3 hours; I had to stagger the task to run every 5 hours for it to complete.

sjoshi-jpl commented 10 months ago

I've opened DSIO #4457 to enable slow logs to help with further troubleshooting.

tloubrieu-jpl commented 9 months ago

#70 is going to be the solution for this ticket.

tloubrieu-jpl commented 9 months ago

Some errors remain; @sjoshi-jpl and @alexdunnjpl will discuss them.

tloubrieu-jpl commented 9 months ago

Remaining errors are due to a lack of resources on ECS.

alexdunnjpl commented 9 months ago

Clarification: the ATM/GEO errors are suspected to be due to insufficient ECS instance sizing. @sjoshi-jpl has submitted an SA ticket to resize, the SAs have actioned it, and results should be available by COB today.

alexdunnjpl commented 9 months ago

GEO errors (and probably ATM's - need to confirm) have been narrowed down to the documents being huge compared to other nodes': 1000 docs return ~45 MB, so the default page size of 10,000 docs causes internal overflows.

[2023-09-19T16:29:46,799][WARN ][r.suppressed             ] [2a6f484c833c0bd8c7f96d4b9c4475f6] path: __PATH__ params: {size=10000, scroll=10m, index=registry, _source_excludes=, _source_includes=}
java.lang.ArithmeticException: integer overflow
    at __PATH__(Math.java:909)
    at org.apache.lucene.util.UnicodeUtil.maxUTF8Length(UnicodeUtil.java:618)
    at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:84)
    at org.opensearch.common.bytes.BytesArray.<init>(BytesArray.java:50)
    at org.opensearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:86)
__AMAZON_INTERNAL__
__AMAZON_INTERNAL__
    at org.opensearch.rest.RestController$ResourceHandlingHttpChannel.sendResponse(RestController.java:518)
    at org.opensearch.rest.action.RestResponseListener.processResponse(RestResponseListener.java:50)
    at org.opensearch.rest.action.RestActionListener.onResponse(RestActionListener.java:60)
    at org.opensearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:110)
    at org.opensearch.rest.action.RestCancellableNodeClient$1.onResponse(RestCancellableNodeClient.java:104)
    at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:103)
    at org.opensearch.action.support.TransportAction$1.onResponse(TransportAction.java:97)
    at org.opensearch.performanceanalyzer.action.PerformanceAnalyzerActionListener.onResponse(PerformanceAnalyzerActionListener.java:76)
    at org.opensearch.action.support.TimeoutTaskCancellationUtility$TimeoutRunnableListener.onResponse(TimeoutTaskCancellationUtility.java:106)
    at org.opensearch.action.ActionListener$5.onResponse(ActionListener.java:262)
    at org.opensearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:574)
    at org.opensearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:132)
    at org.opensearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:377)
    at org.opensearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:371)
    at org.opensearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:243)
    at org.opensearch.action.search.FetchSearchPhase.lambda$innerRun$1(FetchSearchPhase.java:125)
    at org.opensearch.action.search.CountedCollector.countDown(CountedCollector.java:64)
    at org.opensearch.action.search.ArraySearchPhaseResults.consumeResult(ArraySearchPhaseResults.java:59)
    at org.opensearch.action.search.CountedCollector.onResult(CountedCollector.java:72)
    at org.opensearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:195)
    at org.opensearch.action.search.FetchSearchPhase$2.innerOnResponse(FetchSearchPhase.java:190)
    at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:58)
    at org.opensearch.action.search.SearchActionListener.onResponse(SearchActionListener.java:42)
    at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:67)
    at org.opensearch.action.search.SearchTransportService$ConnectionCountingHandler.handleResponse(SearchTransportService.java:413)
    at org.opensearch.transport.TransportService$6.handleResponse(TransportService.java:658)
    at org.opensearch.security.transport.SecurityInterceptor$RestoringTransportResponseHandler.handleResponse(SecurityInterceptor.java:306)
    at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1207)
    at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:266)
    at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:258)
    at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:146)
    at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102)
    at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:713)
    at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:155)
    at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:130)
    at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:95)
    at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:87)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1533)
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1282)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1329)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at __PATH__(Thread.java:829)
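
A back-of-envelope reading of the trace (my interpretation, using the average sizes quoted above as an assumption): Lucene's `UnicodeUtil.maxUTF8Length` multiplies the UTF-16 length by 3 (the worst-case bytes per char), so any single response longer than `Integer.MAX_VALUE / 3` chars throws the `ArithmeticException`. Average-sized docs leave only ~1.5x headroom at size=10000, and doc sizes are skewed, so pages containing the largest docs cross the limit:

```python
# Worst-case UTF-8 expansion used by Lucene: 3 bytes per UTF-16 char, so
# the ArithmeticException fires once utf16_length * 3 > Integer.MAX_VALUE.
INT_MAX = 2**31 - 1
overflow_threshold_chars = INT_MAX // 3          # ~715M chars

# Observed for the big-document nodes: ~45 MB per 1000 docs (average),
# treated here as roughly 1 char per byte -- an assumption.
avg_doc_chars = 45 * 1024 * 1024 // 1000         # ~47K chars/doc

page_chars = 10_000 * avg_doc_chars              # ~472M chars at size=10000
headroom = overflow_threshold_chars / page_chars  # only ~1.5x on average
```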

There are a few potential options for resolution:

  1. Size the repairkit scroll page according to the big-document node constraints. In theory this slows down all sweepers, but it shouldn't be an issue, as it only affects the work done on products harvested since the last sweepers run. Near-zero implementation effort/time.

  2. Incorporate dynamic page sizing into the retry backoff. This would improve resilience, but introduces potential for future confusion when a dev thinks 10k-doc pages are being requested while the page size is being dynamically adjusted under the hood and the pages chained together. This shouldn't be a first resort, imho.

  3. Add MAX_FULL_DOC_REQUEST_COUNT or similar as an env var or CLI argument, which would (if present) constrain the page size for the relevant sweepers. This allows more targeted constraint than the first option, but adds a little complexity and requires a little dev effort to do cleanly, which I don't think is justified by the theoretical benefit over the first option.

I'll implement option 1 after testing properly against GEO and ATM, and we can re-visit later if additional flexibility is needed.
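Option 1 might look something like the following sketch: paging an OpenSearch scroll with a conservative fixed size. The `500` and the duck-typed `client` (opensearch-py-style `search`/`scroll` methods) are assumptions for illustration, not the project's actual values:

```python
def scroll_documents(client, index="registry", page_size=500, scroll="10m"):
    """Yield all docs from `index`, page by page, with a reduced page size.

    Keeping page_size small bounds the per-response payload so that even
    nodes with very large documents stay well under the response-size
    limit that triggered the integer overflow above.
    """
    resp = client.search(index=index, scroll=scroll, size=page_size,
                         body={"query": {"match_all": {}}})
    while True:
        hits = resp["hits"]["hits"]
        if not hits:
            break                    # an empty page means the scroll is done
        yield from hits
        resp = client.scroll(scroll_id=resp["_scroll_id"], scroll=scroll)
```

Since the scroll chains pages transparently, the sweeper logic downstream is unchanged; only the per-request payload shrinks.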

alexdunnjpl commented 9 months ago

@sjoshi-jpl initial run of the sweepers against a 2M-product database should be on the order of 4hrs, says my napkin, so expect a period of container execution timeout failures. They should resolve by tomorrow or the next day, though.