NOAA-OWP / wres

Code and scripts for the Water Resources Evaluation Service

As a user, I don't expect a job to report success while its stdout is incomplete #264

Open · epag opened 1 month ago

epag commented 1 month ago

Author Name: Hank (Hank) Original Redmine Issue: 94134, https://vlab.noaa.gov/redmine/issues/94134 Original Date: 2021-07-13


See #93960-90. This affects the jobs from the 3rd iteration of the Test B runs documented in #93852.

If you run the following command on the -ti tasker Docker logs:

docker logs fb7e51f7e5bb | grep "handleException" | grep "StacklessClosedChannelException"

You will see many instances such as the following (showing only the last few instances):

...
2021-07-12T03:37:04.317+0000 [qtp1680365850-77] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/9185292187278467050/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:09.299+0000 [qtp1680365850-78] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/9130188505940267849/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:14.317+0000 [qtp1680365850-74] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/5056259717101377567/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:19.301+0000 [qtp1680365850-78] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/9185292187278467050/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:24.301+0000 [qtp1680365850-78] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/9130188505940267849/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:29.299+0000 [qtp1680365850-74] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/5056259717101377567/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:34.299+0000 [qtp1680365850-77] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/9185292187278467050/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:39.309+0000 [qtp1680365850-78] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/9130188505940267849/status io.netty.channel.StacklessClosedChannelException
2021-07-12T03:37:44.312+0000 [qtp1680365850-78] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/5056259717101377567/status io.netty.channel.StacklessClosedChannelException
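
If it helps, a small variation on the command above (a sketch; same container ID, with 2>&1 added in case the warnings go to stderr rather than stdout) tallies how many of these warnings each job's status endpoint produced:

docker logs fb7e51f7e5bb 2>&1 | grep "handleException" | grep "StacklessClosedChannelException" | grep -o '/job/[0-9]*/status' | sort | uniq -c | sort -rn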

For each job mentioned above, the status appears to be success, but the stdout logging is incomplete.

It's unclear to me whether this is related to what was reported in #93678 and resolved in #93685. If so, this may already be resolved. However, I note that, in #93678, the job was stuck "IN_PROGRESS". In this case, all of the jobs above report a status of "COMPLETED_REPORTED_SUCCESS", but the stdout is partial. That difference is why I created this ticket; I'm not sure whether it's important.

I'll include some more complete messages from the Docker logs in the first comment. Thanks,

Hank


Related issue(s): #148 Redmine related issue(s): 122858


epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-07-13T14:05:44Z


Example complete exception message from the docker logs for the tasker:

2021-07-10T07:26:47.298+0000 [qtp1680365850-71] WARN org.eclipse.jetty.server.HttpChannelState - unhandled due to prior sendError
jakarta.servlet.ServletException: org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@54915189 [redisClient=[addr=redis://persister:6379], channel=[id: 0x28ee40c2, L:/172.19.254.199:51096 ! R:persister/172.19.254.194:6379], currentCommand=null], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); if value == false then return nil; end; local t,..., 5, jobMetadataById, redisson__timeout__set:{jobMetadataById}, redisson__idle__set:{jobMetadataById}, redisson__map_cache__last_access__set:{jobMetadataById}, {jobMetadataById}:redisson_options, 1625902002503, PooledUnsafeDirectByteBuf(ridx: 0, widx: 22, cap: 256)] after 3 retry attempts
        at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:410)
        at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:346)
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:366)
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:319)
        at org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:205)
        at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:508)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221)
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1370)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176)
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:463)
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174)
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1292)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
        at org.eclipse.jetty.server.handler.ResourceHandler.handle(ResourceHandler.java:243)
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122)
        at org.eclipse.jetty.server.Server.handle(Server.java:562)
        at org.eclipse.jetty.server.HttpChannel.lambda$handle$0(HttpChannel.java:399)
        at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:656)
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:391)
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:282)
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:319)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
        at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:538)
        at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:387)
        at org.eclipse.jetty.io.ssl.SslConnection$2.succeeded(SslConnection.java:161)
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100)
        at org.eclipse.jetty.io.SocketChannelEndPoint$1.run(SocketChannelEndPoint.java:101)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168)
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126)
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:378)
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:894)
        at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1038)
        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.redisson.client.WriteRedisConnectionException: Unable to write command into connection! Node source: NodeSource [slot=0, addr=null, redisClient=null, redirect=null, entry=null], connection: RedisConnection@54915189 [redisClient=[addr=redis://persister:6379], channel=[id: 0x28ee40c2, L:/172.19.254.199:51096 ! R:persister/172.19.254.194:6379], currentCommand=null], command: (EVAL), params: [local value = redis.call('hget', KEYS[1], ARGV[2]); if value == false then return nil; end; local t,..., 5, jobMetadataById, redisson__timeout__set:{jobMetadataById}, redisson__idle__set:{jobMetadataById}, redisson__map_cache__last_access__set:{jobMetadataById}, {jobMetadataById}:redisson_options, 1625902002503, PooledUnsafeDirectByteBuf(ridx: 0, widx: 22, cap: 256)] after 3 retry attempts
        at org.redisson.command.RedisExecutor.checkWriteFuture(RedisExecutor.java:278)
        at org.redisson.command.RedisExecutor.access$100(RedisExecutor.java:60)
        at org.redisson.command.RedisExecutor$1.operationComplete(RedisExecutor.java:161)
        at org.redisson.command.RedisExecutor$1.operationComplete(RedisExecutor.java:158)
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578)
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552)
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491)
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616)
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609)
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:1021)
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:882)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
        at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
        at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        ... 1 common frames omitted
Caused by: io.netty.channel.StacklessClosedChannelException: null
        at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source)
2021-07-10T07:26:52.207+0000 [qtp1680365850-75] WARN org.eclipse.jetty.server.HttpChannel - handleException /job/5651580292139938264/status io.netty.channel.StacklessClosedChannelException

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-07-13T14:08:50Z


To be clear, in this case, the job is reporting success, but the stdout is incomplete. For #93678, the job was reporting IN_PROGRESS. That is one difference, which is why I created this ticket. I've updated the Description to clarify,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-07-14T12:30:07Z


See #93960-103. Again, I noted an evaluation that completed successfully, with a status showing success but a partial stdout. I believe it's related to the same problem I noted for HEFS evaluations in the Description of this ticket. At least, it has the same base symptom.

Looking more closely at the tasker and redis logs...

For the run in #93960-103, I see the following exceptions in the tasker logs:

2021-07-13T21:31:09.096+0000 [redisson-timer-4-1] ERROR org.redisson.client.handler.PingConnectionHandler - Unable to send PING command over channel: [id: 0x3a471bcc, L:/172.19.254.199:35210 - R:persister/172.19.254.194:6379]
org.redisson.client.RedisException: MISCONF Redis is configured to save RDB snapshots, but it is currently not able to persist on disk. Commands that may modify the data set are disabled, because this instance is configured to report errors during writes if RDB snapshotting fails (stop-writes-on-bgsave-error option). Please check the Redis logs for details about the RDB error.. channel: [id: 0x3a471bcc, L:/172.19.254.199:35210 - R:persister/172.19.254.194:6379] command: (PING), params: []
        at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:345)
        at org.redisson.client.handler.CommandDecoder.decodeCommand(CommandDecoder.java:177)
        at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:116)
        at org.redisson.client.handler.CommandDecoder.decode(CommandDecoder.java:101)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507)
        at io.netty.handler.codec.ReplayingDecoder.callDecode(ReplayingDecoder.java:366)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:655)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:581)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:829)
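
For what it's worth, the MISCONF error points at the RDB snapshot failing on the persister side. A few redis-cli checks along these lines (a sketch; it assumes the persister container is reachable under the name "persister" and has redis-cli available) would show whether background saves are failing, where the snapshot is written, and whether the write guard is enabled:

docker exec persister redis-cli INFO persistence | grep -E 'rdb_last_bgsave_status|rdb_last_save_time|rdb_changes_since_last_save'
docker exec persister redis-cli CONFIG GET dir
docker exec persister redis-cli CONFIG GET stop-writes-on-bgsave-error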

That exception appears many times within a short period. *Before* it appears, I see a steady flow of messages referring to STDOUT, output, and STDERR. For example:

2021-07-13T21:29:43.019+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:29:45.527+0000 [pool-2-thread-179] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDOUT
2021-07-13T21:29:45.730+0000 [pool-2-thread-178] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 output
2021-07-13T21:29:58.019+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:30:00.530+0000 [pool-2-thread-179] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDOUT
2021-07-13T21:30:00.731+0000 [pool-2-thread-178] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 output
2021-07-13T21:30:13.021+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:30:28.031+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:30:43.032+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR

*After* the exception messages, I only see STDERR messages until, eventually, it reports that it's finished waiting for the job:

2021-07-13T21:31:13.036+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:31:28.036+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:31:43.037+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:31:58.038+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:32:13.039+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:32:28.039+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:32:43.040+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-13T21:32:58.041+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
... SNIP ...
2021-07-14T00:02:58.848+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-14T00:03:13.849+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-14T00:03:28.850+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Still waiting for job 1432704281539932446 STDERR
2021-07-14T00:03:31.959+0000 [pool-1-thread-10] INFO wres.tasker.JobResultConsumer - Heard a message, consumerTag: amq.ctag-FBJv__FucjhPrneVUQqk1g, envelope: Envelope(deliveryTag=1, redeliver=false, exchange=wres.job.status, routingKey=job.1432704281539932446.exitCode), properties: #contentHeader<basic>(content-type=null, content-encoding=null, headers=null, delivery-mode=null, priority=null, correlation-id=1432704281539932446, reply-to=null, expiration=null, message-id=null, timestamp=null, type=null, user-id=null, app-id=null, cluster-id=null), message: []
2021-07-14T00:03:32.663+0000 [pool-2-thread-181] INFO wres.tasker.JobStatusSharer - Ignoring report DEAD because other mechanisms should handle it.
2021-07-14T00:03:43.853+0000 [pool-2-thread-184] INFO wres.tasker.JobMessageHelper - Finished waiting for job 1432704281539932446 STDERR
2021-07-14T00:03:47.665+0000 [pool-2-thread-181] INFO wres.tasker.JobMessageHelper - Finished waiting for job 1432704281539932446 status
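
To see exactly where the STDOUT and output listeners went quiet for this job, a per-job grep like this (a sketch; container ID copied from the Description) pulls the full timeline of JobMessageHelper messages:

docker logs fb7e51f7e5bb 2>&1 | grep "waiting for job 1432704281539932446"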

Throughout those messages, I see additional "Unable to send PING" exceptions, including at the very end of the log.

Scanning higher up in the tasker log, I see that same exception interspersed among other problematic evaluations. Apparently, I overlooked it before.
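
To line those PING failures up against the affected evaluations, a grep along these lines (a sketch; tasker container ID copied from the Description) would list every occurrence with its timestamp:

docker logs fb7e51f7e5bb 2>&1 | grep "Unable to send PING command" | awk '{print $1}'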

Since apparently I'm not very good at examining these logs, I have attached both the tasker log (gzipped) and redis log for y'all to look over.

Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Hank (Hank) Original Date: 2021-07-14T13:45:09Z


The problem occurred for the middle of 3 jobs for MARFC, as noted in #94005-15. In that case, there were no reported exceptions in the tasker log. I've attached the last 1000 lines of the tasker log, which include all 3 MARFC jobs. Be sure to scroll down to the messages starting at around 12:30Z; here are the first few lines related to the MARFC evaluations:

2021-07-14T12:31:04.215+0000 [qtp1680365850-1300] INFO wres.tasker.WresJob - Sent a message to queue 'wres.job' with properties '#contentHeader<basic>(content-type=null, content-encoding=null, headers=null, delivery-mode=null, priority=null, correlation-id=3137993846854910477, reply-to=wres.job.status, expiration=null, message-id=null, timestamp=null, type=null, user-id=null, app-id=null, cluster-id=null)'
2021-07-14T12:31:04.229+0000 [pool-2-thread-196] INFO wres.tasker.JobResults - Looking for exit code on topic job.3137993846854910477.exitCode
2021-07-14T12:31:19.227+0000 [pool-2-thread-197] INFO wres.tasker.JobMessageHelper - Still waiting for job 3137993846854910477 STDERR
2021-07-14T12:31:19.235+0000 [pool-2-thread-198] INFO wres.tasker.JobMessageHelper - Still waiting for job 3137993846854910477 output
2021-07-14T12:31:25.108+0000 [pool-2-thread-199] INFO wres.tasker.JobMessageHelper - Still waiting for job 3137993846854910477 STDOUT
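
For anyone looking at the attachment, the same lines can also be pulled straight from the container (a sketch; the container ID is copied from the Description and may differ if the tasker has since been redeployed):

docker logs --tail 1000 fb7e51f7e5bb 2>&1 | grep 3137993846854910477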

Thanks,

Hank

epag commented 1 month ago

Original Redmine Comment Author Name: Jesse (Jesse) Original Date: 2021-07-19T17:13:17Z


If redis could not persist to disk, that raises the question of which disk it could not write to. That would be /mnt/wres_share/job_data, which is an NFS mount. I wonder if this also coincided with the /home mount issues. They don't appear to be the same IP address, so they might or might not be the same appliance.
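
A quick way to check whether the two mounts are backed by the same appliance might be something like this on the host (a sketch; paths taken from the comment above):

findmnt -T /mnt/wres_share/job_data -o SOURCE,FSTYPE,TARGET
findmnt -T /home -o SOURCE,FSTYPE,TARGET
dmesg | grep -i nfs | tail -n 20

The first two commands show the server:path behind each mount; the last just checks the kernel log for recent NFS errors.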