OpenLiberty / open-liberty

On disk transaction logs recovery failing since 23.0.0.9 #27211

tschurins commented 10 months ago

Our JEE application runs on Open Liberty and is deployed in Docker and Kubernetes. We use XA transactions and have configured the transaction logs to be saved on a shared disk. When stopping pods and starting new ones, we see a lot of errors popping up in the console. Note that everything was working correctly in version 23.0.0.8. Note also that when we start new pods, they have a new identity.

Here is the console log. First, the successful recovery of the existing logs:

[12/20/23 12:25:49:554 GMT] 00000063 id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0011I: Performing recovery processing for a peer WebSphere server (1ea2d7a4be0e).
[12/20/23 12:25:49:554 GMT] 00000063 id=         com.ibm.tx.jta.impl.TxRecoveryAgentImpl                      A WTRN0108I: Recovery initiated for server 1ea2d7a4be0e
[12/20/23 12:25:49:569 GMT] 0000007f id=         com.ibm.tx.jta.impl.RecoveryManager                          I WTRN0135I: Transaction service recovering no transactions.
[12/20/23 12:25:49:588 GMT] 00000063 id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0013I: All persistent services have been directed to perform recovery processing for a peer WebSphere server (1ea2d7a4be0e).

Followed by these errors:

[12/20/23 12:26:49:583 GMT] 0000007e id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0011I: Performing recovery processing for a peer WebSphere server (1ea2d7a4be0e).
[12/20/23 12:26:49:584 GMT] 0000007e id=         com.ibm.tx.jta.impl.TxRecoveryAgentImpl                      A WTRN0108I: Recovery initiated for server 1ea2d7a4be0e
[12/20/23 12:26:49:632 GMT] 00000082 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog 1593" at ffdc_23.12.20_12.26.49.0.log
[12/20/23 12:26:49:652 GMT] 00000082 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.keypoint 1373" at ffdc_23.12.20_12.26.49.1.log
[12/20/23 12:26:49:653 GMT] 00000082 id=         com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog             A CWRLS0008E: Recovery log is being marked as failed. [ 1 transaction ]
[12/20/23 12:26:49:653 GMT] 00000082 id=         com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog             I CWRLS0009E: Details of recovery log failure: com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog
    at com.ibm.ws.recoverylog.spi.LogHandle.keypointStarting(LogHandle.java:689)
    at com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.keypoint(MultiScopeRecoveryLog.java:1592)
    at com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.recoveryComplete(MultiScopeRecoveryLog.java:795)
    at com.ibm.ws.recoverylog.spi.RecoveryLogImpl.recoveryComplete(RecoveryLogImpl.java:90)
    at com.ibm.tx.jta.impl.RecoveryManager.run(RecoveryManager.java:2164)
    at java.lang.Thread.run(Thread.java:826)

[12/20/23 12:26:49:751 GMT] 00000082 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "java.lang.Exception com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.provideServiceability 2469" at ffdc_23.12.20_12.26.49.2.log
[12/20/23 12:26:49:766 GMT] 00000082 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "java.lang.Exception com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.provideServiceability 2471" at ffdc_23.12.20_12.26.49.3.log
[12/20/23 12:26:49:785 GMT] 00000082 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog 805" at ffdc_23.12.20_12.26.49.4.log
[12/20/23 12:26:49:799 GMT] 00000082 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.recoveryComplete 692" at ffdc_23.12.20_12.26.49.5.log
[12/20/23 12:26:49:821 GMT] 00000082 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog com.ibm.tx.jta.impl.RecoveryManager.run 1866" at ffdc_23.12.20_12.26.49.6.log
[12/20/23 12:26:49:822 GMT] 00000082 id=         com.ibm.tx.jta.impl.RecoveryManager                          E WTRN0026E: Exception caught during keypoint after recovery! com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog
    at com.ibm.ws.recoverylog.spi.LogHandle.keypointStarting(LogHandle.java:689)
    at com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.keypoint(MultiScopeRecoveryLog.java:1592)
    at com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.recoveryComplete(MultiScopeRecoveryLog.java:795)
    at com.ibm.ws.recoverylog.spi.RecoveryLogImpl.recoveryComplete(RecoveryLogImpl.java:90)
    at com.ibm.tx.jta.impl.RecoveryManager.run(RecoveryManager.java:2164)
    at java.lang.Thread.run(Thread.java:826)

[12/20/23 12:26:49:824 GMT] 00000082 id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0014I: Halting any current recovery processing for a peer WebSphere server (1ea2d7a4be0e).
[12/20/23 12:26:49:837 GMT] 0000007e id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InvalidFailureScopeException com.ibm.tx.jta.impl.RecoveryManager.recoveryFailed 1547" at ffdc_23.12.20_12.26.49.7.log
[12/20/23 12:26:49:838 GMT] 0000007e id=         com.ibm.tx.jta.impl.TxRecoveryAgentImpl                      E CWRLS0024_EXC_DURING_RECOVERY 
                                                                                                               Home server recovery failed in peer environment
[12/20/23 12:26:49:839 GMT] 0000007e id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              A WTRN0108I: HADB Peer Recovery failed for server with recovery identity 1ea2d7a4be0e

Every minute after this first failure, new entries appear in the console:

[12/20/23 12:27:49:843 GMT] 00000080 id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0011I: Performing recovery processing for a peer WebSphere server (1ea2d7a4be0e).
[12/20/23 12:27:49:843 GMT] 00000080 id=         com.ibm.tx.jta.impl.TxRecoveryAgentImpl                      A WTRN0108I: Recovery initiated for server 1ea2d7a4be0e
[12/20/23 12:27:49:844 GMT] 00000085 id=         com.ibm.tx.jta.impl.RecoveryManager                          E WTRN0112E: An unexpected error occured whilst opening the recovery log. The log configuration was MultiScopeRecoveryLog:serverName=1ea2d7a4be0e:clientName=transaction:clientVersion=1:logName=tranlog:logDirectory=/opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog:logIdentifier=1 @-998850140 [FileFailureScope: 1ea2d7a4be0e [1449779843]]
[12/20/23 12:27:49:853 GMT] 00000085 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog 1135" at ffdc_23.12.20_12.27.49.0.log
[12/20/23 12:27:49:862 GMT] 00000085 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.closeLogImmediate 1173" at ffdc_23.12.20_12.27.49.1.log
[12/20/23 12:27:49:871 GMT] 00000085 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/partnerlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog 1135" at ffdc_23.12.20_12.27.49.2.log
[12/20/23 12:27:49:878 GMT] 00000085 id=         com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/partnerlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.closeLogImmediate 1173" at ffdc_23.12.20_12.27.49.3.log
[12/20/23 12:27:49:881 GMT] 00000085 id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0014I: Halting any current recovery processing for a peer WebSphere server (1ea2d7a4be0e).
[12/20/23 12:27:49:881 GMT] 00000080 id=         com.ibm.tx.jta.impl.TxRecoveryAgentImpl                      E CWRLS0024_EXC_DURING_RECOVERY 
                                                                                                               Home server recovery failed in peer environment
[12/20/23 12:27:49:881 GMT] 00000080 id=         com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              A WTRN0108I: HADB Peer Recovery failed for server with recovery identity 1ea2d7a4be0e

Here is the transaction log configuration:

<transaction clientInactivityTimeout="12"
             totalTranLifetimeTimeout="15"
             recoveryGroup="jta.recoveryGroup" recoveryIdentity="${HOSTNAME}"
             transactionLogDirectory="/opt/ol/wlp/usr/shared/tranlog/tl-${HOSTNAME}/"/>
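
With this configuration, each pod gets its own recovery log set on the shared disk, keyed by its hostname. A hypothetical listing of the share (the second pod id is invented for illustration; the tranlog/partnerlog layout matches the paths in the errors above):

# Illustrative only: pod id 8f3c0a12d9b7 is made up
$ ls /opt/ol/wlp/usr/shared/tranlog/
tl-1ea2d7a4be0e  tl-8f3c0a12d9b7
$ ls /opt/ol/wlp/usr/shared/tranlog/tl-1ea2d7a4be0e/
partnerlog  tranlog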

Attached are the FFDC files: [ffdc.zip](https://github.com/OpenLiberty/open-liberty/files/13727365/ffdc.zip)

tschurins commented 10 months ago

I also have some logs with an additional exception, but I have no information about how to reproduce this one:

[2023-12-04T12:33:05.022+0000] 00000099 id=00000000 com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0011I: Performing recovery processing for a peer WebSphere server (nwgcbisqm156f64785fc4rffl).
[2023-12-04T12:33:05.023+0000] 00000099 id=00000000 com.ibm.tx.jta.impl.TxRecoveryAgentImpl                      A WTRN0108I: Recovery initiated for server nwgcbisqm156f64785fc4rffl
[2023-12-04T12:33:05.063+0000] 00000194 id=00000000 com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-nwg-cbis-qm1-56f64785fc-4rffl/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog 1593" at ffdc_23.12.04_12.33.05.0.log
[2023-12-04T12:33:05.094+0000] 00000194 id=00000000 com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-nwg-cbis-qm1-56f64785fc-4rffl/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.keypoint 1373" at ffdc_23.12.04_12.33.05.1.log
[2023-12-04T12:33:05.094+0000] 00000194 id=00000000 com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog             A CWRLS0008E: Recovery log is being marked as failed. [ 1 transaction ]
[2023-12-04T12:33:05.094+0000] 00000194 id=00000000 com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog             I CWRLS0009E: Details of recovery log failure: com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-nwg-cbis-qm1-56f64785fc-4rffl/tranlog
    at com.ibm.ws.recoverylog.spi.LogHandle.keypointStarting(LogHandle.java:689)
    at com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.keypoint(MultiScopeRecoveryLog.java:1592)

[2023-12-04T12:33:05.239+0000] 00000194 id=00000000 com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              I CWRLS0014I: Halting any current recovery processing for a peer WebSphere server (nwgcbisqm156f64785fc4rffl).
[2023-12-04T12:33:05.273+0000] 00000099 id=00000000 com.ibm.ws.logging.internal.impl.IncidentImpl                I FFDC1015I: An FFDC Incident has been created: "java.lang.NullPointerException com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl 1662" at ffdc_23.12.04_12.33.05.7.log
[2023-12-04T12:33:05.273+0000] 00000099 id=00000000 com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl              A WTRN0108I: HADB Peer Recovery failed for server with recovery identity nwgcbisqm156f64785fc4rffl with exception java.lang.NullPointerException

Looking at the FFDC file, the stack trace is this:

Stack Dump = java.lang.NullPointerException
    at com.ibm.tx.jta.impl.TxRecoveryAgentImpl.initiateRecovery(TxRecoveryAgentImpl.java:532)
    at com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl.directInitialization(RecoveryDirectorImpl.java:612)
    at com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl.peerRecoverServers(RecoveryDirectorImpl.java:1651)
    at com.ibm.tx.jta.impl.LeaseTimeoutManager$LeaseChecker.alarm(LeaseTimeoutManager.java:157)
    at com.ibm.tx.jta.util.alarm.AlarmManagerImpl$AlarmListenerWrapper.run(AlarmManagerImpl.java:75)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:826)

mogorzel commented 10 months ago

Maybe somehow related to https://github.com/OpenLiberty/open-liberty/issues/26832, handled by @neilgsyoung, since that issue also seems to have started with 23.0.0.9.

neilgsyoung commented 10 months ago

Hi,

Yes, I did make changes in this area between 23.0.0.8 and 23.0.0.9 and would like to help work out what's going wrong.

-> Is the problem reproducible? If so, please could you restart the server (pod) with trace enabled? A trace string specified like this,

<logging traceSpecification="Transaction=all:*=info=enabled"/>

in your server.xml should provide the information I need in a trace.log file in your server's logs directory.

-> Do you have the full messages.log file for the run that you took those snippets from? If you could attach that here, I'd like to take a look. There's something curious going on between local recovery (the recovery of the logs that the server owns) and peer recovery (where the server attempts to recover the logs of inactive, probably old, servers).

-> For completeness, would you be able to list the contents of your lease directory? That's the directory in the filesystem that holds a file for each server known to the environment. I think, with your configuration, the filesystem directory will be /opt/ol/wlp/usr/shared/leases/jta.recoveryGroup
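
For example, a listing along these lines (the path is the one inferred above; the file details are illustrative) should show one lease file per server known to the group:

# Illustrative output: file size, owner and date are made up
$ ls -l /opt/ol/wlp/usr/shared/leases/jta.recoveryGroup
-rw-r--r-- 1 default root 86 Dec 20 12:25 1ea2d7a4be0e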

Thank you.

tschurins commented 9 months ago

log1.zip

Hello, we tried to reproduce with more logs - here is what we achieved. I hope it will be enough to point you in the right direction...

neilgsyoung commented 8 months ago

Thank you - that definitely takes us forward...

-> The NullPointerException is ugly but should be fixed in the most recent code.

-> That lease directory image shows, I think, that the code is failing to tidy up after servers have shut down. New server invocations should be recovering "old" logs, deleting their transaction and partner log files, and then deleting the appropriate file in the lease directory. We don't expect to see multiple entries like that.

-> From the trace file, I can see the server attempting to recover the logs belonging to an "old" server. The attempt to do that repeatedly fails; something significant happened the first time the server tried to recover the "old" logs, and they have been marked as "failed". I think there are a couple of things we can do here to improve the code,

(A) In this version of the code, the failure to recover the logs of one server means that we don't attempt to recover any others.
(B) If the peer logs have been marked as "failed", there is no point attempting to process them again and again.

-> I'm really keen to learn what happened when the new server first attempted to recover the logs of the peer, i.e. to see what caused the server to decide they should be "failed". Do you have any earlier trace or message logs? Startup trace would be ideal.

Thanks.

tschurins commented 8 months ago

Hello, here are some more logs: cbis-cbis-69595fff76-logs.zip

I hope that this will be enough to shed some light on this issue :)

neilgsyoung commented 8 months ago

Thank you - I can make progress. I can immediately see one clear defect, so I've opened issue #27911 to address that.

Server startup is the key to understanding what's going on; the messages.log shows the defective behaviour at a gross level, but unfortunately the trace files only show "info"-level messages. We've lost the detail provided by that trace string...

<logging traceSpecification="Transaction=all:*=info=enabled"/>

...I can't work out why those redundant leases are not being deleted without that level of trace.

I can see that the server mistakenly (hence issue #27911) attempts to create new recovery log files for a redundant peer server. Later FFDCs show an attempt to delete those same files and the directory that contains them. File deletion succeeds, but directory deletion fails with DirectoryNotEmptyExceptions. I can see those failures in the messages.log and FFDCs.

So, that's three problems I can see in the diagnostic files,

(A) Creating new recovery log files for redundant leases - issue #27911.
(B) An apparent failure to delete the redundant lease - we need the more detailed trace to see why that operation failed, but I also think we may need to change the code to report the failure, assuming it is a failure.
(C) A failure to delete the directory that holds the recovery logs - in this case more detailed trace would help, but it would be interesting to list what files are in those recovery log directories.

1/ Please could I ask you to repeat that server startup operation but with the fuller trace specification?
2/ Please could you list the contents of the recovery log directories for me? e.g. what do you see in directories like these,

/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kssn8/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kssn8/partnerlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-s5jw8/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-s5jw8/partnerlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-b58sd/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-b58sd/partnerlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kh4g6/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kh4g6/partnerlog
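
A sketch of one way to capture all of them in a single pass (assumes the shared layout shown above; illustrative only):

# List every peer's recovery log directories in one pass
for d in /opt/ol/wlp/usr/shared/tranlog/tl-*/; do
  echo "== $d"
  ls -la "${d}tranlog" "${d}partnerlog"
done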

Thank you.

tschurins commented 6 months ago

Hello,

Please find attached the logs with trace enabled. We have almost 2 GB of logs from pod startup, so I have attached the logs from when the exceptions are observed: the last two messages.log and trace.log files. Let me know if you need any more logs; we can find a way to upload them.

cbis-cbis-69595fff76-pjwxn-messages_log.zip cbis-cbis-69595fff76-pjwxn-trace_log.zip

Regarding the second question: for the old pods there is no content in the partnerlog or tranlog directories. They are empty directories.
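
A sketch of the kind of check that confirms this, including any hidden entries (illustrative):

# Show any remaining files, then list the directories with hidden entries included
$ find /opt/ol/wlp/usr/shared/tranlog -type f
$ ls -la /opt/ol/wlp/usr/shared/tranlog/tl-*/tranlog /opt/ol/wlp/usr/shared/tranlog/tl-*/partnerlog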

Tested with:

bash-5.1$ server version
Open Liberty 23.0.0.11 (1.0.83.cl231120231030-1102) on Eclipse OpenJ9 VM, version 1.8.0_382-b05 (en_US)

jonhawkes commented 4 months ago

Neil no longer works on OpenLiberty. I will take a look.