tschurins opened this issue 10 months ago
I also have some logs with an additional exception, but I have no information about how to reproduce this one:
[2023-12-04T12:33:05.022+0000] 00000099 id=00000000 com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl I CWRLS0011I: Performing recovery processing for a peer WebSphere server (nwgcbisqm156f64785fc4rffl).
[2023-12-04T12:33:05.023+0000] 00000099 id=00000000 com.ibm.tx.jta.impl.TxRecoveryAgentImpl A WTRN0108I: Recovery initiated for server nwgcbisqm156f64785fc4rffl
[2023-12-04T12:33:05.063+0000] 00000194 id=00000000 com.ibm.ws.logging.internal.impl.IncidentImpl I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-nwg-cbis-qm1-56f64785fc-4rffl/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog 1593" at ffdc_23.12.04_12.33.05.0.log
[2023-12-04T12:33:05.094+0000] 00000194 id=00000000 com.ibm.ws.logging.internal.impl.IncidentImpl I FFDC1015I: An FFDC Incident has been created: "com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-nwg-cbis-qm1-56f64785fc-4rffl/tranlog com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.keypoint 1373" at ffdc_23.12.04_12.33.05.1.log
[2023-12-04T12:33:05.094+0000] 00000194 id=00000000 com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog A CWRLS0008E: Recovery log is being marked as failed. [ 1 transaction ]
[2023-12-04T12:33:05.094+0000] 00000194 id=00000000 com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog I CWRLS0009E: Details of recovery log failure: com.ibm.ws.recoverylog.spi.InternalLogException: Log was already closed: /opt/ol/wlp/usr/shared/tranlog/tl-nwg-cbis-qm1-56f64785fc-4rffl/tranlog
at com.ibm.ws.recoverylog.spi.LogHandle.keypointStarting(LogHandle.java:689)
at com.ibm.ws.recoverylog.spi.MultiScopeRecoveryLog.keypoint(MultiScopeRecoveryLog.java:1592)
[2023-12-04T12:33:05.239+0000] 00000194 id=00000000 com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl I CWRLS0014I: Halting any current recovery processing for a peer WebSphere server (nwgcbisqm156f64785fc4rffl).
[2023-12-04T12:33:05.273+0000] 00000099 id=00000000 com.ibm.ws.logging.internal.impl.IncidentImpl I FFDC1015I: An FFDC Incident has been created: "java.lang.NullPointerException com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl 1662" at ffdc_23.12.04_12.33.05.7.log
[2023-12-04T12:33:05.273+0000] 00000099 id=00000000 com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl A WTRN0108I: HADB Peer Recovery failed for server with recovery identity nwgcbisqm156f64785fc4rffl with exception java.lang.NullPointerException
Looking at the ffdc file, the stack trace is this one:
Stack Dump = java.lang.NullPointerException
at com.ibm.tx.jta.impl.TxRecoveryAgentImpl.initiateRecovery(TxRecoveryAgentImpl.java:532)
at com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl.directInitialization(RecoveryDirectorImpl.java:612)
at com.ibm.ws.recoverylog.spi.RecoveryDirectorImpl.peerRecoverServers(RecoveryDirectorImpl.java:1651)
at com.ibm.tx.jta.impl.LeaseTimeoutManager$LeaseChecker.alarm(LeaseTimeoutManager.java:157)
at com.ibm.tx.jta.util.alarm.AlarmManagerImpl$AlarmListenerWrapper.run(AlarmManagerImpl.java:75)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:826)
This may be related to https://github.com/OpenLiberty/open-liberty/issues/26832, handled by @neilgsyoung, as it also seems to have started with 23.0.0.9.
Hi,
Yes, I did make changes in this area between 23.0.0.8 and 23.0.0.9 and would like to help work out what's going wrong.
-> Is the problem reproducible? If so, please could you restart the server (pod) with trace enabled? A trace string specified like this,
<logging traceSpecification="Transaction=all:*=info=enabled"/>
in your server.xml should provide the information I need in a trace.log file in your server's logs directory.
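(If trace volume is a concern, the same logging element can also cap and roll the trace files; the limits below are only a suggestion, e.g. ten files of 100 MB each:)
<logging traceSpecification="Transaction=all:*=info=enabled"
         maxFileSize="100"
         maxFiles="10"/>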
-> Do you have the full messages.log file for the run that you took those snippets from? If you could attach that here, I'd like to take a look. There's something curious going on between local recovery (the recovery of the logs that the server owns) and peer recovery (where the server attempts to recover the logs of inactive, probably old, servers).
-> For completeness, would you be able to list the contents of your lease directory? That's the directory in the filesystem that holds a file for each server known to the environment. I think, with your configuration, the filesystem directory will be /opt/ol/wlp/usr/shared/leases/jta.recoveryGroup
Thank you.
Hello, we tried to reproduce the issue with more logging - here is what we captured. I hope it is enough to point you in the right direction...
Thank you - that definitely takes us forward...
-> The NullPointerException is ugly but should be fixed in the most recent code.
-> That lease directory image shows, I think, that the code is failing to tidy up after servers have shut down. New server invocations should be recovering "old" logs, deleting their transaction and partner log files and then deleting the appropriate file in the lease directory. We don't expect to see multiple entries like that.
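Roughly speaking, the tidy-up after a successful peer recovery should look something like the helper below. This is only an illustration of the expected sequence - the class, method names and lease file naming are hypothetical, not the actual Open Liberty recovery code:
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not the actual Open Liberty recovery code: the sequence
// we expect once peer recovery for 'recoveryId' has completed successfully.
public class PeerTidyUpSketch {

    static void tidyUpAfterPeerRecovery(String recoveryId) throws IOException {
        Path serverLogDir = Paths.get("/opt/ol/wlp/usr/shared/tranlog", "tl-" + recoveryId);
        Path leaseFile = Paths.get("/opt/ol/wlp/usr/shared/leases/jta.recoveryGroup", recoveryId);

        // 1. Delete the recovered transaction and partner log files, then their directories.
        deleteDirectory(serverLogDir.resolve("tranlog"));
        deleteDirectory(serverLogDir.resolve("partnerlog"));
        Files.deleteIfExists(serverLogDir);

        // 2. Delete the peer's file in the lease directory so the server is no
        //    longer treated as a recovery candidate on later passes.
        Files.deleteIfExists(leaseFile);
    }

    // Delete the files in a directory, then the (now empty) directory itself.
    static void deleteDirectory(Path dir) throws IOException {
        if (Files.isDirectory(dir)) {
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path p : entries) {
                    Files.deleteIfExists(p);
                }
            }
        }
        Files.deleteIfExists(dir);
    }
}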
-> From the trace file, I can see the server attempting to recover the logs belonging to an "old" server. That attempt repeatedly fails; something significant happened the first time the server tried to recover the "old" logs, and they have been marked as "failed". I think there are a couple of things we can do here to improve the code:
(A) In this version of the code, the failure to recover the logs of one server means that we don't attempt to recover any others.
(B) If the peer logs have been marked as "failed" there is no point in attempting to process them again and again (see the sketch below).
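To illustrate both changes, the peer recovery pass should behave more like the sketch below. The types here are hypothetical stand-ins so the example compiles; they are not the real Open Liberty classes:
import java.util.List;

// Hypothetical types standing in for whatever the real recovery code uses;
// they are not Open Liberty APIs, just enough for the sketch to compile.
interface RecoveryLog { boolean isMarkedFailed(); }
interface PeerServer { String recoveryIdentity(); RecoveryLog recoveryLog(); }

public class PeerRecoveryPassSketch {

    void recoverPeers(List<PeerServer> peersWithExpiredLeases) {
        for (PeerServer peer : peersWithExpiredLeases) {
            // (B) Skip logs that an earlier attempt has already marked as failed;
            //     retrying them every minute cannot succeed.
            if (peer.recoveryLog().isMarkedFailed()) {
                continue;
            }
            try {
                recover(peer);
            } catch (Exception e) {
                // (A) Report the failure but carry on with the remaining peers,
                //     rather than abandoning the whole recovery pass.
                System.err.println("Peer recovery failed for " + peer.recoveryIdentity() + ": " + e);
            }
        }
    }

    // Placeholder for the real per-peer recovery work.
    void recover(PeerServer peer) throws Exception {
    }
}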
-> I'm really keen to learn what happened when the new server first attempted to recover the logs of the peer, i.e. to see what caused the server to decide they should be "failed". Do you have any earlier trace or message logs? Startup trace would be ideal.
Thanks.
Hello, Here are some more logs: cbis-cbis-69595fff76-logs.zip
I hope this will be enough to shed some light on this issue :)
Thank you - I can make progress. I can immediately see one clear defect, so I've opened issue #27911 to address that.
Server startup is the key to understanding what's going on. The messages.log shows the defective behaviour at a gross level, but unfortunately the trace files only show "info" level messages. We've lost the detail provided by that trace string...
<logging traceSpecification="Transaction=all:*=info=enabled"/>
...and without that level of trace I can't work out why those redundant leases are not being deleted.
I can see that the server mistakenly (hence issue #27911) attempts to create new recovery log files for a redundant peer server. Later FFDCs show an attempt to delete those same files and the directory that contains them. File deletion succeeds but directory deletion fails with DirectoryNotEmptyExceptions. I can see those failures in the messages.log and ffdcs.
So, that's three problems that I can see in the diagnostic files:
(A) Creating new recovery log files for redundant leases - issue #27911.
(B) An apparent failure to delete the redundant lease - we need the more detailed trace to see why that operation failed, but I also think we may need to change the code to report the failure, assuming it is a failure.
(C) A failure to delete the directory that holds the recovery logs - in this case more detailed trace would help, but it would also be interesting to list what files are in those recovery log directories (see the listing sketch below).
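On (C): java.nio.file.Files.delete only throws DirectoryNotEmptyException when something is still inside the directory, so whatever remains after the individual file deletions is the clue. A small standalone listing utility along these lines (not Liberty code, just an illustration) would show exactly what is left behind in the directories named in question 2 below:
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Standalone diagnostic, not Liberty code: print every entry (and its size) in
// each directory passed on the command line, e.g. the tranlog/partnerlog
// directories listed below.
public class ListRecoveryLogDirs {
    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path dir = Paths.get(arg);
            System.out.println(dir + ":");
            if (!Files.isDirectory(dir)) {
                System.out.println("  (missing or not a directory)");
                continue;
            }
            boolean empty = true;
            try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
                for (Path p : entries) {
                    System.out.println("  " + p.getFileName() + "  " + Files.size(p) + " bytes");
                    empty = false;
                }
            }
            if (empty) {
                System.out.println("  (empty)");
            }
        }
    }
}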
1/ Please could I ask you to repeat that server startup operation, but with the fuller trace specification?
2/ Please could you list the contents of the recovery log directories for me? e.g. what do you see in directories like these:
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kssn8/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kssn8/partnerlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-s5jw8/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-s5jw8/partnerlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-b58sd/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-b58sd/partnerlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kh4g6/tranlog
/opt/ol/wlp/usr/shared/tranlog/tl-cbis-cbis-69595fff76-kh4g6/partnerlog
Thank you.
Hello,
Please find attached the logs with trace enabled. We have almost 2 GB of logs from pod startup. I have attached the logs from the period when the exceptions are observed - the last two messages.log and trace.log files. Let me know if you need any more logs; we can find a way to upload those.
cbis-cbis-69595fff76-pjwxn-messages_log.zip cbis-cbis-69595fff76-pjwxn-trace_log.zip
Regarding the second question: for the old pods there is no content in the partnerlog or tranlog directories. They are empty directories.
Tested with:
bash-5.1$ server version
Open Liberty 23.0.0.11 (1.0.83.cl231120231030-1102) on Eclipse OpenJ9 VM, version 1.8.0_382-b05 (en_US)
Neil no longer works on OpenLiberty. I will take a look.
Our JEE application runs on OpenLiberty and is deployed with Docker and Kubernetes. We use XA transactions and have configured the transaction logs to be saved on a shared disk. When stopping pods and starting new ones, we see a lot of errors in the console. Note that everything worked correctly in version 23.0.0.8. Note also that when we start new pods, they have a new identity.
Here is the console log. The successful recovery of the existing logs:
Followed by these errors:
Every minute after this first failure, there are new entries in the console:
Here is the transaction log configuration:
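(The attached snippet is not reproduced here; for illustration, a configuration enabling shared transaction logs and peer recovery looks roughly like the following - the recoveryGroup value and the ${env.HOSTNAME} variable are examples rather than our exact values:)
<transaction
    recoveryGroup="cbis"
    recoveryIdentity="${env.HOSTNAME}"
    transactionLogDirectory="/opt/ol/wlp/usr/shared/tranlog/tl-${env.HOSTNAME}"/>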
Attached are the ffdc files: ffdc.zip