atomikos / transactions-essentials

Development repository for next major release of
https://www.atomikos.com/Main/TransactionsEssentials
Other
461 stars 139 forks source link

Spring Boot + Transaction Essentials 6.0.0 M2 can hang on shutdown when included but unused #178

Open jim-jobcase opened 1 year ago

jim-jobcase commented 1 year ago

Sharing this because there was a behavioral change with Atomikos with the latest 6.0 milestone. The original scenario is debatable, but the behavior was surprising.

I've got a library that includes the Atomikos Essentials dependency but doesn't always use JTA. With 6.0.0M2 some of our tests will hang on shutdown for 5 minutes. I've stripped this down to an example which is included.

atomikos-test-case.tar.gz

If you run mvn clean test with this project as is, it will hang for 5 minutes. If you modify the pom.xml and update the Atomikos version to 6.0.0M1 and re-run the test, the test will execute and shutdown in a couple of seconds.

This particular test case doesn't enable JTA - see src/test/java/com/example/ApplicationForTest.java, where it disables various persistence related classes but not AtomikosAutoConfiguration.class. If you add AtomikosAutoConfiguration.class to the exclude list with the 6.0.0M2 version it will shutdown quickly.

I can work with fix of adding the AtomikosAutoConfiguration.class exclusion where needed to my project, but I was still surprised that there was a 5 minute wait on shutdown when nothing touched or used JTA other than the Atomikos Auto Configuration.

Based on a little debugging, it looks like TransactionServiceImp.shutdown() will call performRecovery(), but it will sleep for the max timeout even if the number of active transactions is 0. Shouldn't an active transaction count be included in that check?

Thank you in advance for any feedback and/or fixes!

GuyPardon commented 1 year ago

Hi,

Good observation :-)

Indeed max_timeout is waited for because it is the ultimate delay used for recovery.

Checking for active transactions being 0 is not sufficient, because there may be pending transactions in backend resources from previous application runs. Those would also need cleaning up by recovery, possibly during shutdown.

My guess is most applications will have at least performed on transaction that may leave pending work in the backends. If that hypothesis is not workable then we might look at more sophisticated ways to check.

What do you think?

Thanks

jim-jobcase commented 1 year ago

As a practical matter, I see a default 5 minute wait on shutdown as a one that will result in bug reports for you. Most deployment automation setups I've worked with have a timeout of 30-60 seconds, so unless the application developer is aware of this need and tunes their configuration further, the recovery process is likely to never complete. It may help to at least log what transactions you're aware of that are still pending to justify the wait to anyone reading the logs.

I did spend a bit more time going through the source code so I could better understand your comment on pending transactions on the backend. I found that the TransactionServiceImp class calls RecoveryDomainService.performRecovery(), which looks like the beginning of that process. In that method once you are through the body of performing the operation, wouldn't the allOk variable with a value of true indicate that there are no backend resources with pending transactions, or if there were, they are now recovered? If so, it seems like if that were returned, it could help indicate that no additional wait is necessary. Of course, I am new to this code and may be misunderstanding something, so many apologies if I'm oversimplifying things!

Thank you

GuyPardon commented 1 year ago

Hi,

The allOK check is there to signal what records can be deleted from the transaction's commit logs. It does not say anything about pending transactions in any resource.

Let me check if anything can be tuned.

Thanks,