JasperFx / marten

.NET Transactional Document DB and Event Store on PostgreSQL
https://martendb.io
MIT License
2.75k stars 429 forks source link

Integration tests using WAF randomly fails after version 7.12.0 #3237

Closed baltie closed 1 month ago

baltie commented 2 months ago

We have a large number of integration tests using WebApplicationFramework (WAF) and XUnit and test collections in XUnit. These have worked fine up until version 7.12.0, but after that, they are all failing intermittently, both locally and in our CI/CD workflows. That is, either all fail, or all succeed.

I've tracked down the problem to be the changes introduced in PR #3190 and how _cancellation in ProjectionCoordinator is now handled.

What happens is this:

  1. After all tests in our test collection are done, a teardown procedure is started
  2. Due to a bug/problem (feature?) in WAF, the host is disposed of twice. See https://github.com/dotnet/aspnetcore/issues/40271
  3. The ProjectionCoordinator is also then stopped twice, which means StopAsync is called twice.
  4. PauseAsync is then called twice, where the first one succeeds, but the next one fails because the _cancellation CancellationTokenSource has been cancelled and disposed by the first call to PauseAsync.

If the whole test suit fails or not depends on which one of the execution paths fails - if it is the one that XUnit monitors.

It seems to me that Microsoft is not inclined to fix the double dispose. Earlier PRs to fix the matter have been closed and not merged, and the proposed solution is usually to fix the issue in the hosted service so that it handles being stopped twice.

This wasn't a problem with ProjectionCoordinator before, though, until the changes mentioned around pause and resume.

The fix should be simple. There is no need to dispose the CTS right after the cancel in PauseAsync() - you can instead call _cancellation?.Dispose() in StartAsync() before _cancellation = new(). A CTS that is not linked nor uses a timer is not really using any unmanaged resources, so it doesn't really matter if it is left to live until StartAsync() is called again.

jeremydmiller commented 2 months ago

@baltie I think this is an "I take pull requests" with a relatively quick turnaround

baltie commented 2 months ago

@jeremydmiller I've made a PR for this now. I've tested it, and all our integration tests work after the change.

jeremydmiller commented 2 months ago

Okay, I'll try to get that out tomorrow maybe

baltie commented 1 month ago

Thank you! 😃