Anthony-Nolan / Atlas

A free & open-source Donor Search Algorithm Service
GNU General Public License v3.0

Integration tests intermittently timeout after an hour #756

Open benbelow opened 2 years ago

benbelow commented 2 years ago

Seen periodically on AN's system tests job on the CI server: tests that usually take minutes time out when they hit the Azure DevOps limit of 60 mins.

Never seen locally.

Most commonly seen with the donor import's tests, but also seen elsewhere at times.

benbelow commented 2 years ago

Investigation comments copied from AN's JIRA:

Ongoing investigation notes

Running the systems test pipeline every hour (via feature branch), as donor import doesn’t fail every time.

Also reduced the timeout for donor import tests to 10 mins. They only take 2.5 mins to complete when they succeed, so there is no need to wait a full hour for the timeout.

Increased dotnet test -v logging to level diag and added --blame to capture a crash dump if/when the tests do fail due to timeout. At the very least, this should tell us which test was running at the point of timeout.
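
The two changes above (a shorter timeout and more diagnostic test output) could be scripted roughly as follows. This is a minimal sketch, not the pipeline's actual configuration: the project name `Some.Tests.csproj` and both helper functions are hypothetical, and it assumes the installed SDK supports the `--blame-hang` and `--blame-hang-timeout` options of `dotnet test`, which are separate from the plain `--blame` mentioned above.

```python
import subprocess
import sys


def build_dotnet_test_command(project, hang_timeout="10m"):
    """Build a `dotnet test` command line with diagnostic logging and hang dumps.

    `project` is a placeholder; the real test project names are not given here.
    """
    return [
        "dotnet", "test", project,
        "-v", "diag",                          # most verbose output level
        "--blame-hang",                        # collect a dump if a test hangs
        "--blame-hang-timeout", hang_timeout,  # per-test hang limit
    ]


def run_with_timeout(command, timeout_seconds):
    """Run a command and return (timed_out, exit_code).

    subprocess.run kills the child process when the timeout expires,
    so a hung run fails fast instead of waiting for the CI-wide limit.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return True, None
    return False, completed.returncode


if __name__ == "__main__":
    # Stand-in for a hung test run: a child process that sleeps too long.
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_timeout(hung, timeout_seconds=0.5))
```

In a real run, the command from `build_dotnet_test_command` would be passed to `run_with_timeout` with a limit of about 600 seconds, matching the 10 min timeout described above.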

Of the many runs over the course of a day, only this one actually had the donor import timeout:

https://dev.azure.com/anthony-nolan-nova/Atlas/_build/results?buildId=11798&view=logs&j=621284ed-f660-537d-babb-3e69286e73e1

The diag option means the log is now huge and still needs going through; unfortunately, the --blame option didn’t save anything to the run as an artefact.

Zabeen Patel 30 March 2021, 11:17

Last test that passed before timeout:

29/03 - ImportDonors_WhenLastDonorInFileFails_DoesNotSendNotificationForEarlierDonor

25/03 - GetDonorsByIds_ReturnsSelectedDonors

16/03 - ImportDonors_WhenLastDonorInFileFails_DoesNotSendNotificationForEarlierDonor

15/03 - GetDonorsByIds_ReturnsSelectedDonors

13/03 - ImportDonors_WhenLastDonorInFileFails_DoesNotSendNotificationForEarlierDonor
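
Finding the last passed test in each run log, as listed above, could be automated along these lines. This is only a sketch under an assumed log format: it supposes each passed test appears on a line of the form `Passed <TestName> [<duration>]`, which may not match the actual diag output, and both function names are hypothetical.

```python
import re
from collections import Counter

# Assumed line shape: optional indent, the word "Passed", then the test name.
PASSED_LINE = re.compile(r"^\s*Passed\s+(\S+)")


def last_passed_test(log_lines):
    """Return the name of the last test reported as passed, or None."""
    last = None
    for line in log_lines:
        match = PASSED_LINE.match(line)
        if match:
            last = match.group(1)
    return last


def tally_last_passed(logs):
    """Count how often each test was the last to pass across several run logs."""
    names = (last_passed_test(lines) for lines in logs)
    return Counter(name for name in names if name)
```

Tallying the results across runs makes it easier to see whether the hang always follows the same one or two tests, as the dates above suggest.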

Zabeen Patel 30 March 2021, 11:03

I forced the test suites to run in series by limiting the number of suitable agents that can run the systems test pipeline to 1 (method). The first 3 runs were fine, but the timeout issue occurred on the 4th attempt, on the donor import test suite.

Benjamin Below 26 March 2021, 11:58

Actually another idea just sprang to mind: can we enforce that these run in series rather than in parallel and see if we can repro then? That might rule out any kind of resource locking by the parallel suites…

Benjamin Below 26 March 2021, 11:46

Zabeen Patel, I’ve tried to keep this card up to date with my experiments, but feel free to ping me for any more details (for which I may not be very useful, as this issue has me stumped :disappointed: )

Benjamin Below 16 March 2021, 13:24

Reduced bulk insert timeout to 5m. No change to the tests, so this isn’t the specific issue (but worth leaving in, as batches of 10,000 inserts seem to take under 1 minute in practice, so the existing value of 1 hour was misplaced).

Connection string timeout: this is set to 30min already, so if this were the issue I’d expect to see a timeout exception at least once.

Next step is probably some more verbose logging!

Benjamin Below 15 March 2021, 18:31

Next idea: SQL timeouts. Try reducing from 1 hour to something very small and see whether we at least start to get better exceptions?

Benjamin Below 15 March 2021, 17:51

https://dev.azure.com/anthony-nolan-nova/Atlas/_build/results?buildId=11520&view=results appears to still hit the timeout, even with the perf tests disabled via renaming the env var.

Benjamin Below 11 March 2021, 10:37

Timeboxed at an initial 2 days.

Benjamin Below 10 March 2021, 17:55

Interesting observation: on removing the import donors perf tests, this has timed out again twice in a row, on different components!

This is very strange and implies something more generic than a specific bad test.

Next step: I’d like to disable all tests using the “IgnoreExceptOnCIPerf” custom attribute, in case that’s somehow causing this. If it consistently passes without them, a pragmatic approach right now would be to fully ignore all such tests and raise a tech debt card to fix the perf tests/attributes, as they’re non-functional testing and not mission critical to delivering MVP.

Benjamin Below 10 March 2021, 13:51

This is happening often enough that it’s worth bumping to an MVP essential! Gonna remove the Atlas 0 points as we may want to re-estimate as inherited for the Nova team.

Benjamin Below 2 October 2020, 16:19

Appears to reach the test just before this one, then hang for most of the hour:

ImportDonors_AllValid_Performance

Sam Seed (not a request participant) 15 September 2020, 10:16

Mainly donor import, but now the MPA suite has also seen a failure.

Benjamin Below 9 September 2020, 17:42

Not seen this in several days, and not sure how to repro. Moving to test blocked until we see it again.

mmelchers commented 1 year ago

I do not know whether this bug is still present.

@seanmobrien @zabeen could you please check and close if necessary?