fix: fix race condition when schedule activation tasks

Alex-Izquierdo commented 1 month ago

There is a race condition when starting/restarting activations if the activations are processed by batches. It can be reproduced in slow VM's if you create a bunch of activations. Some activations go into failed state and follows the missing container policy leaving an orphaned pod (jira). It is already covered by the [e2e test suite] in (https://github.com/ansible/eda-qa/blob/main/eda_qa/tests/multinode-upstream/test_multinode.py) but difficult to reproduce in the existing CI environment.

https://github.com/ansible/eda-server/pull/894 helped but didn't fix it totally. After investigations, I found out that there is an overlap between the user requests enqueued by views and the monitor. This double entrypoint for user requests doesn't make sense IMO since it is only to obtain a couple of seconds of responsiveness and user requests are going to be processed anyway by the monitor.

This PR removes the enqueuing from the views and leverage completely the processing of the user requests to the monitor task. With only once piece in charge to process the user requests we remove overlaps and the window for the race condition.

I have also noticed a potential issue when running the monitor, as it is scheduled periodically and if for some reason it takes more time if we have more default workers (the expected scenario) another instance of the monitor will be running with unexpected results. The cronjob now runs a wrapper that schedules the monitor using "unique_enqueue" so we can make sure there is only one instance of the monitor running at once.

It also adds atomicity at the creation of the activation instance to remove another (small) race condition.

Another jira: https://issues.redhat.com/browse/AAP-24696

Alex-Izquierdo commented 1 month ago

The e2e multinode tests need an update because now a restarted activation in pending status has the default message before processing. The patch is in https://github.com/ansible/eda-qa/pull/378 Here is the successful execution: https://github.com/ansible/eda-server/actions/runs/9274871163/job/25518243154

Alex-Izquierdo commented 1 month ago

Successful E2E from https://github.com/ansible/eda-qa/pull/378 in https://github.com/ansible/eda-server/actions/runs/9290443719

Alex-Izquierdo commented 1 month ago

Don't merge it yet, please. I'm doing more tests to be sure what is all we need to fix the problem.

ansible / eda-server

fix: fix race condition when schedule activation tasks #916