ansible / eda-server

Event Driven Ansible for AAP
Apache License 2.0
59 stars 35 forks source link

fix: set job id for scheduled jobs #931

Closed Alex-Izquierdo closed 1 month ago

Alex-Izquierdo commented 1 month ago

We have noticed that the lock mechanism in the scheduler is not 100% reliable and when multiple schedulers are executed at the exact same time they can duplicate the jobs. Set the job_id to a static value so we can make sure we schedule always only one job.

Jira: https://issues.redhat.com/browse/AAP-25019

jshimkus-rh commented 1 month ago

I'm concerned that the description, while not necessarily incorrect, glosses over a synchronization issue between parallel executions of, at least, enqueue_monitor_rulebook_processes. Also it seems to me (unless I've misunderstood) that these changes relative to the periodic tasks rely on the same mechanism the description calls out as problematic.

Alex-Izquierdo commented 1 month ago

@jshimkus-rh I think you might be mixing two different mechanisms, the scheduler lock system when it schedules the job, and the mechanism of the rqworker to fetch/update the job. We have not observed issues at the second one so far. In this case we were doing it wrong regardless of the lock system of the scheduler when we execute https://github.com/ansible/eda-server/blob/main/src/aap_eda/core/management/commands/scheduler.py#L111

I believe we should have had the job_id defined from the beginning, just we didn't hit the issue until start to play with multiple schedulers.

I agree that the description might be inaccurate, it suggests that is a problem of the scheduler, but the scheduler ensures that you can not have two instances setting the same job, here the problem is that prior to run the scheduler we initially created twice the job.

In this screenshot you can see how looks redis between this line and this line

Screenshot from 2024-06-06 13-47-09

jshimkus-rh commented 1 month ago

@jshimkus-rh I think you might be mixing two different mechanisms, the scheduler lock system when it schedules the job, and the mechanism of the rqworker to fetch/update the job. We have not observed issues at the second one so far.

I don't think so. And I do agree with the following...

I believe we should have had the job_id defined from the beginning, just we didn't hit the issue until start to play with multiple schedulers.

But that's also the source of my angst. I don't like the idea of depending on an external entity to prevent us from doing something that we shouldn't be doing; we should do that. But I don't see the need to start a holy war about it. I'll just have to live with it.