IBM / core-dump-handler

Save core dumps from a Kubernetes Service or RedHat OpenShift to an S3 protocol compatible object store
https://ibm.github.io/core-dump-handler/
MIT License

JobScheduler keeps restarting #120

Closed marcelrend closed 1 year ago

marcelrend commented 1 year ago

I've tried to use the interval and schedule options instead of inotify. The tokio JobScheduler is able to create the job successfully but the container exits immediately after that.

It looks like the sched.start().await command doesn't work. Is this still working for others?

I've worked around it by adding this loop, though note that the container still crashed a few times on the tick and I'm not sure why. Other than that, it worked as expected.

loop {
    if let Err(e) = sched.tick().await {
        error!("Tick problem {:#?}", e);
        panic!("Tick problem, {:#?}", e);
    }
    // use the async sleep: std::thread::sleep here would block
    // the tokio executor thread for the whole 10 seconds
    tokio::time::sleep(Duration::from_secs(10)).await;
}

I'm now switching back to inotify, so for me this is not a problem anymore, but perhaps this finding helps others.

I'm running OpenShift 4.10 on premises, using the latest image (quay.io/icdh/core-dump-handler:v8.8.0).

No9 commented 1 year ago

Thanks for reporting this @MrMarshall. There have been a lot of changes in the tokio scheduler since this project started, and while we have tests to confirm that creating the schedule works, we obviously need more coverage. I'll take a look over the next few days and try to understand this more.

No9 commented 1 year ago

I think this is what's breaking it: https://github.com/IBM/core-dump-handler/blob/main/core-dump-agent/src/main.rs#L194. The tokio scheduler now has a separate new_async handler, so I am going to migrate to it: https://github.com/mvniekerk/tokio-cron-scheduler/blob/main/examples/lib.rs#L43. I'm also going to add a scheduler integration test so this will be caught pre-release.

marcelrend commented 1 year ago

Nicely found :) Thanks @No9

No9 commented 1 year ago

OK @MrMarshall, I've created a fix on this branch: https://github.com/IBM/core-dump-handler/tree/scheduler-fix. Would you have time to pull the branch and verify?

The image is prebuilt and tagged scheduler-fix, so it's just a matter of cloning the scheduler-fix branch and changing the schedule to what you want it to be: https://github.com/IBM/core-dump-handler/blob/scheduler-fix/charts/core-dump-handler/values.yaml#L44

[Edit] I have verified it on IKS but it would be good to confirm on OpenShift 4.10

marcelrend commented 1 year ago

@No9 I still get the same behavior, unfortunately. I've checked out the scheduler-fix branch, confirmed your commits were in it, and built a new Docker image.

Any idea what could cause this? I wasn't expecting this if you were able to reproduce and fix the problem.

I can try to fix it myself after my Christmas holiday if you like :)

No9 commented 1 year ago

@MrMarshall It also looks like the scheduler no longer blocks, so the main execution path falls through to an exit. I've added a blocking loop and updated the branch.

marcelrend commented 1 year ago

@No9 awesome, works as expected now! Oh and thanks for building this app, it's saved us a lot of time and headaches!

No9 commented 1 year ago

Excellent - thanks for looking at it before the break. I'll package and do a release over the weekend. Really appreciate the kind feedback too; it makes it all worthwhile.

No9 commented 1 year ago

Closing, as this fix is in release https://github.com/IBM/core-dump-handler/releases/tag/v8.9.0. Thanks again for all the validation @MrMarshall, and please open an issue if something else crops up.