distributed-system-analysis / pbench

A benchmarking and performance analysis framework
http://distributed-system-analysis.github.io/pbench/
GNU General Public License v3.0
188 stars · 108 forks

Use `systemctl` to launch the Pbench Server container when deploying #3516

Closed · webbnh closed this 1 year ago

webbnh commented 1 year ago

Since we use systemctl to run the containerized Pbench Server in the Staging deployment (and since we'll do the same in Production), we should use the same mechanism when running the Server for functional test or development.

This PR modifies the server/pbenchinacan/deploy script to generate a systemd unit file and enable the resulting service to start the canned Pbench Server. It also modifies the cleanup code in jenkins/run-server-func-tests to use systemctl commands to shut it down again.

This PR is split into multiple commits: the first is just a lint pass over the two modules; the others contain the actual changes. (They'll be squashed together for the actual merge.)

PBENCH-1116
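As a rough sketch of the change described above (the unit name, file contents, and podman invocation here are illustrative assumptions, not the actual deploy script's output), generating and enabling a user-level unit might look like:

```shell
# Hypothetical sketch only; the unit name, image, and podman arguments
# are assumptions, not what server/pbenchinacan/deploy actually emits.
unit_dir="${XDG_CONFIG_HOME:-$HOME/.config}/systemd/user"
mkdir -p "$unit_dir"
cat > "$unit_dir/pbench-server.service" <<'EOF'
[Unit]
Description=Containerized Pbench Server (canned)

[Service]
ExecStart=/usr/bin/podman run --rm --name pbench-server pbench-server:latest
ExecStop=/usr/bin/podman stop pbench-server

[Install]
WantedBy=default.target
EOF

# Register and start the service (requires a live systemd user session):
# systemctl --user daemon-reload
# systemctl --user enable --now pbench-server.service
```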

webbnh commented 1 year ago

> I do wonder about the fallout of this extra failure mode on the Jenkins executors: the risk of leaving a service enabled, and the impact on a subsequent CI job scheduled on the same executor ...

Yep, I'm right there with you. And I just checked: the service is still there, "loaded" and "inactive (dead)", having exited successfully... I don't know whether that corresponds to "enabled" or "disabled".

Nevertheless, I'm expecting/hoping that the --force flag on the enable command will overcome any resistance from this, so that subsequent jobs are unaffected. My principal concern was that systemctl would restart the Pbench Server and keep it running (I saw this in my test environment, much to my bemusement!). I'm not sure whether I'm happy or unhappy to find that it didn't in the CI... perhaps this is the result of the lack of the "linger" option?

> is it really worthwhile and worth that "contamination" risk?

I have the same concern.

What we really want is an automatic service disable at logout of the CI job session (and even that would be a problem if we ever enable parallel jobs on an executor).

I'm hoping that I can add something to the end of the Jenkins pipeline file to effect that (I don't think we're ever going to enable more than one Jenkins job on a single executor...). 🤞

dbutenhof commented 1 year ago

>> I do wonder about the fallout of this extra failure mode on the Jenkins executors: the risk of leaving a service enabled, and the impact on a subsequent CI job scheduled on the same executor ...
>
> Yep, I'm right there with you. And I just checked: the service is still there, "loaded" and "inactive (dead)", having exited successfully... I don't know whether that corresponds to "enabled" or "disabled".
>
> Nevertheless, I'm expecting/hoping that the --force flag on the enable command will overcome any resistance from this, so that subsequent jobs are unaffected. My principal concern was that systemctl would restart the Pbench Server and keep it running (I saw this in my test environment, much to my bemusement!). I'm not sure whether I'm happy or unhappy to find that it didn't in the CI... perhaps this is the result of the lack of the "linger" option?

Yeah; without "linger" the service should be killed once the user no longer has any active login session. That's OK so long as we don't have overlapping login sessions keeping the session effectively continuous. Since we have the executors single-streamed, that's probably not a problem unless someone with executor access leaves an interactive session logged in across the end of a CI job! (But if that happens ...)

And while --runtime might serve as a "last ditch blade guard", it's pretty far downstream since the executors don't restart often. We'll be breaking CI jobs with obsolete containers claiming ports long before we get there. To the point where I don't see how it can be considered helpful at all. It just seems to complicate the model.

>> is it really worthwhile and worth that "contamination" risk?
>
> I have the same concern.
>
> What we really want is an automatic service disable at logout of the CI job session (and even that would be a problem if we ever enable parallel jobs on an executor).
>
> I'm hoping that I can add something to the end of the Jenkins pipeline file to effect that (I don't think we're ever going to enable more than one Jenkins job on a single executor...). 🤞

There's an interesting distinction between "stop" and "disable" ... without "linger" the service is supposed to shut down when there are no active user sessions, but I think the service remains registered and will restart automatically on the next login. This is probably what you observed. What we want is probably more like a non-linger start without an enable, which I think would mean that even if it stays "known" it wouldn't re-start automatically on login.

Ultimately, for real deployment we want to enable a service, but for testing and CI I'm not sure we want to do more than start and stop. Which brings into question whether "consolidating" on using systemd even makes sense. They're not going to be the same, because in testing the contamination of retaining a service registration past the session is "evil" whereas for deployment it's "essential"...
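The start/stop-versus-enable/disable distinction described above can be sketched as follows (the `run` wrapper just records each command so the sketch can execute without a live systemd session; in real use, drop the wrapper and invoke the commands directly, and note the unit name is a placeholder, not the actual Pbench unit):

```shell
# "run" records commands instead of executing them, so this sketch is
# runnable anywhere; the unit name is a placeholder.
log=$(mktemp)
run() { echo "$*" | tee -a "$log"; }

# Functional test / CI: transient lifecycle; the unit is started but
# never registered, so nothing should persist past the session.
run systemctl --user start pbench-server.service
run systemctl --user stop pbench-server.service

# Deployment: register the unit so systemd starts it on future logins,
# and enable linger so it survives logout with no active session.
run systemctl --user enable --now pbench-server.service
run loginctl enable-linger "$USER"
```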

webbnh commented 1 year ago

> that's probably not a problem unless someone with executor access leaves an interactive session logged in across the end of a CI job!

Ha! That's an angle that I hadn't considered.... 😨

> while --runtime might serve as a "last ditch blade guard", it's pretty far downstream since the executors don't restart often.

That's a fair point.

> We'll be breaking CI jobs with obsolete containers claiming ports long before we get there.

I don't think that's a problem: given that we're running the Pbench Server via the pbench-server service, to the extent that there is a conflict, it should show up at the service enablement. We shouldn't get far enough to start a second instance of the server, so there should never be port conflicts. (And, I'm hoping that the use of --force on the enable command will help resolve any conflicts favorably.)

> I think the service remains registered and will restart automatically on the next login.

Aside from the fact that I'm unclear on whether Jenkins jobs count as "logins", I just demonstrated that logging in interactively after a job exits does not cause the service to restart automatically.

> This is probably what you observed.

No...what I observed on my test machine was the systemctl disable command failing, followed by the script issuing podman commands to (successfully) kill the container, followed by the container miraculously reappearing (presumably restarted by systemd when it noticed the container stopping). That is, everything seemed to be working as intended, even if it was not what I mistakenly expected at the time (i.e., I didn't realize that the systemctl command had failed, but, armed with that knowledge, the result is not surprising).

> What we want is probably more like a non-linger start without an enable

Although the man page seems to indicate that one can use start without a prior enable, I couldn't figure out how to actually do it. (If you've got a clue, I would love to hear it, as I would prefer to skip the enable in the func-test case....)
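[For reference, a start without a prior enable generally does work once the unit file is in a systemd search path and the user manager has reloaded. A hedged sketch, again using a recording wrapper so it runs without a systemd session, with a placeholder unit name:]

```shell
# "run" records commands rather than executing them (drop it in real
# use); the unit name is a placeholder, not the actual Pbench unit.
run() { echo "$*"; }

run systemctl --user daemon-reload                     # pick up the new unit file
run systemctl --user start pbench-server.service       # no prior "enable" needed
run systemctl --user is-enabled pbench-server.service  # would report "disabled"
```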

webbnh commented 1 year ago

OK, so the tests passed this time, which included successfully superseding the systemctl configuration left over from the previous run, and the output looks eminently reasonable. So, I think the present approach will work.

I'll see if I can refactor deploy and pry the redeploy actions out of it. I'm on the fence but unenthusiastic about creating an undeploy script, though...@dbutenhof, you want to nudge me either way?

dbutenhof commented 1 year ago

> OK, so the tests passed this time, which included successfully superseding the systemctl configuration left over from the previous run, and the output looks eminently reasonable. So, I think the present approach will work.
>
> I'll see if I can refactor deploy and pry the redeploy actions out of it. I'm on the fence but unenthusiastic about creating an undeploy script, though...@dbutenhof, you want to nudge me either way?

I'm entirely on the fence; unless someone nudges me one way or the other I'll be unable to decide whether I have any real interest in nudging you. I thought having both as scripts would be consistent, but I'm not really sure how or why we'd use an undeploy separately. (I do however like your idea of pulling the dashboard out into a "redeploy" subsidiary, because that would be handy. We could even put it into a service file to automatically redeploy on a regular basis... although a Jenkins or hosted GitHub CD action would probably be a better choice.)

webbnh commented 1 year ago

OK, this is now ready for review.