Closed dtenenba closed 6 years ago
It seems that the above is still not the whole story. Even after starting the server in the manner described above, sometimes it stops receiving messages. Investigating further.
One annoying this is that I cannot reproduce this with the local dev setup (https://github.com/Bioconductor/packagebuilder/blob/master/documentation/Developing.md) . It only seems to happen in the live production configuration.
OK, update.
My next theory was that server.py, after servicing that initial message, gets into some state where it fails to see subsequent messages. I created a fork, the misnamed (or misnumbered) https://github.com/Bioconductor/packagebuilder/tree/feature/debug_issue_52 in which on_message() dispatches the bulk of its work to the do_work()
function, run in a new thread. This seems to fix the issue though as we have learned it is probably too soon to say.
At any rate, I have checked out the build nodes to that branch and if it continues to work we can merge it.
The same change might need to be made to the other code that listens for messages (archiver.py and track_build_completion.py).
I guess I will leave this issue open for now. I'll close it after it's merged into master if that's what ends up happening.
Another update. Forgot to clarify before that windows is working again; the threading fix described above took care of it.
Also, it appeared to be necessary to update track_build_completion.py in the same way. I did so in the branch https://github.com/Bioconductor/spb_history/tree/feature/server_dispatch_to_thread . However, the working copy of spb_history on staging is checked out to a local-only branch called changes-on-staging, and I am not able to merge, so I just manually pasted the new contents of track_build_completion.py. We can resolve this in a less messy way on Monday.
So my theory overall is that after switching to the stomp.py library (which behaves fairly differently to the old stomp library--whereas the old one was single-threaded, the new one calls the callback in a different thread) the listeners always or at least often went into a non-responsive state after the first message. It seems that having the on_message() method of the listener dispatch to a new thread fixes this. It hasn't proved to be necessary for archiver,py (at least not yet) and I am not sure why.
OK, we'll see if this thing stays up over the weekend.
Hi Dan, I think we should discuss a few points. You stated in your first comment :
[the problem] has to do with using the shell script
run-build-server.sh
to start the server. If you instead just paste these commands into the shell, the script works, and responds to/acknowledges/processes messages:
Is this still true or have you found any other information about that shell script ?
Also, in the previous comment, you point out :
However, the working copy of spb_history on staging is checked out to a local-only branch called changes-on-staging
Yes, that is my fault. We need to devise a deployment strategy that takes into account development vs. production properties. Since production properties can include sensitive information, we can't store those sensitive values in GitHub. One idea to solve this problem is as follows : (1) use the non-public repository Bioconductor/private to store these sensitive values, (2) clone a copy of Bioconductor/private
to every production build node (e.g. in *nix environments at /etc/syconfig/bioconductor
) , (3) and modify the common config.py
to load sensitive values from such a location when the environment (mode) is set to production. This would allow us to keep the version-controlled environment always set to production, and the "dirty" part could be flipping the mode to "development" when working locally or in an environment like TravisCI.
cc/ @mtmorgan @vobencha @priscian
One thing that might be worth considering (and would presumably require little effort), is upgrading from stomp.py v. 4.1.8
to stomp.py v. 4.1.9
(released about a week ago). Many of the changes in 4.1.9 target ActiveMQ, and perhaps upgrading the Stomp client would resolve some of the issues we've seen.
Done.
Whoa! @dtenenba , I thought you were on vacation today :smile: Have you deployed these changes as well or should we add that to the backlog?
Deployed. Updated PIP-dependencies-* and deployed those to staging + build nodes, plus updated to stomp.py 4.1.9. Leaving later today. ;)
Awesome, thanks @dtenenba ! We'll have to wait and see how this affects stability.
Forgot to mention that I also restarted all listeners so they are actually running the new version of stomp.py.
I believe this is resolved. While occasionally we still run into a random period where listeners are not responding and have to be restarted this is now generally rare (and I think related to system updates and reboots at RPCI) . The SPB generally stays on and connected for the majority of the time.
(initially misfiled in BBS)
So as we know the SPB has been unstable and my theory was that if the activemq broker stops, everything that listens (server.py, archiver.py, track_build_completion.py) needs to be restarted.
That may be true but it is not the whole story. This morning the spb was unresponsive again and the broker had not gone down.
I was able to reproduce/observe the problem as follows:
On staging as biocadmin, kick off a build of a test package:
While on one of the build nodes, say pkgbuild@zin2, observing the server log:
You can see that it does not indicate receipt of the message. If you stop and restart the server, and then kick off the build again, it will display receipt of the message the first time you kick off a test build but not subsequent times.
In the course of investigating this, I stumbled upon the fact that it has to do with using the shell script
run-build-server.sh
to start the server. If you instead just paste these commands into the shell, the script works, and responds to/acknowledges/processes messages:That fixes things on Linux and Mac (I'm not sure why) but the same is not true on Windows.
If I paste all the following into a command window on moscato2 (logged in as dtenenba):
The server that runs only responds the first time a test build is kicked off, not to subsequent kickoffs.
So...I have temporarily removed moscato2 from the spb builds until we can figure this out. Meanwhile I have restarted a bunch of builds that @vobencha asked me to do. If anyone else needs any builds kicked off, let me know. They will not have windows builds but it's better than nothing I guess.
Note also that on moscato2 I have changed the spb from a service to a scheduled task that starts at startup. This seems a lot easier to manage. But it's currently a work in progress because of the above issue.
If I make any further discoveries about this issue I will share them here.
@b-long @mtmorgan @hpages @priscian