Bioconductor / packagebuilder


server.py stops responding after first message #35

Closed dtenenba closed 6 years ago

dtenenba commented 8 years ago

(initially misfiled in BBS)

As we know, the SPB has been unstable, and my theory was that if the ActiveMQ broker stops, everything that listens to it (server.py, archiver.py, track_build_completion.py) needs to be restarted.

That may be true, but it is not the whole story. This morning the SPB was unresponsive again even though the broker had not gone down.

I was able to reproduce/observe the problem as follows:

On staging as biocadmin, kick off a build of a test package:

cd packagebuilder
source env/bin/activate
python rerun_build.py 558 https://tracker.bioconductor.org/file6458/spbtest_0.99.0.tar.gz

Then, on one of the build nodes (say pkgbuild@zin2), observe the server log:

cd packagebuilder
tail -f server.log

You can see that the log does not indicate receipt of the message. If you stop and restart the server and then kick off the build again, it shows receipt of the message the first time you kick off a test build, but not on subsequent kickoffs.

In the course of investigating this, I stumbled on the fact that the problem has to do with using the shell script run-build-server.sh to start the server. If you instead just paste these commands into the shell, the server works, and responds to/acknowledges/processes messages:

source env/bin/activate
. workers/static-config.sh
nohup python -m workers.server > server.log 2>&1 &

That fixes things on Linux and Mac (I'm not sure why) but the same is not true on Windows.

If I paste all the following into a command window on moscato2 (logged in as dtenenba):

call env\Scripts\activate.bat

set BBS_PYTHON_CMD=C:\Python27\python.exe
set BBS_RSYNC_CMD=C:\cygwin\bin\rsync.exe

@rem With StrictHostKeyChecking=no, ssh will automatically add new host keys
@rem to the user known hosts files (so it doesn't get stalled waiting for an
@rem answer when not run interactively).
set BBS_SSH_CMD=%BBS_SSH_CMD% -qi %BBS_RSAKEY% -o StrictHostKeyChecking=no
set BBS_RSYNC_CMD=%BBS_RSYNC_CMD% -r --delete --exclude='.svn'
set BBS_RSYNC_RSH_CMD=%BBS_RSYNC_CMD% -e '%BBS_SSH_CMD%'
set BBS_CENTRAL_BASEURL=http://zin2/BBS/%BBS_BIOC_VERSIONED_REPO_PATH%
python -m workers.server > server.log 2>&1

The server that runs only responds the first time a test build is kicked off, not to subsequent kickoffs.

So...I have temporarily removed moscato2 from the SPB builds until we can figure this out. Meanwhile I have restarted a bunch of builds that @vobencha asked me to do. If anyone else needs any builds kicked off, let me know. They will not have Windows builds, but it's better than nothing, I guess.

Note also that on moscato2 I have changed the SPB from a service to a scheduled task that starts at startup. This seems a lot easier to manage, but it's currently a work in progress because of the above issue.

If I make any further discoveries about this issue I will share them here.

@b-long @mtmorgan @hpages @priscian

dtenenba commented 8 years ago

It seems that the above is still not the whole story. Even after starting the server in the manner described above, sometimes it stops receiving messages. Investigating further.

dtenenba commented 8 years ago

One annoying thing is that I cannot reproduce this with the local dev setup (https://github.com/Bioconductor/packagebuilder/blob/master/documentation/Developing.md). It only seems to happen in the live production configuration.

dtenenba commented 8 years ago

OK, update.

My next theory was that server.py, after servicing the initial message, gets into some state where it fails to see subsequent messages. I created a fork, the misnamed (or misnumbered) https://github.com/Bioconductor/packagebuilder/tree/feature/debug_issue_52 , in which on_message() dispatches the bulk of its work to the do_work() function, run in a new thread. This seems to fix the issue, though as we have learned, it is probably too soon to say.
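The pattern described above can be sketched roughly like this. This is a minimal illustration, not the actual server.py code: the BuilderListener class and its `processed` bookkeeping are hypothetical, and only the on_message()/do_work() dispatch follows the description above.

```python
import threading

# Sketch of the fix described above: on_message() hands each message off
# to do_work() on a fresh thread, so the stomp.py receiver loop is never
# blocked by a long-running build. The class and its bookkeeping are
# illustrative, not the real server.py.

class BuilderListener(object):
    def __init__(self):
        self.processed = []

    def do_work(self, headers, body):
        # The long-running build logic would live here; running it on a
        # worker thread keeps the broker connection responsive.
        self.processed.append(body)

    def on_message(self, headers, body):
        # Dispatch immediately instead of working inline, so subsequent
        # messages can still be delivered and acknowledged.
        worker = threading.Thread(target=self.do_work, args=(headers, body))
        worker.daemon = True
        worker.start()
        return worker
```

Returning the thread handle is just to make the sketch easy to exercise; the real listener would not need it.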

At any rate, I have checked out the build nodes to that branch and if it continues to work we can merge it.

The same change might need to be made to the other code that listens for messages (archiver.py and track_build_completion.py).

I guess I will leave this issue open for now. I'll close it after it's merged into master if that's what ends up happening.

dtenenba commented 8 years ago

Another update. I forgot to clarify before that Windows is working again; the threading fix described above took care of it.

Also, it appeared to be necessary to update track_build_completion.py in the same way. I did so in the branch https://github.com/Bioconductor/spb_history/tree/feature/server_dispatch_to_thread . However, the working copy of spb_history on staging is checked out to a local-only branch called changes-on-staging, and I am not able to merge, so I just manually pasted in the new contents of track_build_completion.py. We can resolve this in a less messy way on Monday.

So my overall theory is that after switching to the stomp.py library (which behaves quite differently from the old stomp library: the old one was single-threaded, while the new one calls the callback on a different thread), the listeners always, or at least often, went into a non-responsive state after the first message. It seems that having the on_message() method of the listener dispatch to a new thread fixes this. It hasn't proved necessary for archiver.py (at least not yet), and I am not sure why.

OK, we'll see if this thing stays up over the weekend.

b-long commented 8 years ago

Hi Dan, I think we should discuss a few points. You stated in your first comment:

[the problem] has to do with using the shell script run-build-server.sh to start the server. If you instead just paste these commands into the shell, the script works, and responds to/acknowledges/processes messages:

Is this still true, or have you found any other information about that shell script?

Also, in the previous comment, you point out:

However, the working copy of spb_history on staging is checked out to a local-only branch called changes-on-staging

Yes, that is my fault. We need to devise a deployment strategy that takes into account development vs. production properties. Since production properties can include sensitive information, we can't store those sensitive values in GitHub. One idea to solve this problem is as follows: (1) use the non-public repository Bioconductor/private to store these sensitive values, (2) clone a copy of Bioconductor/private to every production build node (e.g. in *nix environments at /etc/sysconfig/bioconductor), and (3) modify the common config.py to load sensitive values from that location when the environment (mode) is set to production. This would allow us to keep the version-controlled environment always set to production; the "dirty" part would be flipping the mode to "development" when working locally or in an environment like TravisCI.
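Step (3) could be sketched roughly like this, assuming a JSON file in the node-local checkout. The path, file name, and key names here are hypothetical, not the real layout of Bioconductor/private or the real config.py.

```python
import json

# Assumed node-local location of the sensitive values (hypothetical).
PRIVATE_CONFIG_PATH = "/etc/sysconfig/bioconductor/spb.json"

# Safe defaults for local development and CI (hypothetical keys).
DEV_DEFAULTS = {"broker_url": "localhost:61613", "broker_password": "guest"}

def load_config(mode, private_path=PRIVATE_CONFIG_PATH):
    """Return the active configuration for the given mode.

    In production, sensitive values are read from a file that lives only
    on the build node, never in GitHub; elsewhere the defaults are used.
    """
    if mode == "production":
        with open(private_path) as f:
            overrides = json.load(f)
        cfg = dict(DEV_DEFAULTS)
        cfg.update(overrides)
        return cfg
    return dict(DEV_DEFAULTS)
```

With this shape, the checked-in mode can stay "production" and a developer flips it to "development" locally, as described above.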

cc/ @mtmorgan @vobencha @priscian

b-long commented 8 years ago

One thing that might be worth considering (and would presumably require little effort), is upgrading from stomp.py v. 4.1.8 to stomp.py v. 4.1.9 (released about a week ago). Many of the changes in 4.1.9 target ActiveMQ, and perhaps upgrading the Stomp client would resolve some of the issues we've seen.
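Assuming the PIP-dependencies files pin exact versions, the upgrade would just be a pin bump along these lines (the file contents shown are illustrative, not the actual dependency list):

```
# PIP-dependencies (illustrative fragment)
stomp.py==4.1.9
```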

dtenenba commented 8 years ago

Done.

b-long commented 8 years ago

Whoa! @dtenenba, I thought you were on vacation today :smile: Have you deployed these changes as well, or should we add that to the backlog?

dtenenba commented 8 years ago

Deployed. Updated PIP-dependencies-* and deployed those to staging + build nodes, plus updated to stomp.py 4.1.9. Leaving later today. ;)

b-long commented 8 years ago

Awesome, thanks @dtenenba! We'll have to wait and see how this affects stability.

dtenenba commented 8 years ago

Forgot to mention that I also restarted all listeners so they are actually running the new version of stomp.py.

lshep commented 6 years ago

I believe this is resolved. While we still occasionally run into a random period where listeners are not responding and have to be restarted, this is now generally rare (and I think related to system updates and reboots at RPCI). The SPB generally stays up and connected the majority of the time.