Closed b-long closed 8 years ago
@vobencha Latest job is now running at http://staging.bioconductor.org:8000/job/327/ , however, I do not know if the SPB is stabilized.
Multiple problems found on morelia .
The first issue found, was related to the bash shell for the pkgbuild user. Simply using bash would result in the following :
morelia:~ pkgbuild$
bash: __git_ps1: command not found
This is the result of improper handling of the git-bash-completion module see discussion on SO. In older versions of git, it seems the module may have loaded by default (perhaps by /etc/bashrc
or similar). However, now you must install it and load it in each user's ~/.bash_profile
, like so :
# Install
curl -o ~/.git-prompt.sh \
https://raw.githubusercontent.com/git/git/master/contrib/completion/git-prompt.sh
# Modify ~/.bash_profile
source ~/.git-prompt.sh
# Followed by
PS1='\u $(__git_ps1 "(%s)")\$ '
This matters, since the cron job for running server.py
invokes bash as a login shell and may have failed since bash would always return error status.
After restarting server.py
, another problem is seen in the log ( /Users/pkgbuild/packagebuilder/server.log
) :
Traceback (most recent call last):
File "server.py", line 36, in <module>
packagebuilder_home = os.environ["PACKAGEBUILDER_HOME"]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 23, in __getitem__
raise KeyError(key)
KeyError: 'PACKAGEBUILDER_HOME'
I notice that in the ~/.bash_profile
for pkgbuild user on zin2 , the variable PACKAGEBUILDER_HOME is unquoted, so I've mimicked that on morelia. Once again, I try launching server.py
:
Traceback (most recent call last):
File "server.py", line 36, in <module>
packagebuilder_home = os.environ["PACKAGEBUILDER_HOME"]
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 23, in __getitem__
raise KeyError(key)
KeyError: 'PACKAGEBUILDER_HOME'
Since it fails, I realize it must not be sourcing that variable properly. So, I look for it in the environment and notice it's not there:
bash-3.2$ env |grep HOME
HOME=/Users/pkgbuild
This is fixed using a login shell :
sh-3.2# su - pkgbuild
pkgbuild $ cd packagebuilder
pkgbuild (master)$ env |grep HOME
HOME=/Users/pkgbuild
PACKAGEBUILDER_HOME=/Users/pkgbuild/packagebuilder
So, we must modify the startup process (use an init script or the like) to guarantee proper environment.
After another restart, server.py
appears to continue successfully on morelia and it begins writing log information to the path /Users/pkgbuild/packagebuilder/jobs/biomformat_20151112164117
. However, the web page does not appear updated: http://staging.bioconductor.org:8000/job/328/ . Notably, Windows and Linux have gone through build phases.
The latest builds running against the SPB are below (we'll use these to attempt further triage). They were started in the sequence listed here :
biomformat
normalize450K
It's unclear why morelia has not written any output at as of yet.
cc/ @vobencha @mtmorgan
Thanks for restarting these.
I've added some static analysis to the following PR ( #7 ) to help push us towards higher quality code, which will in turn increase stability.
cc/ @vobencha @hpages @mtmorgan @jimhester
It remains to be an issue that server.py
(which acts more like a client) on the build nodes will get itself into a zombie-like state. When (other, yet-to-be-debugged failures occurs) the python server.py
process in each build node will keep running. However, it will not do any work (manually kicking off a SPB job doesn't propogate to the builders, or at least not the logs).
When this happens, we must restart the "service" (not actually a service in Linux, but it is a service in Windows) that owns python server.py
. Here are the steps to do that :
# Note the current time
pkgbuild@zin2:~/packagebuilder$ date
Thu Nov 19 10:48:50 PST 2015
# See the process is still running
pkgbuild@zin2:~/packagebuilder$ ps -ef |grep "server.py"
pkgbuild 25096 1 0 Nov12 ? 00:00:00 python server.py
pkgbuild 31587 31546 0 10:49 pts/11 00:00:00 grep --color=auto server.py
# Look for the most recent log entry
pkgbuild@zin2:~/packagebuilder$ ls -lrt |grep server
-rwxr-xr-x 1 pkgbuild users 276 Oct 27 12:59 kill_server.py
-rw-r--r-- 1 pkgbuild users 4522 Oct 27 14:33 server.py
-rw-r--r-- 1 pkgbuild users 3936 Nov 13 07:02 server.log
# Kill the process
pkgbuild@zin2:~/packagebuilder$ kill -9 25096
# Start the process
pkgbuild@zin2:~/packagebuilder$ nohup python server.py >> server.log 2>&1 &
[1] 31591
# See the log updated
pkgbuild@zin2:~/packagebuilder$ ls -lrt |grep server
-rwxr-xr-x 1 pkgbuild users 276 Oct 27 12:59 kill_server.py
-rw-r--r-- 1 pkgbuild users 4522 Oct 27 14:33 server.py
-rw-r--r-- 1 pkgbuild users 4006 Nov 19 10:49 server.log
# Note the current time
pkgbuild (master)$ date
Thu Nov 19 10:28:16 PST 2015
# Python still running
pkgbuild (master)$ ps -ef |grep "server.py"
505 20070 1 0 Thu01PM ?? 0:00.10 python server.py
505 7430 7347 0 10:28AM ttys000 0:00.00 grep server.py
pkgbuild (master)$ pwd
/Users/pkgbuild/packagebuilder
# Nothing in the log recently
pkgbuild (master)$ ls -lrt |grep server
-rwxr-xr-x 1 pkgbuild staff 276 Oct 9 12:48 kill_server.py
-rw-r--r-- 1 pkgbuild staff 4522 Nov 12 12:46 server.py
-rw-r--r-- 1 pkgbuild staff 8107 Nov 13 07:02 server.log
# Kill the process
pkgbuild (master)$ kill -9 20070
# Restart the process
nohup python server.py >> server.log 2>&1 &
The shell command nohup python server.py ...
run on zin2
and morelia
should be turned into an /etc/init.d
service.
FWIW is kill -9
really necessary here? Would another less drastic signal work to shutdown the process?
http://unix.stackexchange.com/a/8927 comes to mind.
Don't bring out the combine harvester just to tidy up the flower pot.
D'oh!!! That's right... kill -9
may not be necessary. Completely slipped my mind to simply try kill
(I know better, just forgetful). Of course, it's possible kill
doesn't work, and then kill -9
may be necessary. :smile: I'll try kill
next time (I'm sure it'll happen again).
This appears to be fixed by https://github.com/Bioconductor/packagebuilder/pull/11 . Closing this issue.
Iinitial investigation of the SPB has found the following :
All build nodes appear to in the same (or similar) error state. Also, we’ve got various state of the code on all SPB nodes. Right now, the SPB nodes appear to execute 1 job successfully and then start failing. The next step will be to modify
server.py
to try and more detailed logging upon an error.Because of the inconsistent state, I recommend that in the future we try to use something like clusterssh (I’ve used this before) to perform deployments, or in some other way consider automation. Of course, it’s a bit of a non-starter as paths are different (and I doubt we have an ssh server running on Windows), but we can get there through the use of environment variables and potentially an ssh service on our Windows nodes.
The log version is below :
On moscato2
The packagebuilder is out of date and is missing code changes :
server.py is still running, but nothing appears in log for today :
On morelia
The packagebuilder is out of date and is missing code changes:
server.py is still running, but nothing appears in log for today :
On zin2
The packagebuilder is out of date, but not missing code changes:
server.py is still running, but nothing appears in log for today :
Although we turned up logging in ActiveMQ (running on broker.bioconductor.org), there’s nothing useful in the file
activemq@broker:/home/ubuntu# less /var/lib/activemq/main/data/<wbr>activemq.log
Additionally, there’s nothing useful in the log that backs http://staging.bioconductor.org:8000/ , which can be found at
biocadmin@staging:~/packagebuilder/spb_history/server.log
. This appears to only be logging HTTP requests of static content (css, and the generated job reports as HTML)