Bioconductor / packagebuilder

5 stars 7 forks source link

SPB failing after executing one package #6

Closed b-long closed 8 years ago

b-long commented 8 years ago

Iinitial investigation of the SPB has found the following :

All build nodes appear to in the same (or similar) error state. Also, we’ve got various state of the code on all SPB nodes. Right now, the SPB nodes appear to execute 1 job successfully and then start failing.  The next step will be to modify server.py to try and more detailed logging upon an error.

Because of the inconsistent state, I recommend that in the future we try to use something like clusterssh (I’ve used this before) to perform deployments, or in some other way consider automation. Of course, it’s a bit of a non-starter as paths are different (and I doubt we have an ssh server running on Windows), but we can get there through the use of environment variables and potentially an ssh service on our Windows nodes.

The log version is below :

On moscato2

The packagebuilder is out of date and is missing code changes :

Missing code changes

server.py is still running, but nothing appears in log for today :

server.py still running

On morelia

The packagebuilder is out of date and is missing code changes:

    morelia:packagebuilder pkgbuild$ git br
    * master
      tmp
    morelia:packagebuilder pkgbuild$ git diff --name-only HEAD..origin/master
    README.md
    README.txt
    spb_history/track_build_completion.py
    spb_history/viewhistory/helper.py
    workers/builder.py
    workers/installPkgDeps.R
    workers/server.py
    workers/zin2.sh

server.py is still running, but nothing appears in log for today :

    morelia:packagebuilder pkgbuild$ date
    Thu Nov 12 10:46:49 PST 2015
    morelia:packagebuilder pkgbuild$ ls -lrt |grep server
    -rwxr-xr-x   1 pkgbuild  staff    276 Oct  9 12:48 kill_server.py
    -rw-r--r--   1 pkgbuild  staff   4504 Oct 20 12:03 server.py
    -rw-r--r--   1 pkgbuild  staff   3441 Nov 10 14:19 server.log

On zin2

The packagebuilder is out of date, but not missing code changes:

    pkgbuild@zin2:~/packagebuilder$ git br
    * master
    pkgbuild@zin2:~/packagebuilder$ git diff --name-only HEAD..origin/master
    README.md
    README.txt

server.py is still running, but nothing appears in log for today :

    # Processes
    pkgbuild@zin2:~/packagebuilder$ ps -ef |grep server.py
    pkgbuild 23035 18893  0 09:35 pts/4    00:00:00 grep --color=auto server.py
    pkgbuild 32364     1  0 Nov10 ?        00:00:00 python server.py

    # Log files
    pkgbuild@zin2:~/packagebuilder$ date
    Thu Nov 12 09:36:06 PST 2015
    pkgbuild@zin2:~/packagebuilder$ ls -lrt |grep server
    -rwxr-xr-x    1 pkgbuild users    276 Oct 27 12:59 kill_server.py
    -rw-r--r--    1 pkgbuild users   4522 Oct 27 14:33 server.py
    -rw-r--r--    1 pkgbuild users    614 Nov 10 14:19 server.log

Although we turned up logging in ActiveMQ (running on broker.bioconductor.org), there’s nothing useful in the file activemq@broker:/home/ubuntu# less /var/lib/activemq/main/data/<wbr>activemq.log

Additionally, there’s nothing useful in the log that backs http://staging.bioconductor.org:8000/ , which can be found at biocadmin@staging:~/packagebuilder/spb_history/server.log .  This appears to only be logging HTTP requests of static content (css, and the generated job reports as HTML)

b-long commented 8 years ago

@vobencha Latest job is now running at http://staging.bioconductor.org:8000/job/327/ , however, I do not know if the SPB is stabilized.

b-long commented 8 years ago

Multiple problems found on morelia .

Issue 1

The first issue found, was related to the bash shell for the pkgbuild user. Simply using bash would result in the following :

morelia:~ pkgbuild$ 
bash: __git_ps1: command not found

This is the result of improper handling of the git-bash-completion module see discussion on SO. In older versions of git, it seems the module may have loaded by default (perhaps by /etc/bashrc or similar). However, now you must install it and load it in each user's ~/.bash_profile, like so :

# Install
curl -o ~/.git-prompt.sh \
    https://raw.githubusercontent.com/git/git/master/contrib/completion/git-prompt.sh

# Modify ~/.bash_profile
source ~/.git-prompt.sh
# Followed by
PS1='\u $(__git_ps1 "(%s)")\$ '

This matters, since the cron job for running server.py invokes bash as a login shell and may have failed since bash would always return error status.

Issue 2

After restarting server.py, another problem is seen in the log ( /Users/pkgbuild/packagebuilder/server.log ) :

Traceback (most recent call last):
  File "server.py", line 36, in <module>
    packagebuilder_home = os.environ["PACKAGEBUILDER_HOME"]
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'PACKAGEBUILDER_HOME'

I notice that in the ~/.bash_profile for pkgbuild user on zin2 , the variable PACKAGEBUILDER_HOME is unquoted, so I've mimicked that on morelia. Once again, I try launching server.py :

Traceback (most recent call last):
  File "server.py", line 36, in <module>
    packagebuilder_home = os.environ["PACKAGEBUILDER_HOME"]
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.py", line 23, in __getitem__
    raise KeyError(key)
KeyError: 'PACKAGEBUILDER_HOME'

Since it fails, I realize it must not be sourcing that variable properly. So, I look for it in the environment and notice it's not there:

bash-3.2$ env |grep HOME
HOME=/Users/pkgbuild

This is fixed using a login shell :

sh-3.2# su - pkgbuild
pkgbuild $ cd packagebuilder
pkgbuild (master)$ env |grep HOME
HOME=/Users/pkgbuild
PACKAGEBUILDER_HOME=/Users/pkgbuild/packagebuilder

So, we must modify the startup process (use an init script or the like) to guarantee proper environment.

After another restart, server.py appears to continue successfully on morelia and it begins writing log information to the path /Users/pkgbuild/packagebuilder/jobs/biomformat_20151112164117. However, the web page does not appear updated: http://staging.bioconductor.org:8000/job/328/ . Notably, Windows and Linux have gone through build phases.

b-long commented 8 years ago

The latest builds running against the SPB are below (we'll use these to attempt further triage). They were started in the sequence listed here :

biomformat

normalize450K

It's unclear why morelia has not written any output at as of yet.

cc/ @vobencha @mtmorgan

vobencha commented 8 years ago

Thanks for restarting these.

b-long commented 8 years ago

I've added some static analysis to the following PR ( #7 ) to help push us towards higher quality code, which will in turn increase stability.

b-long commented 8 years ago

cc/ @vobencha @hpages @mtmorgan @jimhester

It remains to be an issue that server.py (which acts more like a client) on the build nodes will get itself into a zombie-like state. When (other, yet-to-be-debugged failures occurs) the python server.py process in each build node will keep running. However, it will not do any work (manually kicking off a SPB job doesn't propogate to the builders, or at least not the logs).

When this happens, we must restart the "service" (not actually a service in Linux, but it is a service in Windows) that owns python server.py. Here are the steps to do that :

Moscato2

  1. Login as Administrator
  2. Open "Server Manager" GUI
  3. Open "Services"
  4. Find "BioC Single Package Builder"
  5. Right click -> Restart

Zin2

# Note the current time
pkgbuild@zin2:~/packagebuilder$ date
Thu Nov 19 10:48:50 PST 2015
# See the process is still running
pkgbuild@zin2:~/packagebuilder$ ps -ef |grep "server.py"
pkgbuild 25096     1  0 Nov12 ?        00:00:00 python server.py
pkgbuild 31587 31546  0 10:49 pts/11   00:00:00 grep --color=auto server.py
# Look for the most recent log entry
pkgbuild@zin2:~/packagebuilder$ ls -lrt |grep server
-rwxr-xr-x    1 pkgbuild users    276 Oct 27 12:59 kill_server.py
-rw-r--r--    1 pkgbuild users   4522 Oct 27 14:33 server.py
-rw-r--r--    1 pkgbuild users   3936 Nov 13 07:02 server.log
# Kill the process
pkgbuild@zin2:~/packagebuilder$ kill -9 25096
# Start the process
pkgbuild@zin2:~/packagebuilder$ nohup python server.py >> server.log 2>&1 &
[1] 31591
# See the log updated
pkgbuild@zin2:~/packagebuilder$ ls -lrt |grep server
-rwxr-xr-x    1 pkgbuild users    276 Oct 27 12:59 kill_server.py
-rw-r--r--    1 pkgbuild users   4522 Oct 27 14:33 server.py
-rw-r--r--    1 pkgbuild users   4006 Nov 19 10:49 server.log

Morelia

# Note the current time
pkgbuild (master)$ date
Thu Nov 19 10:28:16 PST 2015
# Python still running
pkgbuild (master)$ ps -ef |grep "server.py"
  505 20070     1   0 Thu01PM ??         0:00.10 python server.py
  505  7430  7347   0 10:28AM ttys000    0:00.00 grep server.py
pkgbuild (master)$ pwd
/Users/pkgbuild/packagebuilder
# Nothing in the log recently
pkgbuild (master)$ ls -lrt |grep server
-rwxr-xr-x   1 pkgbuild  staff    276 Oct  9 12:48 kill_server.py
-rw-r--r--   1 pkgbuild  staff   4522 Nov 12 12:46 server.py
-rw-r--r--   1 pkgbuild  staff   8107 Nov 13 07:02 server.log
# Kill the process
pkgbuild (master)$ kill -9 20070
# Restart the process
nohup python server.py >> server.log 2>&1 &
TODO

The shell command nohup python server.py ... run on zin2 and morelia should be turned into an /etc/init.d service.

jimhester commented 8 years ago

FWIW is kill -9 really necessary here? Would another less drastic signal work to shutdown the process?

http://unix.stackexchange.com/a/8927 comes to mind.

Don't bring out the combine harvester just to tidy up the flower pot.

b-long commented 8 years ago

D'oh!!! That's right... kill -9 may not be necessary. Completely slipped my mind to simply try kill (I know better, just forgetful). Of course, it's possible kill doesn't work, and then kill -9 may be necessary. :smile: I'll try kill next time (I'm sure it'll happen again).

b-long commented 8 years ago

This appears to be fixed by https://github.com/Bioconductor/packagebuilder/pull/11 . Closing this issue.