Bioconductor / packagebuilder

5 stars 7 forks source link

SPB builds are failing #31

Closed b-long closed 8 years ago

b-long commented 8 years ago

The build nodes are not processing builds properly. For example, consider a recent build for @priscian . The package that needs to be built is "psygenet2r". I start the build using rerun_build, a part of spb_history (which lives here now).

On the controller node (staging.bioconductor.org)
source env/bin/activate
python -m rerun_build 1329 https://tracker.bioconductor.org/file6645/psygenet2r_0.9.0.tar.gz 

The build began normally, printing some information to the console :

(env)biocadmin@ip-172-30-0-180:~/packagebuilder (changes-on-staging)$ python -m rerun_build 1329 https://tracker.bioconductor.org/file6645/psygenet2r_0.9.0.tar.gz DEBUG: 12/21/2015 03:31:24 PM config.py - Loading configuration
INFO: 12/21/2015 03:31:24 PM config.py - Environment is set to: 'production'.
INFO: 12/21/2015 03:31:24 PM config.py - The following build nodes are enabled: ['zin2', 'moscato2'].
INFO: 12/21/2015 03:31:24 PM config.py - Finished loading configuration.
DEBUG: 12/21/2015 03:31:24 PM communication.py - The module search path is: 
['', '/home/biocadmin/spb_history/env/lib/python2.7', '/home/biocadmin/spb_history/env/lib/python2.7/plat-x86_64-linux-gnu', '/home/biocadmin/spb_history/env/lib/python2.7/lib-tk', '/home/biocadmin/spb_history/env/lib/python2.7/lib-old', '/home/biocadmin/spb_history/env/lib/python2.7/lib-dynload', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/home/biocadmin/spb_history/env/local/lib/python2.7/site-packages']
DEBUG: 12/21/2015 03:31:24 PM rerun_build.py - Received JSON object: '{"force": true, "job_id": "psygenet2r_20151221153124", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6645/psygenet2r_0.9.0.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz", "time": "Mon Dec 21 2015 15:31:24 GMT-0800 (PST)"}'
DEBUG: 12/21/2015 03:31:24 PM communication.py - Attempting to open connection to ActiveMQ at 'broker.bioconductor.org:61613'.
INFO: 12/21/2015 03:31:24 PM transport.py - Attempting connection to host broker.bioconductor.org, port 61613
INFO: 12/21/2015 03:31:24 PM transport.py - Established connection to host broker.bioconductor.org, port 61613
DEBUG: 12/21/2015 03:31:24 PM rerun_build.py - on_connecting() broker.bioconductor.org 61613.
DEBUG: 12/21/2015 03:31:24 PM communication.py - Not attempting authentication
DEBUG: 12/21/2015 03:31:24 PM rerun_build.py - on_send() STOMP {'accept-version': '1.1'} .
INFO: 12/21/2015 03:31:24 PM rerun_build.py - Receipt: None
DEBUG: 12/21/2015 03:31:24 PM transport.py - Sending frame ['STOMP', '\n', 'accept-version:1.1\n', '\n', '\x00']
DEBUG: 12/21/2015 03:31:24 PM communication.py - Stomp connection established.
DEBUG: 12/21/2015 03:31:24 PM rerun_build.py - on_send() SEND {'content-length': 331, 'destination': '/topic/buildjobs'} {"force": true, "job_id": "psygenet2r_20151221153124", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6645/psygenet2r_0.9.0.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz", "time": "Mon Dec 21 2015 15:31:24 GMT-0800 (PST)"}.
INFO: 12/21/2015 03:31:24 PM rerun_build.py - Receipt: None
DEBUG: 12/21/2015 03:31:24 PM transport.py - Sending frame ['SEND', '\n', 'content-length:331\n', 'destination:/topic/buildjobs\n', '\n', '{"force": true, "job_id": "psygenet2r_20151221153124", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6645/psygenet2r_0.9.0.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz", "time": "Mon Dec 21 2015 15:31:24 GMT-0800 (PST)"}', '\x00']

A page was created to track the build. However, output from the build nodes is never displayed: http://staging.bioconductor.org:8000/job/552/ .

On zin2 (a build node)

Next, I began investigating the build machine. The service running on all build nodes is server.py, which prints it's output to server.log. This service regularly receives the command to start build. In the log file we see that the service is "Waiting to do work.":

pkgbuild@zin2:~/packagebuilder$ grep -F "server.py" server.log  | tail -n 11
DEBUG: 12/21/2015 12:07:24 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:12:24 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:17:24 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:22:24 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:27:24 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:32:25 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:37:25 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:42:25 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:47:25 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:52:25 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:57:25 PM server.py - main() Waiting to do work.

Interestingly, If I omit the main() loop in searching the log file, we see that the last time instruction was received to perform a build is days ago:

pkgbuild@zin2:~/packagebuilder$ grep -F "server.py" server.log |grep -v "main()" | tail -n 11
DEBUG: 12/18/2015 04:58:58 PM server.py - on_before_message() {'expires': '0', 'timestamp': '1450486995478', 'destination': '/topic/buildjobs', 'persistent': 'true', 'priority': '4', 'x-client': '140.107.89.113', 'message-id': 'ID:broker-46292-1448469945488-2:539:-1:1:1', 'subscription': 'ae9a6f4bb94345918d287e135e57c789'} {"force": true, "job_id": "isomiRs_20151218165858", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6774/isomiRs_0.99.11.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1108:isomiRs_0.99.11.tar.gz", "time": "Fri Dec 18 2015 16:58:58 GMT-0700 (PST)"}.
INFO: 12/18/2015 04:58:58 PM server.py - Message received
DEBUG: 12/18/2015 04:58:58 PM server.py - on_message() job_dir = /home/pkgbuild/packagebuilder/workers/jobs/isomiRs_20151218165858.
INFO: 12/18/2015 04:58:58 PM server.py - job_dir: '/home/pkgbuild/packagebuilder/workers/jobs/isomiRs_20151218165858'
INFO: 12/18/2015 04:58:58 PM server.py - Attempting to run commands from directory: '/home/pkgbuild/packagebuilder'
INFO: 12/18/2015 04:58:58 PM server.py - shell_cmd: '['python', '-m', 'workers.builder', u'/home/pkgbuild/packagebuilder/workers/jobs/isomiRs_20151218165858/manifest.json', u'3.3']'
INFO: 12/18/2015 04:58:58 PM server.py - jobfilename: '/home/pkgbuild/packagebuilder/workers/jobs/isomiRs_20151218165858/manifest.json'
INFO: 12/18/2015 04:58:58 PM server.py - builder_log: '<open file u'/home/pkgbuild/packagebuilder/workers/jobs/isomiRs_20151218165858/builder.log', mode 'w' at 0x7f54838068a0>'
INFO: 12/18/2015 04:58:58 PM server.py - blderProcess: '<subprocess.Popen object at 0x7f548377f990>'
DEBUG: 12/18/2015 04:58:58 PM server.py - on_send() SEND {'content-length': 213, 'destination': '/topic/builderevents', 'persistent': 'true'} {"body": "Got build request...", "first_message": true, "job_id": "isomiRs_20151218165858", "bioc_version": "3.3", "builder_id": "zin2", "client_id": "single_package_builder_autobuild:1108:isomiRs_0.99.11.tar.gz"}.
INFO: 12/18/2015 04:58:58 PM server.py - Reply sent

I'm unsure what triggered this build, as I didn't kick it off. However, we see that it's page (in the Django application) is available and that a build completed for all nodes: http://staging.bioconductor.org:8000/job/545/ .

Looking at the running processes, we see that the server is still running. However, it's interesting that there's a defunct sub-process :

pkgbuild@zin2:~/packagebuilder$ ps -ef |grep python|grep -v grep
pkgbuild  1891 29223  0 Dec18 ?        00:00:00 [python] <defunct>
pkgbuild 29223     1  0 Dec18 ?        00:00:00 python -m workers.server
On moscato2 (another build node)

Dissimilarly, on moscato2, we see that the message to start the build is received :


DEBUG: 12/21/2015 12:18:36 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:23:36 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:28:36 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:31:21 PM server.py - on_before_message() {'content-length': '331', 'expires': '0', 'timestamp': '1450730145147', 'destination': '/topic/buildjobs', 'priority': '4', 'message-id': 'ID:broker-46292-1448469945488-2:552:-1:1:1', 'subscription': 'c7bce2b6036c4a999926bc256ca85c66'} {"force": true, "job_id": "psygenet2r_20151221153124", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6645/psygenet2r_0.9.0.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz", "time": "Mon Dec 21 2015 15:31:24 GMT-0800 (PST)"}.
INFO: 12/21/2015 12:31:21 PM server.py - Message received
DEBUG: 12/21/2015 12:31:21 PM server.py - on_message() job_dir = E:\packagebuilder\workers\jobs\psygenet2r_20151221153124.
INFO: 12/21/2015 12:31:21 PM server.py - job_dir: 'E:\packagebuilder\workers\jobs\psygenet2r_20151221153124'
INFO: 12/21/2015 12:31:21 PM server.py - Attempting to run commands from directory: 'E:\packagebuilder'
INFO: 12/21/2015 12:31:21 PM server.py - shell_cmd: '['python', '-m', 'workers.builder', u'E:\\packagebuilder\\workers\\jobs\\psygenet2r_20151221153124\\manifest.json', u'3.3']'
INFO: 12/21/2015 12:31:21 PM server.py - jobfilename: 'E:\packagebuilder\workers\jobs\psygenet2r_20151221153124\manifest.json'
INFO: 12/21/2015 12:31:21 PM server.py - builder_log: '<open file u'E:\\packagebuilder\\workers\\jobs\\psygenet2r_20151221153124\\builder.log', mode 'w' at 0x017AAA18>'
INFO: 12/21/2015 12:31:21 PM server.py - blderProcess: '<subprocess.Popen object at 0x017D8F90>'
DEBUG: 12/21/2015 12:31:21 PM server.py - on_send() SEND {'content-length': 221, 'destination': '/topic/builderevents', 'persistent': 'true'} {"body": "Got build request...", "first_message": true, "job_id": "psygenet2r_20151221153124", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz"}.
DEBUG: 12/21/2015 12:31:21 PM transport.py - Sending frame ['SEND', '\n', 'content-length:221\n', 'destination:/topic/builderevents\n', 'persistent:true\n', '\n', '{"body": "Got build request...", "first_message": true, "job_id": "psygenet2r_20151221153124", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz"}', '\x00']
INFO: 12/21/2015 12:31:21 PM server.py - Reply sent
DEBUG: 12/21/2015 12:31:21 PM transport.py - Received frame: 'MESSAGE', headers={'content-length': '331', 'expires': '0', 'timestamp': '1450730145147', 'destination': '/topic/buildjobs', 'priority': '4', 'message-id': 'ID:broker-46292-1448469945488-2:552:-1:1:1', 'subscription': 'c7bce2b6036c4a999926bc256ca85c66'}, body='{"force": true, "job_id": "psygenet2r_20151221153124", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6645/psygenet2r_0.9.0.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz", "time": "Mon Dec 21 2015 15:31:24 GMT-0800 (PST)"}'
DEBUG: 12/21/2015 12:33:36 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:38:36 PM server.py - main() Waiting to do work.
DEBUG: 12/21/2015 12:43:36 PM server.py - main() Waiting to do work.

There are no defunct processes (I'm not sure if this construct exists in Windows). Also, server.py is still running. The build seems to succeed, and we can see that by navigating to the directory E:\packagebuilder\workers\jobs\psygenet2r_20151221153124.
A log file named builder.log contains the output below :

INFO: 12/21/2015 12:40:28 PM builder.py - Normal build completion, Build successful..

Also, the file E:\packagebuilder\workers\jobs\psygenet2r_2015122115312\psygenet2r_0.9.0.tar.gz is created.

I'm beginning investigation now. It may also be worth noting that the only sub-process started directly by server.py is builder.py. However, many sub-processes are started by builder.py.

blong@work:~/Documents/Work/REPOS__git/b-long/packagebuilder$ git grep "subprocess\.call\|subprocess\.Popen" 
workers/builder.py:            description = subprocess.Popen([
workers/builder.py:    packages = subprocess.Popen(["curl", "-k", "-s", repository_url],
workers/builder.py:    retcode = subprocess.call(svn_cmd, shell=True)
workers/builder.py:    retcode = subprocess.call("tar %s -zxf %s.orig" %
workers/builder.py:    retcode = subprocess.call(cmd, shell=True)
workers/builder.py:    pope = subprocess.Popen(cmd, stdout=out_fh, stderr=subprocess.STDOUT,
workers/builder.py:    pope  = subprocess.Popen(cmd, stdout=out_fh, stderr=subprocess.STDOUT,
workers/builder.py:    svn_info = subprocess.Popen([
workers/builder.py:        retcode = subprocess.call(command)
workers/builder.py:        chmod_retcode = subprocess.call(
workers/builder.py:        retcode = subprocess.call(command)
workers/builder.py:        remote_chmod_retcode = subprocess.call(command)
workers/builder.py:        retcode = subprocess.call(command_str, shell=False, stdout=stdout_fh,
workers/builder.py:        return(subprocess.call(command_str, shell=shell))
workers/builder.py:    retcode = subprocess.call(command, shell=True)
workers/builder.py:        retcode = subprocess.call(command, shell=True)
workers/builder.py:    r_version_raw, stderr = subprocess.Popen([
workers/builder.py:            description = subprocess.Popen([
workers/server.py:            blderProcess = subprocess.Popen(shell_cmd, stdout=builder_log, stderr=builder_log)
b-long commented 8 years ago

It's unclear why the work done by server.py is suddenly not reported back to staging.

On zin2 seem to indicate messages aren't received at all ( no log output from on_before_message and the log entry "Message received" was not written).

On moscato2 messages are received, and response is sent :

DEBUG: 12/21/2015 12:31:21 PM server.py - on_send() SEND {'content-length': 221, 'destination': '/topic/builderevents', 'persistent': 'true'} {"body": "Got build request...", "first_message": true, "job_id": "psygenet2r_20151221153124", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz"}.
DEBUG: 12/21/2015 12:31:21 PM transport.py - Sending frame ['SEND', '\n', 'content-length:221\n', 'destination:/topic/builderevents\n', 'persistent:true\n', '\n', '{"body": "Got build request...", "first_message": true, "job_id": "psygenet2r_20151221153124", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1329:psygenet2r_0.9.0.tar.gz"}', '\x00']
INFO: 12/21/2015 12:31:21 PM server.py - Reply sent

However , track_build_completion.log on staging does not contain this message. Other instances of first_message do exist in track_build_completion.log:

biocadmin@ip-172-30-0-180:~/packagebuilder (changes-on-staging)$ grep first_message track_build_completion.log |tail -n 4
DEBUG: 12/18/2015 07:58:59 PM transport.py - Received frame: 'MESSAGE', headers={'content-length': '217', 'expires': '0', 'timestamp': '1450486995561', 'destination': '/topic/builderevents', 'persistent': 'true', 'priority': '4', 'message-id': 'ID:broker-46292-1448469945488-2:512:-1:1:7', 'subscription': 'e85cb19c7b93443b956386c3e501d5d6'}, body='{"body": "Got build request...", "first_message": true, "job_id": "isomiRs_20151218165858", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1108:isomiRs_0.99.11.tar.gz"}'
DEBUG: 12/19/2015 04:17:59 PM track_build_completion.py - on_before_message() {'content-length': '234', 'expires': '0', 'timestamp': '1450560136609', 'destination': '/topic/builderevents', 'persistent': 'true', 'priority': '4', 'message-id': 'ID:broker-46292-1448469945488-2:542:-1:1:1', 'subscription': 'e85cb19c7b93443b956386c3e501d5d6'} {"body": "Got build request...", "first_message": true, "job_id": "miRNAmeConverter_20151219131757", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1345:miRNAmeConverter_0.99.4.tar.gz"}.
INFO: 12/19/2015 04:17:59 PM track_build_completion.py - Received stomp message: {"body": "Got build request...", "first_message": true, "job_id": "miRNAmeConverter_20151219131757", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1345:miRNAmeConverter_0.99.4.tar.gz"}
DEBUG: 12/19/2015 04:17:59 PM transport.py - Received frame: 'MESSAGE', headers={'content-length': '234', 'expires': '0', 'timestamp': '1450560136609', 'destination': '/topic/builderevents', 'persistent': 'true', 'priority': '4', 'message-id': 'ID:broker-46292-1448469945488-2:542:-1:1:1', 'subscription': 'e85cb19c7b93443b956386c3e501d5d6'}, body='{"body": "Got build request...", "first_message": true, "job_id": "miRNAmeConverter_20151219131757", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1345:miRNAmeConverter_0.99.4.tar.gz"}'
b-long commented 8 years ago

The branch patch/31 is deployed to production. A new build of "psygenet2r" has run, and can be viewed here: http://staging.bioconductor.org:8000/job/553/ . One assumes succeeding builds will continue to work properly, but we should monitor the situation.

FYI @mtmorgan @jimhester @vobencha @priscian @hpages

b-long commented 8 years ago

Additional builds kicked off immediately afterwards seem to propagate the SPB without issue :

b-long commented 8 years ago

Started more investigation this morning, navigating to : http://staging.bioconductor.org:8000/job/556/ . Initially this gave a 404, indicating that 555 was the last build to execute. Next, I logged into staging and started a build using tracker ID 1343 and package https://tracker.bioconductor.org/file6757/cellTree_0.99.3.tar.gz . Similar to the symptoms seen yesterday a job page is created at http://staging.bioconductor.org:8000/job/556/ , however output from the build nodes is never displayed.

Again, I check the server.log file for both build nodes. The results seen on zin2 are similar to results as observed yesterday. However, results seen on moscato2 are now in-line with zin2 .

On zin2

The last message to build a job is from 12/21:

pkgbuild@zin2:~/packagebuilder$ grep -F "server.py" server.log |grep -v "main()" | tail -n 11
INFO: 12/21/2015 04:55:11 PM server.py - Reply sent
DEBUG: 12/21/2015 04:56:20 PM server.py - on_before_message() {'content-length': '330', 'expires': '0', 'timestamp': '1450746043759', 'destination': '/topic/buildjobs', 'priority': '4', 'message-id': 'ID:broker-46292-1448469945488-2:564:-1:1:1', 'subscription': '93160f48732c485f8173febf386e494b'} {"force": true, "job_id": "globalSeq_20151221195622", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6770/globalSeq_0.99.4.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1343:globalSeq_0.99.4.tar.gz", "time": "Mon Dec 21 2015 19:56:22 GMT-0800 (PST)"}.
INFO: 12/21/2015 04:56:20 PM server.py - on_message() Message received
INFO: 12/21/2015 04:56:20 PM server.py - on_message() Message acknowledged
INFO: 12/21/2015 04:56:20 PM server.py - on_message() job_dir: '/home/pkgbuild/packagebuilder/workers/jobs/globalSeq_20151221195622'.
INFO: 12/21/2015 04:56:20 PM server.py - on_message() Attempting to run commands from directory: '/home/pkgbuild/packagebuilder'
INFO: 12/21/2015 04:56:20 PM server.py - on_message() shell_cmd: '['python', '-m', 'workers.builder', u'/home/pkgbuild/packagebuilder/workers/jobs/globalSeq_20151221195622/manifest.json', u'3.3']'
INFO: 12/21/2015 04:56:20 PM server.py - on_message() jobfilename: '/home/pkgbuild/packagebuilder/workers/jobs/globalSeq_20151221195622/manifest.json'
INFO: 12/21/2015 04:56:20 PM server.py - on_message() builder_log: '<open file u'/home/pkgbuild/packagebuilder/workers/jobs/globalSeq_20151221195622/builder.log', mode 'w' at 0x7fda2464a540>'
DEBUG: 12/21/2015 04:56:20 PM server.py - on_send() SEND {'content-length': 216, 'destination': '/topic/builderevents', 'persistent': 'true'} {"body": "Got build request...", "first_message": true, "job_id": "globalSeq_20151221195622", "bioc_version": "3.3", "builder_id": "zin2", "client_id": "single_package_builder_autobuild:1343:globalSeq_0.99.4.tar.gz"}.
INFO: 12/21/2015 04:56:20 PM server.py - Reply sent
pkgbuild@zin2:~/packagebuilder$ 

Viewing the currently running Python processes shows a defunct process:

pkgbuild@zin2:~/packagebuilder$ ps ux |grep python |grep -v grep
pkgbuild 25101  0.0  0.0 137360  1912 ?        Sl   Dec21   0:00 python -m workers.server
pkgbuild 27739  0.0  0.0      0     0 ?        Z    Dec21   0:00 [python] <defunct>
On moscato2

The last message to build a job is from 12/21:

$ grep -F "server.py" server.log |grep -v "main()" | tail -n 11
INFO: 12/21/2015 04:55:11 PM server.py - Reply sent
DEBUG: 12/21/2015 04:56:20 PM server.py - on_before_message() {'content-length': '330', 'expires': '0', 'timestamp': '1450746043759', 'destination': '/topic/buildjobs', 'priority': '4', 'mes
sage-id': 'ID:broker-46292-1448469945488-2:564:-1:1:1', 'subscription': '440222e574334ac1a9256e48b2e2c80a'} {"force": true, "job_id": "globalSeq_20151221195622", "repository": "scratch", "bi
oc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6770/globalSeq_0.99.4.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1343:globalSeq_0.99.4.t
ar.gz", "time": "Mon Dec 21 2015 19:56:22 GMT-0800 (PST)"}.
INFO: 12/21/2015 04:56:20 PM server.py - on_message() Message received
INFO: 12/21/2015 04:56:20 PM server.py - on_message() Message acknowledged
INFO: 12/21/2015 04:56:20 PM server.py - on_message() job_dir: 'E:\packagebuilder\workers\jobs\globalSeq_20151221195622'.
INFO: 12/21/2015 04:56:20 PM server.py - on_message() Attempting to run commands from directory: 'E:\packagebuilder'
INFO: 12/21/2015 04:56:20 PM server.py - on_message() shell_cmd: '['python', '-m', 'workers.builder', u'E:\\packagebuilder\\workers\\jobs\\globalSeq_20151221195622\\manifest.json', u'3.3']'
INFO: 12/21/2015 04:56:20 PM server.py - on_message() jobfilename: 'E:\packagebuilder\workers\jobs\globalSeq_20151221195622\manifest.json'
INFO: 12/21/2015 04:56:20 PM server.py - on_message() builder_log: '<open file u'E:\\packagebuilder\\workers\\jobs\\globalSeq_20151221195622\\builder.log', mode 'w' at 0x018FEB20>'
DEBUG: 12/21/2015 04:56:20 PM server.py - on_send() SEND {'content-length': 220, 'destination': '/topic/builderevents', 'persistent': 'true'} {"body": "Got build request...", "first_message"
: true, "job_id": "globalSeq_20151221195622", "bioc_version": "3.3", "builder_id": "moscato2", "client_id": "single_package_builder_autobuild:1343:globalSeq_0.99.4.tar.gz"}.
INFO: 12/21/2015 04:56:20 PM server.py - Reply sent

The build server is still running, and shows 3 threads: screenshot from 2015-12-22 10 17 38

screenshot from 2015-12-22 10 18 05

b-long commented 8 years ago

After the recent deployments, it seems that the SPB is continuing to drop communication capability at some point.

On zin2

It would appear that the builds actually complete (although not successfully) :

pkgbuild@zin2:~/packagebuilder$ grep -n build_package workers/jobs/JunctionSeq_20151222131237/builder.log workers/jobs/JunctionSeq_20151222133405/builder.log workers/jobs/IHW_20151222112042/builder.log
workers/jobs/JunctionSeq_20151222131237/builder.log:48:INFO: 12/22/2015 10:13:35 AM builder.py - build_package() finished with result 1
workers/jobs/JunctionSeq_20151222133405/builder.log:48:INFO: 12/22/2015 10:35:06 AM builder.py - build_package() finished with result 1
workers/jobs/IHW_20151222112042/builder.log:50:INFO: 12/22/2015 11:27:54 AM builder.py - build_package() finished with result 0
On moscato2

The same thing appears to be happening:

user@moscato2 /cygdrive/e/packagebuilder
$ grep -n build_package workers/jobs/JunctionSeq_20151222131237/builder.log workers/jobs/JunctionSeq_20151222133405/builder.log workers/jobs/IHW_20151222112042/builder.log
workers/jobs/JunctionSeq_20151222131237/builder.log:49:INFO: 12/22/2015 10:15:06 AM builder.py - build_package() finished with result 1
workers/jobs/JunctionSeq_20151222133405/builder.log:49:INFO: 12/22/2015 10:36:42 AM builder.py - build_package() finished with result 1
workers/jobs/IHW_20151222112042/builder.log:50:INFO: 12/22/2015 11:25:26 AM builder.py - build_package() finished with result 0

In both cases, server.py is still running. There are no defunct processes, and new builds continue to be received and launched :

pkgbuild@zin2:~/packagebuilder$ grep JunctionSeq_2015122212050 server.log -B 3 -A 3
DEBUG: 12/22/2015 11:53:45 AM server.py - main() Waiting to do work.
DEBUG: 12/22/2015 11:58:45 AM server.py - main() Waiting to do work.
DEBUG: 12/22/2015 12:03:45 PM server.py - main() Waiting to do work.
DEBUG: 12/22/2015 12:05:02 PM server.py - on_before_message() {'expires': '0', 'timestamp': '1450814968390', 'destination': '/topic/buildjobs', 'persistent': 'true', 'priority': '4', 'x-client': '140.107.89.113', 'message-id': 'ID:broker-46292-1448469945488-2:614:-1:1:1', 'subscription': '892b7145a09c4b36b3b802addbc7b22d'} {"force": true, "job_id": "JunctionSeq_20151222120502", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6784/JunctionSeq_0.99.3.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1341:JunctionSeq_0.99.3.tar.gz", "time": "Tue Dec 22 2015 12:05:02 GMT-0700 (PST)"}.
INFO: 12/22/2015 12:05:02 PM server.py - on_message() Message received
INFO: 12/22/2015 12:05:02 PM server.py - on_message() Message acknowledged
INFO: 12/22/2015 12:05:02 PM server.py - on_message() job_dir: '/home/pkgbuild/packagebuilder/workers/jobs/JunctionSeq_20151222120502'.
INFO: 12/22/2015 12:05:02 PM server.py - on_message() Attempting to run commands from directory: '/home/pkgbuild/packagebuilder'
INFO: 12/22/2015 12:05:02 PM server.py - on_message() shell_cmd: '['python', '-m', 'workers.builder', u'/home/pkgbuild/packagebuilder/workers/jobs/JunctionSeq_20151222120502/manifest.json', u'3.3']'
INFO: 12/22/2015 12:05:02 PM server.py - on_message() jobfilename: '/home/pkgbuild/packagebuilder/workers/jobs/JunctionSeq_20151222120502/manifest.json'
INFO: 12/22/2015 12:05:02 PM server.py - on_message() builder_log: '<open file u'/home/pkgbuild/packagebuilder/workers/jobs/JunctionSeq_20151222120502/builder.log', mode 'w' at 0x7f4663cce540>'
DEBUG: 12/22/2015 12:05:02 PM server.py - on_send() SEND {'content-length': 220, 'destination': '/topic/builderevents', 'persistent': 'true'} {"body": "Got build request...", "first_message": true, "job_id": "JunctionSeq_20151222120502", "bioc_version": "3.3", "builder_id": "zin2", "client_id": "single_package_builder_autobuild:1341:JunctionSeq_0.99.3.tar.gz"}.
DEBUG: 12/22/2015 12:05:02 PM transport.py - Sending frame ['SEND', '\n', 'content-length:220\n', 'destination:/topic/builderevents\n', 'persistent:true\n', '\n', '{"body": "Got build request...", "first_message": true, "job_id": "JunctionSeq_20151222120502", "bioc_version": "3.3", "builder_id": "zin2", "client_id": "single_package_builder_autobuild:1341:JunctionSeq_0.99.3.tar.gz"}', '\x00']
INFO: 12/22/2015 12:05:02 PM server.py - on_message() Reply sent
DEBUG: 12/22/2015 12:08:45 PM server.py - main() Waiting to do work.
INFO: 12/22/2015 12:12:53 PM server.py - on_message() blderProcess finished with status 0.
INFO: 12/22/2015 12:12:53 PM server.py - on_message() builder_log closed.
DEBUG: 12/22/2015 12:12:53 PM transport.py - Received frame: 'MESSAGE', headers={'expires': '0', 'timestamp': '1450814968390', 'destination': '/topic/buildjobs', 'persistent': 'true', 'priority': '4', 'x-client': '140.107.89.113', 'message-id': 'ID:broker-46292-1448469945488-2:614:-1:1:1', 'subscription': '892b7145a09c4b36b3b802addbc7b22d'}, body='{"force": true, "job_id": "JunctionSeq_20151222120502", "repository": "scratch", "bioc_version": "3.3", "svn_url": "https://tracker.bioconductor.org/file6784/JunctionSeq_0.99.3.tar.gz", "r_version": "3.3", "client_id": "single_package_builder_autobuild:1341:JunctionSeq_0.99.3.tar.gz", "time": "Tue Dec 22 2015 12:05:02 GMT-0700 (PST)"}'
DEBUG: 12/22/2015 12:13:45 PM server.py - main() Waiting to do work.
DEBUG: 12/22/2015 12:18:45 PM server.py - main() Waiting to do work.

Perhaps the issue is just in sending information back, I'm going to start looking at that code now.

b-long commented 8 years ago

Currently, both zin2 and moscato2 send a message like :

on_send SEND {'content-length': 316, 'destination': '/topic/builderevents', 'persistent': 'true'} {"body": "builder.py exited", "status": "autoexit", "job_id": "spbtest_20151223120932", "builder_id": "zin2", "svn_url": "https://tracker.bioconductor.org/file6786/spbtest_0.99.3.tar.gz", "client_id": "single_package_builder_autobuild:1356:spbtest_0.99.3.tar.gz", "time": "2015-12-23 09:10:23.293080", "retcode": -1}

The message sent from zin2 is received while the message sent from moscato2 is not.

dtenenba commented 8 years ago

Hmm, my commit did not close the issue, but that's ok, because I wanted to expand on what I did a little bit (for @mtmorgan and @b-long ).

Not sure how it evolved. but builder.py uses an exit handler (https://docs.python.org/2/library/atexit.html) which is sort of like a finalizer or on.exit() in R, which allows you to register a 'cleanup' function that will run whenever the interpreter exits.

In builder.py, the sending of the final 'everything is done' message (the one that track_build_completion.py is waiting for before it decides that a build is done on a node) is done in this cleanup function (called onexit()).

I added a sleep at the end of the function. Previously, the call to send_message() was the last line in the function, so it was called right before exiting the interpreter. My theory is that this happened too fast somehow on windows and send_message() didn't get to complete.

Does not seem to be a issue on linux. Hard to tell if this is a real issue that should be filed somewhere. Not sure if it is a problem in python, windows, or stomp. Anyway, now those final messages are coming through on windows and builds are (so far) completing.

b-long commented 8 years ago

I've just used rerun_build.py to execute the following tests, and all have passed:

After that, @priscian used rerun_build.py to execute a build of https://tracker.bioconductor.org/issue1351 . Unfortunately, this build did not succeed . As you can see on the spb_history page (http://staging.bioconductor.org:8000/job/615/), output is missing for large portion of the zin2 build.

I've found an error in the builder.log file on zin2, which I believe is related :

on_send SEND {'content-length': 216, 'destination': '/topic/builderevents', 'persistent': 'true'} {"status": "starting_check", "body": "", "job_id": "DNAshapeR_20151230095500", "client_id": "s
ingle_package_builder_autobuild:1351:DNAshapeR_0.99.0.tar.gz", "time": "2015-12-30 06:55:50.903123", "builder_id": "zin2"}
on_send SEND {'content-length': 399, 'destination': '/topic/builderevenINFO: 12/30/2015 06:55:50 AM transport.py - Sending frame cmd='SEND' headers={'content-length': 399, 'destination': '/top
ic/builderevents', 'persistent': 'true'}
INFO: 12/30/2015 06:55:50 AM builder.py - send_message(): Message sent.
INFO: 12/30/2015 06:55:50 AM builder.py - Attempting to tail file Rcheck.out
INFO: 12/30/2015 06:55:51 AM builder.py - Attempting to send message: '{'status': 'checking', 'body': '* using log directory \xe2\x80\x98/home/pkgbuild/packagebuilder/workers/jobs/DNAshapeR_20
151230095500/DNAshapeR.Rcheck\xe2\x80\x99\n* using R Under development (unstable) (2015-12-14 r69775)\n* using platform: x86_64-pc-linux-gnu (64-bit)\n* using session charset: UTF-8\n* using o
ption \xe2\x80\x98--no-vignettes\xe2\x80\x99\n* checking for file \xe2\x80\x98DNAshapeR/DESCRIPTION\xe2\x80\x99 ... OK\n* checking extension type ... Package\n* this is package \xe2\x80\x98DNA
shapeR\xe2\x80\x99 version \xe2\x80\x980.99.0\xe2\x80\x99\n* checking package namespace information ... OK\n* checking package dependencies ...', 'sequence': 1}'
INFO: 12/30/2015 06:55:51 AM builder.py - msg is dict
INFO: 12/30/2015 06:55:51 AM builder.py - merged_dict: '{'status': 'checking', 'body': '* using log directory \xe2\x80\x98/home/pkgbuild/packagebuilder/workers/jobs/DNAshapeR_20151230095500/DN
AshapeR.Rcheck\xe2\x80\x99\n* using R Under development (unstable) (2015-12-14 r69775)\n* using platform: x86_64-pc-linux-gnu (64-bit)\n* using session charset: UTF-8\n* using option \xe2\x80\
x98--no-vignettes\xe2\x80\x99\n* checking for file \xe2\x80\x98DNAshapeR/DESCRIPTION\xe2\x80\x99 ... OK\n* checking extension type ... Package\n* this is package \xe2\x80\x98DNAshapeR\xe2\x80\
x99 version \xe2\x80\x980.99.0\xe2\x80\x99\n* checking package namespace information ... OK\n* checking package dependencies ...', 'job_id': u'DNAshapeR_20151230095500', 'client_id': u'single_
package_builder_autobuild:1351:DNAshapeR_0.99.0.tar.gz', 'time': '2015-12-30 06:55:51.305495', 'sequence': 1, 'builder_id': 'zin2'}'
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/home/pkgbuild/packagebuilder/workers/builder.py", line 110, in run
    "body": bytes
  File "/home/pkgbuild/packagebuilder/workers/builder.py", line 142, in send_message
    logging.info("Final modified body: '{body}'".format(body=body))
UnicodeEncodeError: 'ascii' codec can't encode characters in position 22-24: ordinal not in range(128)

INFO: 12/30/2015 06:58:22 AM builder.py - Attempting to send message: '{'status': 'check_complete', 'body': 'Check completed with status 1', 'elapsed_time': '0:02:31.834042', 'result_code': 1,
 'warnings': True}'
INFO: 12/30/2015 06:58:22 AM builder.py - msg is dict
INFO: 12/30/2015 06:58:22 AM builder.py - merged_dict: '{'status': 'check_complete', 'body': 'Check completed with status 1', 'job_id': u'DNAshapeR_20151230095500', 'warnings': True, 'builder_
id': 'zin2', 'elapsed_time': '0:02:31.834042', 'client_id': u'single_package_builder_autobuild:1351:DNAshapeR_0.99.0.tar.gz', 'time': '2015-12-30 06:58:22.738753', 'result_code': 1}'
INFO: 12/30/2015 06:58:22 AM builder.py - Final modified body: 'Check completed with status 1'
INFO: 12/30/2015 06:58:22 AM builder.py - Ascii encoded body: 'Check completed with status 1'

I'm investigating this now.

jimhester commented 8 years ago

Brian think you just need to make your format strings unicode.

See http://stackoverflow.com/a/22320208/2055486

b-long commented 8 years ago

@jimhester I think that's right, going to make that change and deploy for test.

jimhester commented 8 years ago

Simple test case

"{}".format(u"\u0449")
# UnicodeEncodeError: 'ascii' codec can't encode character u'\u0449' in position 0: ordinal not in range(128)

u"{}".format(u"\u0449")
# u'\u0449'
b-long commented 8 years ago

Ok, that change fixed the Traceback mentioned earlier.

I've re-run 1351 and 1328 for Jim J., and both have successfully finished all stages of the SPB's work. I'm going to merge #33 and close this issue.

dtenenba commented 8 years ago

FYI, regarding the change I made earlier on this issue (adding a sleep to the atexit finalizer), I have a minimal reproducible example that does seem to indicate that this is a problem. It seems like it just started happening after converting builder.py to use the stomp (aka stomp.py) package (it used to use stompy). So I have reported this to the author of stomp:

https://github.com/jasonrbriggs/stomp.py/issues/70

The problem doesn't seem to be with the atexit stuff either. It happens anytime you exit the interpreter right after calling send(), whether it's a normal exit (last statement in script) or facilitated by atexit.

@b-long @jimhester @mtmorgan @priscian

dtenenba commented 8 years ago

So, the bug above is resolved now. Turns out you need to disconnect from the stomp client if you are going to exit the interpreter right after sending a message:

stompClient.disconnect(receipt=None)

This obviates the need for sleeping after sending the mesage.

I've done this in 214b5371fad87570ed273068897c9f8fa4afc2ab .