Bioconductor / BBS

The Bioconductor Build System

More tolerance for network errors pushing build products to primary builders #380

Open jwokaty opened 9 months ago

jwokaty commented 9 months ago

Not all build products are being sent from kjohnson3 to nebbiolo1. Checking the tail of the install-push.log:

ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------
2023-12-04 18:45:55 -0500 (Mon, 04 Dec 2023)
nb_jobs_completed_since_last_push: 10
push command: /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install

ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------
LAST PUSH!
2023-12-04 18:45:55 -0500 (Mon, 04 Dec 2023)
nb_jobs_completed_since_last_push: 2
push command: /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install

ssh: connect to host 155.52.47.135 port 22: Network is unreachable^M
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at /AppleInternal/Library/BuildRoots/d9889869-120b-11ee-b796-7a03568b17ac/Library/Caches/com.apple.xbs/Sources/rsync/rsync/io.c(453) [sender=2.6.9]
-----------------------------------------------

If I run /usr/bin/rsync --rsh 'ssh -F /Users/biocbuild/.ssh/config' -av /Users/biocbuild/bbs-3.19-bioc-mac-arm64/products-out/install/ biocbuild@nebbiolo1:/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install, I am able to push the remaining products.

On nebbiolo1, we see errors like the following in the postrun.log when this error happens:

BBS> [make_all_LeafReports] Current working dir '/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/report'
BBS> [make_all_LeafReports] Creating report package subfolders and populating them with index.html files ... OK
BBS> [make_node_LeafReports] Node kjohnson3: BEGIN ...
Traceback (most recent call last):
  File "/home/biocbuild/BBS/BBS-report.py", line 2200, in <module>
    make_all_LeafReports(allpkgs, allpkgs_inner_rev_deps,
  File "/home/biocbuild/BBS/BBS-report.py", line 1867, in make_all_LeafReports
    make_node_LeafReports(allpkgs, node, long_link)
  File "/home/biocbuild/BBS/BBS-report.py", line 1758, in make_node_LeafReports
    make_LeafReport(leafreport_ref, allpkgs, long_link)
  File "/home/biocbuild/BBS/BBS-report.py", line 1732, in make_LeafReport
    write_Summary_asHTML(out, node_hostname, pkg, node_id, stage)
  File "/home/biocbuild/BBS/BBS-report.py", line 1382, in write_Summary_asHTML
    shutil.copyfile(filepath, dest)
  File "/usr/lib/python3.10/shutil.py", line 254, in copyfile
    with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: '/home/biocbuild/public_html/BBS/3.19/bioc-mac-arm64/products-in/kjohnson3/install/ADaCGH2.install-summary.dcf'

I haven't looked at the code that performs the push, but maybe it should wait a little longer for the network disturbance to resolve, then try again. If it still fails to rsync all products after X attempts, it should send a notification.
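One way the retry-and-notify idea could look, as a minimal Python sketch. It has not been checked against the actual BBS push code: `push_with_retry`, `max_attempts`, `base_delay` and `notify` are hypothetical names, and the `run`/`sleep` parameters exist only so the loop can be exercised without a network.

```python
import subprocess
import time


def push_with_retry(cmd, max_attempts=5, base_delay=30.0,
                    run=subprocess.run, sleep=time.sleep, notify=print):
    """Run an rsync push command, retrying with exponential backoff.

    Hypothetical helper, not part of BBS. `cmd` is an argument list such
    as ["/usr/bin/rsync", "-av", src, dest]. Returns True as soon as the
    command exits with status 0, False once max_attempts have all failed.
    """
    for attempt in range(1, max_attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return True
        if attempt < max_attempts:
            # With the defaults: wait 30 s, 60 s, 120 s, ... between tries.
            sleep(base_delay * 2 ** (attempt - 1))
    # Every attempt failed: report it instead of giving up silently.
    notify("push failed after %d attempts: %s" % (max_attempts, cmd))
    return False
```

Re-running the same rsync command is safe because rsync only transfers what is missing or changed on the destination, which is also why the manual re-run above was able to push the remaining products.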

jwokaty commented 9 months ago

20231207 run log for kjohnson3:

BBS> ==============================================================
BBS>   (Re)make BBS_CENTRAL_BASEURL/products-in/kjohnson3/... OK
BBS> [STAGE2] STARTING STAGE2 at Thu Dec  7 23:16:08 2023
Traceback (most recent call last):
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1257, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1303, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1252, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 1012, in _send_output
    self.send(msg)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 952, in send
    self.connect()
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/http/client.py", line 923, in connect
    self.sock = self._create_connection(
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 843, in create_connection
    raise err
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/socket.py", line 831, in create_connection
    sock.connect(sa)
OSError: [Errno 51] Network is unreachable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/biocbuild/BBS/BBS-run.py", line 811, in <module>
    STAGE2()
  File "/Users/biocbuild/BBS/BBS-run.py", line 423, in STAGE2
    waitForTargetRepoToBeReady()
  File "/Users/biocbuild/BBS/BBS-run.py", line 218, in waitForTargetRepoToBeReady
    f = urllib.request.urlopen(PACKAGES_url)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 517, in open
    response = self._open(req, data)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 494, in _call_chain
    result = func(*args)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1375, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.9/lib/python3.9/urllib/request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 51] Network is unreachable>

hpages commented 9 months ago

Let's first try to figure out what's going on between kjohnson3 and nebbiolo1. Communication between machines located on the internal network at DFCI has been flawless so far, so it's kind of surprising that kjohnson3 would not be able to communicate with nebbiolo1 reliably.

On our side, we could probably try to improve the situation by configuring kjohnson3 like kunpeng2, that is, by setting export BBS_PRODUCT_TRANSMISSION_MODE="none".

With this mode the machine doesn't send back the build products at all. This means rsync will no longer be needed on kjohnson3 and the machine will no longer need to use SSH keys to access the central node. Instead the central node will be in charge of retrieving the build products from kjohnson3, by calling rsync at regular intervals (e.g. every hour) like we do right now to retrieve the build products from kunpeng2.

This should be a lot more robust to network instabilities because what can't be retrieved by a call to rsync will be retrieved by a later call to rsync when the network is back.
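Why a periodic pull heals itself can be illustrated with a small Python sketch of what one pull pass might look like on the central node. The actual retrieval script used for kunpeng2 is not shown in this thread, so everything here is an assumption: `pull_once`, the `sources` mapping and its layout are hypothetical. Only the `rsync -av` invocation style is taken from the push command in the logs above.

```python
import subprocess


def pull_once(sources, run=subprocess.run):
    """Make one pass over all build nodes and pull their products.

    `sources` maps a node name to a (remote_src, local_dest) pair
    (hypothetical layout). A node that cannot be reached is recorded and
    skipped. Because rsync only copies what is missing or changed, the
    next scheduled pass picks up whatever this one missed.
    """
    failed = []
    for node, (remote_src, local_dest) in sorted(sources.items()):
        cmd = ["rsync", "-av", remote_src, local_dest]
        if run(cmd).returncode != 0:
            failed.append(node)
    return failed  # names of nodes to look at if they keep failing
```

Scheduled at a regular interval (for instance hourly), a transient outage delays the products by one or more intervals but does not lose them, and no retry logic is needed on the build node itself.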

It won't solve the waitForTargetRepoToBeReady() error that occurred on Dec 7 at the beginning of STAGE2 though, but it will be a start.
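For that error, the traceback shows `urllib.request.urlopen(PACKAGES_url)` raising `URLError` straight out of `waitForTargetRepoToBeReady()`. A minimal sketch of tolerating it, assuming a bounded number of retries is acceptable at the start of STAGE2. `open_url_with_retry`, `attempts` and `delay` are hypothetical names, and the real function may need different handling:

```python
import time
import urllib.error
import urllib.request


def open_url_with_retry(url, attempts=10, delay=60.0,
                        opener=urllib.request.urlopen, sleep=time.sleep):
    """Open `url`, retrying when the network is temporarily unreachable.

    Hypothetical helper, not part of BBS. URLError (which wraps
    "[Errno 51] Network is unreachable") triggers a retry; the last
    error is re-raised if every attempt fails. HTTPError is a subclass
    of URLError, so HTTP error statuses are retried as well.
    """
    for attempt in range(1, attempts + 1):
        try:
            return opener(url)
        except urllib.error.URLError:
            if attempt == attempts:
                raise
            sleep(delay)
```

The call at line 218 of BBS-run.py could then become `f = open_url_with_retry(PACKAGES_url)`, leaving the rest of the function unchanged.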

But let's wait and hear what the DFCI IT folks have to say about this first.