In GitLab by [Gitlab user @martinblanchard] on Sep 5, 2018, 17:05
mentioned in merge request !626
In GitLab by [Gitlab user @martinblanchard] on Sep 17, 2018, 09:36
created branch mablanch/630-remote-execution-reconn
In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 08:58
changed the description
In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 08:59
mentioned in commit 213edd420539f367841743881efb3450a2c286bf
In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 14:38
mentioned in commit a1c6cba3654c2d2132042a062c4e2313e545f2ef
In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 14:54
mentioned in merge request !806
In GitLab by [Gitlab user @martinblanchard] on Sep 20, 2018, 14:08
assigned to [Gitlab user @martinblanchard]
In GitLab by [Gitlab user @jmacarthur] on Sep 20, 2018, 15:58
I'd like this task to contain some more detail on what we expect to see, and manual instructions for simulating a connection failure so we can determine whether we've improved.
In GitLab by [Gitlab user @martinblanchard] on Sep 24, 2018, 11:49
mentioned in commit 4e6037a0522eb6c5647199f924638c9892553962
In GitLab by [Gitlab user @martinblanchard] on Sep 24, 2018, 14:06
mentioned in commit a04290ad77697c22d3891e583e3b87fb91d61022
In GitLab by [Gitlab user @martinblanchard] on Sep 24, 2018, 15:12
I'd like this task to contain some more detail on what we expect to see, […]
I guess we expect the operation stream reopening to be quite unnoticeable if it succeeds (we could log an INFO-level message, though). If it fails, a standard BstError should be raised.
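Roughly, the behaviour I have in mind looks like the sketch below (illustrative only: with_reconnect, reopen_stream and SandboxError are placeholder names, not BuildStream's actual API):

import logging

import grpc

logger = logging.getLogger(__name__)


class SandboxError(Exception):
    """Stand-in for the BstError subclass the remote sandbox would raise."""


def with_reconnect(reopen_stream, attempts=3):
    """Yield Operation updates from reopen_stream(), silently reopening the
    stream on gRPC failures and raising a BstError once recovery gives up."""
    for attempt in range(1, attempts + 1):
        try:
            yield from reopen_stream()
            return
        except grpc.RpcError as error:
            # Successful recovery stays quiet apart from this INFO-level line.
            logger.info("Remote execution stream dropped (%s), reconnecting "
                        "(attempt %d/%d)", error.code(), attempt, attempts)
    raise SandboxError("Lost connection to remote execution server")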
[…] and manual instructions for simulating a connection failure so we can determine whether we've improved.
This is more difficult and requires control over the remote execution server. I do have a very hacky branch of BuildGrid that allows simulating connection drops, but I don't think we want to add this kind of thing to BuildGrid itself. I've mentioned on BuildStream/buildstream#629 that future testing work should include connection drop simulations. I'm not sure how I can give manual instructions; it would involve hacking around BuildGrid's code...
In GitLab by [Gitlab user @martinblanchard] on Sep 25, 2018, 10:12
mentioned in commit d34ce36645c7771a7a4e09d9b5d6b43c7e18075b
In GitLab by [Gitlab user @martinblanchard] on Oct 2, 2018, 13:55
mentioned in commit 4d7f83bebdbb903a8ca9cd056a081598a4bc3bb1
In GitLab by [Gitlab user @martinblanchard] on Oct 2, 2018, 13:59
[Gitlab user @jmacarthur]: I've "cleaned-up" and pushed the BuildGrid branch containing the operation cancellation here. Here is how I proceed in order to test reconnection: run
bgd operation status <operation-name>
to simulate a connection drop.
In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 10:06
When I try bgd operation status with an instance name specified, it tells me my instance name is invalid:
In GitLab by [Gitlab user @martinblanchard] on Oct 3, 2018, 12:25
Hummm, I can't reproduce this; can you share your server configuration file?
I've rebased the branch onto a recent BuildGrid and here is the configuration I use:
server:
  - !channel
    port: 50051
    insecure_mode: true

description: |
  A single-instance server with on-disk storage.

instances:
  - name: ''
    description: |
      The main instance

    storages:
      - !disk-storage &main-storage
        path: !expand-path $HOME/.cache/buildgrid/storage

    services:
      - !action-cache &main-action
        storage: *main-storage
        max_cached_refs: 256
        allow_updates: true

      - !execution
        storage: *main-storage
        action_cache: *main-action

      - !cas
        storage: *main-storage

      - !reference-cache
        storage: *main-storage
        max_cached_refs: 128

      - !bytestream
        storage: *main-storage
In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 13:30
Here's my configuration:
server:
  - !channel
    port: 50051
    insecure_mode: true
    credentials:
      tls-server-key: null
      tls-server-cert: null
      tls-client-certs: null

description: |
  A single default instance

instances:
  - name: ''
    description: |
      The main server

    storages:
      - !disk-storage &main-storage
        path: /home/jimmacarthur/bs/server/repo/cas/objects

    services:
      - !action-cache &main-action
        storage: *main-storage
        max_cached_refs: 256
        allow_updates: true

      - !execution
        storage: *main-storage
        action_cache: *main-action

      - !cas
        storage: *main-storage

      - !bytestream
        storage: *main-storage
I think it's identical to yours, except that I specify null for the TLS keys and I don't have a reference-cache.
I've updated to the latest version of your branch; unfortunately, I get yet another unrelated error with BuildGrid when I try that:
buildstream.sandbox._sandboxremote.SandboxError: Could not get action from storage. (FAILED_PRECONDITION).
In GitLab by [Gitlab user @martinblanchard] on Oct 3, 2018, 13:40
Am I right in assuming that your BuildStream project uses a separate artifact server?
If so, you'd have to point the BuildGrid services at it using something like:
storages:
  - !remote-storage &main-storage
    url: "http://localhost:50052"
    instance_name: ''
    credentials:
      tls-client-key: null
      tls-client-cert: null
      tls-server-cert: null
Alternatively, you could change your BuildStream project configuration to use BuildGrid's CAS (you'd have to set up a reference-cache).
In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 13:42
That is correct, but my configuration has always previously worked.
In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 14:14
The "could not get action from storage" error is because we're both adding on 'cas/objects' - my configuration file adds it and bgd adds it so it ends up there twice. Now I'm back to the original error - it's still treating the job name as if it were an instance name.
In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 14:59
This looks suspicious:
https://gitlab.com/BuildGrid/buildgrid/blob/master/buildgrid/server/operations/service.py#L132
An empty string is a valid instance name, so checking with if not instance_name looks wrong. I'm not sure where add_instance is called, though.
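A minimal illustration of the problem (not BuildGrid's actual code, just hypothetical lookup helpers):

# '' is a legitimate instance name, so a truthiness check conflates
# "no instance given" with "the default, empty-named instance".
instances = {'': 'the default instance'}

def get_instance_buggy(instance_name):
    if not instance_name:          # also true for '', so the default gets rejected
        raise ValueError("Invalid instance name: {!r}".format(instance_name))
    return instances[instance_name]

def get_instance_fixed(instance_name):
    if instance_name is None:      # only rejects a genuinely missing name
        raise ValueError("Missing instance name")
    try:
        return instances[instance_name]
    except KeyError:
        raise ValueError("Invalid instance name: {!r}".format(instance_name))

# get_instance_buggy('')  -> ValueError, even though '' is registered
# get_instance_fixed('')  -> 'the default instance'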
In GitLab by [Gitlab user @jmacarthur] on Oct 5, 2018, 10:03
I've patched my copy of BuildGrid to use an empty instance name in all cases. bgd operation status <job id> now works correctly, but it doesn't cause the connection to drop - or at least __run_remote_command doesn't return.
Out of interest, why would bgd operation status cause the connection to drop?
In GitLab by [Gitlab user @martinblanchard] on Oct 5, 2018, 10:08
I've patched my copy of BuildGrid to use an empty instance name in all cases. bgd operation status <job id> now works correctly, but it doesn't cause the connection to drop - or at least __run_remote_command doesn't return.
Ok, nice. I'm still not able to reproduce your problem on my machine, though...
Out of interest, why would bgd operation status cause the connection to drop?
It won't, unless you are running the modified version of BuildGrid from that particular branch. It contains a hack that forces the server to cancel all open gRPC channels for a given operation when it receives a GetOperation() call (which is what bgd operation status uses).
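In case it helps, the hack works roughly like the sketch below (illustrative only, not the code on that branch; register_peer and drop_peers are made-up names): the Execute()/WaitExecution() servicers register each client's streaming context against the operation, and GetOperation() cancels those contexts, which the client sees as a dropped stream.

import collections

# operation name -> list of open server-side streaming contexts
_peers = collections.defaultdict(list)


def register_peer(operation_name, context):
    """Called from the Execute()/WaitExecution() servicer for each open stream."""
    _peers[operation_name].append(context)
    # Drop the bookkeeping entry once the RPC terminates for any reason:
    context.add_callback(lambda: _peers[operation_name].remove(context))


def drop_peers(operation_name):
    """Called from the hacked GetOperation() handler to simulate a network failure."""
    for context in list(_peers[operation_name]):
        context.cancel()    # grpc.ServicerContext.cancel() aborts the open stream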
In GitLab by [Gitlab user @martinblanchard] on Oct 22, 2018, 11:59
mentioned in commit 3669fd4451f3e8315d84b512a0737cc7257d8f93
In GitLab by [Gitlab user @martinblanchard] on Oct 23, 2018, 11:01
mentioned in commit 797d30106d9884a4b2688a65fa88e7204b55fce8
In GitLab by [Gitlab user @martinblanchard] on Oct 23, 2018, 11:54
mentioned in commit aa0cbf5dfb8e68674c60efe841cf862d6e5cdea6
In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:25
marked the task Handle network failures while polling on Operation status. as completed
In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:25
marked the task Try to reconnect when such a failure happens. as completed
In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:28
Connection recovery support is now merged: BuildStream/buildstream!806.
In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:30
marked the task Resume Operation status polling if reconnection succeeds. as completed
In GitLab by [Gitlab user @toscalix] on Nov 1, 2018, 09:41
closed
In GitLab by [Gitlab user @martinblanchard] on Sep 5, 2018, 17:04
Background
The initial remote-execution implementation (BuildStream/buildstream!626) does not handle gRPC connection failures during long-running Operation execution, nor does it try to catch up with Operation execution states after a potential reconnection. The REAPI declares a WaitExecution() call that should help reopen an Operation stream given an Operation name.
Task description
Implementation should include:
- Handle network failures while polling on Operation status.
- Try to reconnect when such a failure happens.
- Resume Operation status polling if reconnection succeeds.
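A rough sketch of the intended recovery path, assuming the generated REAPI modules from the remote-apis protocol (the run_remote_command name and the error handling here are illustrative, not the actual implementation):

import grpc

# In BuildStream the generated REAPI modules are vendored under _protos;
# adjust the import path for other setups.
from buildstream._protos.build.bazel.remote.execution.v2 import (
    remote_execution_pb2, remote_execution_pb2_grpc)


def run_remote_command(channel, instance_name, action_digest):
    """Run an action remotely, reopening the Operation stream with
    WaitExecution() if the connection drops along the way."""
    stub = remote_execution_pb2_grpc.ExecutionStub(channel)
    request = remote_execution_pb2.ExecuteRequest(instance_name=instance_name,
                                                  action_digest=action_digest,
                                                  skip_cache_lookup=False)
    stream = stub.Execute(request)
    operation_name = None

    while True:
        try:
            for operation in stream:
                operation_name = operation.name   # remembered for WaitExecution()
                if operation.done:
                    return operation
        except grpc.RpcError:
            if operation_name is None:
                raise  # the connection dropped before the Operation was even named
            # Reopen a stream onto the same long-running Operation and resume polling:
            wait_request = remote_execution_pb2.WaitExecutionRequest(name=operation_name)
            stream = stub.WaitExecution(wait_request)
            continue
        # The stream ended without a final (done) update; nothing left to poll.
        raise RuntimeError("Remote execution stream closed unexpectedly")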