buildstream-migration / bst-staging

GNU Lesser General Public License v2.1
0 stars 0 forks source link

Remote-execution connection recovering (technical debt) #630

Closed Cynical-Optimist closed 4 years ago

Cynical-Optimist commented 4 years ago

See original issue on GitLab In GitLab by [Gitlab user @martinblanchard] on Sep 5, 2018, 17:04

Background

Initial remote-execution implementation (BuildStream/buildstream!626) does not handle gRPC connection failures during long-running Operation execution, nor does it tries to catch-up with Operation execution states from a potential reconnection.

The REAPI declares a WaitExecution() call that should help reopening an Operation stream given an Operation name.

Task description

Implementation should include:

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 5, 2018, 17:05

mentioned in merge request !626

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 17, 2018, 09:36

created branch mablanch/630-remote-execution-reconn

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 08:58

changed the description

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 08:59

mentioned in commit 213edd420539f367841743881efb3450a2c286bf

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 14:38

mentioned in commit a1c6cba3654c2d2132042a062c4e2313e545f2ef

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 18, 2018, 14:54

mentioned in merge request !806

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 20, 2018, 14:08

assigned to [Gitlab user @martinblanchard]

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jmacarthur] on Sep 20, 2018, 15:58

I'd like this task to contain some more detail on what we expect to see, and manual instructions for simulating a connection failure so we can determine whether we've improved.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 24, 2018, 11:49

mentioned in commit 4e6037a0522eb6c5647199f924638c9892553962

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 24, 2018, 14:06

mentioned in commit a04290ad77697c22d3891e583e3b87fb91d61022

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 24, 2018, 15:12

I'd like this task to contain some more detail on what we expect to see, […]

I guess we expect the operation stream reopening to be quite unnoticeable if it succeeds (we could log an INFO level message though). If it fails, a standard BstError should be raised.

[…] and manual instructions for simulating a connection failure so we can determine whether we've improved.

This is more difficult and requires control over the remote execution server. I do have a very hacky branch of BuildGrid that allows simulating connection drops but I'm don't think we want to add this kind of thing to BuildGrid itself. I've mentioned on BuildStream/buildstream#629 that future testing work should include connection drop simulations. Not sure about how I can give manual instructions, it would involve hacking around BuildGrid's code...

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Sep 25, 2018, 10:12

mentioned in commit d34ce36645c7771a7a4e09d9b5d6b43c7e18075b

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 2, 2018, 13:55

mentioned in commit 4d7f83bebdbb903a8ca9cd056a081598a4bc3bb1

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 2, 2018, 13:59

[Gitlab user @jmacarthur] : I've "cleaned-up" and pushed the BuildGrid bracnh containing the operation cancellation here. Here is how I proceed in order to test reconnection:

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 10:06

When I try bgd operation status with an instance name specified, it tells me my instance name is invalid:

https://paste.gnome.org/pucyzevdd

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 3, 2018, 12:25

Hummm, I can't reproduce this, can you share your server configuration file?

I've rebase the branch on recent BuildGrid and here is the configuration I use:

server:
  - !channel
    port: 50051
    insecure_mode: true

description: |
  A single-instance server with on-disk storage.

instances:
  - name: ''
    description: |
      The main instance

    storages:
      - !disk-storage &main-storage
        path: !expand-path $HOME/.cache/buildgrid/storage

    services:
      - !action-cache &main-action
        storage: *main-storage
        max_cached_refs: 256
        allow_updates: true
      - !execution
        storage: *main-storage
        action_cache: *main-action

      - !cas
        storage: *main-storage
      - !reference-cache
        storage: *main-storage
        max_cached_refs: 128
      - !bytestream
        storage: *main-storage
Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 13:30

Here's my configuration:

server:
  - !channel
    port: 50051
    insecure_mode: true
    credentials:
      tls-server-key: null
      tls-server-cert: null
      tls-client-certs: null

description: |
  A single default instance

instances:
  - name: ''
    description: |
      The main server

    storages:
        - !disk-storage &main-storage
          path: /home/jimmacarthur/bs/server/repo/cas/objects

    services:
      - !action-cache &main-action
        storage: *main-storage
        max_cached_refs: 256
        allow_updates: true

      - !execution
        storage: *main-storage
        action_cache: *main-action

      - !cas
        storage: *main-storage

      - !bytestream
        storage: *main-storage

It think it's identical to yours except I specify null for the TLS keys and I don't have a reference-cache.

I've updated to the latest version of your branch, unfortunately I get yet another unrelated error with BuildGrid when I try that:

buildstream.sandbox._sandboxremote.SandboxError: Could not get action from storage. (FAILED_PRECONDITION).
Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 3, 2018, 13:40

Am I right assuming that your BuildStream project uses a separate artifacts server?

If so, then you'd have to point BuildGrid services at it using something like:

    storages:
        - !remote-storage &main-storage
          url: "http://localhost:50052"
          instance_name: ''
          credentials:
            tls-client-key: null
            tls-client-cert: null
            tls-server-cert: null

Alternatively, you could also change your BuildStream project configuration to use BuildGrid CAS (you'd have to setup a reference-cache).

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 13:42

That is correct, but my configuration has always previously worked.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 14:14

The "could not get action from storage" error is because we're both adding on 'cas/objects' - my configuration file adds it and bgd adds it so it ends up there twice. Now I'm back to the original error - it's still treating the job name as if it were an instance name.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jmacarthur] on Oct 3, 2018, 14:59

This looks suspicious:

https://gitlab.com/BuildGrid/buildgrid/blob/master/buildgrid/server/operations/service.py#L132

An empty string is a valid instance name so checking with if not instance_name looks wrong. I'm not sure where add_instance is called though.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @jmacarthur] on Oct 5, 2018, 10:03

I've patched my copy of BuildGrid to use an empty instance name in all cases. bgd operation status <job id> now works correctly, but it doesn't cause the connection to drop - or at least __run_remote_command doesn't return.

Out of interest, why would bgd operation status cause the connection to drop?

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 5, 2018, 10:08

I've patched my copy of BuildGrid to use an empty instance name in all cases. bgd operation status <job id> now works correctly, but it doesn't cause the connection to drop - or at least __run_remote_command doesn't return.

Ok nice. I'm still not able to reproduce your problem on my machine though...

Out of interest, why would bgd operation status cause the connection to drop?

It won't, except if you are running the modified version of BuildGrid from that particular branch. It contain a hack that will force the server to cancel all opened gRPC channels for a given operation when it receive a GetOperation() call (used by bgd operation status).

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 22, 2018, 11:59

mentioned in commit 3669fd4451f3e8315d84b512a0737cc7257d8f93

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 23, 2018, 11:01

mentioned in commit 797d30106d9884a4b2688a65fa88e7204b55fce8

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Oct 23, 2018, 11:54

mentioned in commit aa0cbf5dfb8e68674c60efe841cf862d6e5cdea6

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:25

marked the task Handle network failures while pooling on Operation status. as completed

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:25

marked the task Try to reconnect when such a failure happens. as completed

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:28

Connection recovering support is now merged: BuildStream/buildstream!806.

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @martinblanchard] on Nov 1, 2018, 09:30

marked the task Resume Operation status polling if reconnection succeed. as completed

Cynical-Optimist commented 4 years ago

In GitLab by [Gitlab user @toscalix] on Nov 1, 2018, 09:41

closed