cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0

[Locket client] gRPC Client improvement #627

Closed klapkov closed 1 year ago

klapkov commented 2 years ago


Summary

Recently we researched how diego performs when network delay is introduced on a landscape. The tool we used was turbulence-bosh, which simulates an n-second delay on every request to a certain vm in the cf-deployment. A detailed summary of the results can be found here. I suggest quickly going over it, because different turbulence scenarios will be referenced multiple times.

One of the weak points we found during the research was in the locket client. Since it is used by the bbs, the auctioneer and the cells for maintaining presence and locks, it is a really important part of diego. The first case is when we introduce a network delay on one of the diego-api vms (bbs can be active or passive, the result is the same): all the cells that use the locket server on that diego-api lose their presence. When a rep process is started on a cell, the locket client establishes a gRPC connection to a server, and that connection is never closed. So when we introduce turbulence, the requests on the server take too long to be processed and the TTL runs out. The expected behaviour in this case is that the client notices the server is taking too long, closes the connection, and re-connects to another server. But that does not happen. For more details, see Scenario 1 in the doc.
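For illustration, here is a simplified sketch of the presence heartbeat (not the actual locket client code; the function name and the 5s retry interval are assumptions, while the 15s TTL matches the logs below):

    // Simplified sketch of the presence heartbeat. The lock is re-asserted
    // over a single long-lived gRPC connection, so when the server is slow,
    // every retry goes to the same server while the 15s TTL runs out.
    func maintainPresence(client models.LocketClient, resource *models.Resource) {
        for {
            ctx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
            _, err := client.Lock(ctx, &models.LockRequest{
                Resource:     resource,
                TtlInSeconds: 15,
            })
            cancel()
            if err != nil {
                // The RPC failed, but the underlying connection stays open,
                // so nothing forces a re-dial to a healthy locket server.
            }
            time.Sleep(5 * time.Second) // assumed retry interval
        }
    }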

The other problem we found is with the bbs and the auctioneer. Since they also use a locket client to acquire and maintain a lock, we see the same behaviour here. If the active auctioneer is connected to a locket server that we target with turbulence, it will lose the lock and a failover will occur, even though the auctioneer itself is fine and the attack is on a completely different vm. In this case we again expect the client to become aware that the server is responding slowly and close the connection. For more details, see Scenarios 2 and 3 in the doc.

Diego repo

Locket repo: https://github.com/cloudfoundry/locket

Describe alternatives you've considered (optional)

We currently have the gRPC client setup like this:

    conn, err := grpc.Dial(
        config.LocketAddress,
        grpc.WithTransportCredentials(credentials.NewTLS(locketTLSConfig)),
        grpc.WithDialer(func(addr string, _ time.Duration) (net.Conn, error) {
            return net.DialTimeout("tcp", addr, 10*time.Second) // give at least 2 seconds per ip address (assuming there are at most 5)
        }),
        grpc.WithBlock(),
        grpc.WithTimeout(10*time.Second), // ensure that grpc won't keep retrying forever
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time: 10 * time.Second, // ping every 10s, but no Timeout is set for the ping's ack (the gRPC default is 20s)
        }),
    )
We have a keepalive ping that is sent to the server every 10s, but we do not set a timeout for that ping. If we set a timeout of 3s, for example, the client will send a ping to the server, and if the server is not able to respond within those 3s, the connection will be closed and the client will connect to another server. For more info, see the gRPC keepalive documentation: https://github.com/grpc/grpc/blob/master/doc/keepalive.md

We have tested this on a dev landscape and the results are great. The cells are able to connect to a healthy locket server and do not lose their presence. The same applies to the bbs and the auctioneer: no lost locks and no unneeded failovers. So in our opinion, this will greatly improve how diego performs when we have network delays.
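As a concrete sketch of the proposal (the other dial options stay as in the snippet above; the 3s value is the one we tested):

    conn, err := grpc.Dial(
        config.LocketAddress,
        grpc.WithTransportCredentials(credentials.NewTLS(locketTLSConfig)),
        grpc.WithKeepaliveParams(keepalive.ClientParameters{
            Time:    10 * time.Second, // send a keepalive ping after 10s of inactivity
            Timeout: 3 * time.Second,  // close the connection if the ping is not acked within 3s
        }),
    )

With the Timeout set, a hung server fails the ping within 3s, gRPC closes the transport, and the blocked Lock call returns "transport is closing" (as seen in the logs further down) instead of waiting out the 15s TTL.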

Additional Text Output, Screenshots, or contextual information (optional)

I think it is a good idea to make those properties of the gRPC client (Time and Timeout) part of the locket config, so the client behaviour can be tuned for the use case.
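For illustration, the wiring could look roughly like this (field names here are hypothetical, to be settled in a PR):

    // Hypothetical config fields on the locket client config:
    type ClientLocketConfig struct {
        LocketAddress                string `json:"locket_address"`
        LocketClientKeepaliveTime    int    `json:"locket_client_keepalive_time"`    // seconds between keepalive pings
        LocketClientKeepaliveTimeout int    `json:"locket_client_keepalive_timeout"` // seconds to wait for a ping ack
    }

    // ...which would feed the dial option:
    grpc.WithKeepaliveParams(keepalive.ClientParameters{
        Time:    time.Duration(cfg.LocketClientKeepaliveTime) * time.Second,
        Timeout: time.Duration(cfg.LocketClientKeepaliveTimeout) * time.Second,
    })

That way each consumer (rep, bbs, auctioneer) can tune the trade-off between failover speed and tolerance for slow links.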

geofffranks commented 1 year ago

@klapkov Sorry for the delay in getting back to you on this. I'm going to try and summarize what I think you're saying the problem is + solution you're proposing. Let me know if I have this correct:

Problem

  1. Rep establishes a presence with Locket. BBS + Auctioneer establish a Lock with Locket.
  2. Every 10s, Rep, BBS, and Auctioneer submit keepalive messages to Locket, with a 20s timeout (the default).
  3. Network interruption occurs
  4. Since the keepalive timeout on the client side is longer than the interval, it keeps sending keepalives despite not getting any responses. Is this correct?
  5. Rep on the client side never thinks it loses presence, and never tries to gain presence with a new locket instance. However, from locket + the rest of the architecture's perspective, presence was lost and not regained.
  6. Auctioneer and BBS suffer from problems where they failover if the locket instance they're talking to fails to respond, when they could instead reconnect to a local and functioning locket instance.

Proposed Solution

  1. Add a keepalive timeout for the keepalive messages, to trigger locket clients to try to re-establish their lock/presence with a different locket server, without restarting + triggering failover.

Questions

  1. Is the above summary correct?
  2. Have you tested scenarios where communication between entire AZs is severed for >180s? How do the components behave in this scenario? I'm worried about unintended consequences of this change leading to split-brained Auctioneer/BBS instances.

klapkov commented 1 year ago

Hello @geofffranks,

Yes, your summary is correct. Having this keepalive timeout greatly improves the performance of diego under turbulence. As for your concerns, I will try to test this scenario in the next few days and come back with the results summarised. Will keep in touch.

mariash commented 1 year ago

I found this story, which seems to have resolved this issue: https://www.pivotaltracker.com/n/projects/1003146/stories/161765510

After adding the Time (10s) keep-alive parameter, the client was trying to reconnect within ~20s.

I just tested this behavior in my environment:

Based on the gRPC docs and the Go gRPC client, the timeout should be set for each connection.
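Two details from that doc are worth noting here: keepalive parameters are DialOptions, so each connection carries its own Time/Timeout, and the server must permit the resulting ping rate through an enforcement policy, otherwise it answers over-eager pings with GOAWAY (ENHANCE_YOUR_CALM). A sketch of the server-side counterpart, with illustrative values rather than what the locket job actually configures:

    srv := grpc.NewServer(
        grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
            MinTime:             5 * time.Second, // tolerate client pings as often as every 5s
            PermitWithoutStream: true,            // allow pings even with no active RPCs
        }),
    )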

Could you provide the locket/rep logs to see timeouts/retries timestamps?

klapkov commented 1 year ago

Hello @mariash and @geofffranks

I collected some logs from test runs with and without the timeout.

These are the logs from the test run without a timeout:

REP:

{"timestamp":"2022-11-21T12:26:06.447555356Z","level":"error","source":"rep","message":"rep.locket-lock.lost-lock","data":{"duration":15000731674,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","owner":"9a998083-c45f-4f28-5fda-3ff8abd12659","value":"{\"cell_id\":\"ae99fb3f-f012-46f4-a853-abef12a9c1d5\",\"rep_address\":\"http://10.0.137.1:1800\",\"zone\":\"z2\",\"capacity\":{\"memory_mb\":31843,\"disk_mb\":63036,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"preloaded\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://ae99fb3f-f012-46f4-a853-abef12a9c1d5.cell.service.cf.internal:1801\"}","type":"presence","type_code":2},"request-uuid":"54b77c36-76fa-4c5c-5ef7-ade0e914b11d","session":"12","ttl_in_seconds":15}}

{"timestamp":"2022-11-21T12:26:26.449765102Z","level":"error","source":"rep","message":"rep.locket-lock.failed-to-acquire-lock","data":{"duration":15000946956,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","owner":"9a998083-c45f-4f28-5fda-3ff8abd12659","value":"{\"cell_id\":\"ae99fb3f-f012-46f4-a853-abef12a9c1d5\",\"rep_address\":\"http://10.0.137.1:1800\",\"zone\":\"z2\",\"capacity\":{\"memory_mb\":31843,\"disk_mb\":63036,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"preloaded\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://ae99fb3f-f012-46f4-a853-abef12a9c1d5.cell.service.cf.internal:1801\"}","type":"presence","type_code":2},"request-uuid":"00c4cddb-1d53-4fe0-57bc-7a88f4954bd8","session":"12","ttl_in_seconds":15}}

{"timestamp":"2022-11-21T12:26:46.451064052Z","level":"error","source":"rep","message":"rep.locket-lock.failed-to-acquire-lock","data":{"duration":15000096402,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","owner":"9a998083-c45f-4f28-5fda-3ff8abd12659","value":"{\"cell_id\":\"ae99fb3f-f012-46f4-a853-abef12a9c1d5\",\"rep_address\":\"http://10.0.137.1:1800\",\"zone\":\"z2\",\"capacity\":{\"memory_mb\":31843,\"disk_mb\":63036,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"preloaded\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://ae99fb3f-f012-46f4-a853-abef12a9c1d5.cell.service.cf.internal:1801\"}","type":"presence","type_code":2},"request-uuid":"a56c9e68-6a47-4983-7f2a-18221a9e295c","session":"12","ttl_in_seconds":15}}

{"timestamp":"2022-11-21T12:26:51.458892178Z","level":"info","source":"rep","message":"rep.locket-lock.acquired-lock","data":{"lock":{"key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","owner":"9a998083-c45f-4f28-5fda-3ff8abd12659","value":"{\"cell_id\":\"ae99fb3f-f012-46f4-a853-abef12a9c1d5\",\"rep_address\":\"http://10.0.137.1:1800\",\"zone\":\"z2\",\"capacity\":{\"memory_mb\":31843,\"disk_mb\":63036,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"preloaded\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://ae99fb3f-f012-46f4-a853-abef12a9c1d5.cell.service.cf.internal:1801\"}","type":"presence","type_code":2},"session":"12","ttl_in_seconds":15}}

LOCKET:

{"timestamp":"2022-11-21T12:26:06.447448689Z","level":"error","source":"locket","message":"locket.lock.lock.failed-to-fetch-lock","data":{"error":"context deadline exceeded","key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","owner":"9a998083-c45f-4f28-5fda-3ff8abd12659","request-uuid":"54b77c36-76fa-4c5c-5ef7-ade0e914b11d","session":"55816.1","type":"presence","type-code":2}}

{"timestamp":"2022-11-21T12:26:06.447573998Z","level":"error","source":"locket","message":"locket.lock.failed-locking-lock","data":{"error":"context deadline exceeded","key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","owner":"9a998083-c45f-4f28-5fda-3ff8abd12659","request-uuid":"54b77c36-76fa-4c5c-5ef7-ade0e914b11d","session":"55816"}}

{"timestamp":"2022-11-21T12:26:06.449113008Z","level":"info","source":"locket","message":"locket.context-deadline-exceeded","data":{"request-id":"","request-type":"Lock","resource-key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","resource-owner":"9a998083-c45f-4f28-5fda-3ff8abd12659"}}

{"timestamp":"2022-11-21T12:26:51.458383503Z","level":"info","source":"locket","message":"locket.lock.lock.acquired-lock","data":{"key":"ae99fb3f-f012-46f4-a853-abef12a9c1d5","owner":"9a998083-c45f-4f28-5fda-3ff8abd12659","request-uuid":"e0837ac8-df26-4dd3-42a8-1afe361c8efd","session":"55825.1","type":"presence","type-code":2}}

And here are the logs from the rep with a timeout of 3 seconds set on the locket gRPC client:

{"timestamp":"2022-11-21T17:09:50.364039016Z","level":"error","source":"rep","message":"rep.locket-lock.lost-lock","data":{"duration":3422263864,"error":"rpc error: code = Unavailable desc = transport is closing","lock":{"key":"b1045897-55c0-413f-91b7-1fa54bfadf21","owner":"ec2868e3-5bf9-4812-6ee3-3f2a9afefacc","value":"{\"cell_id\":\"b1045897-55c0-413f-91b7-1fa54bfadf21\",\"rep_address\":\"http://10.0.73.2:1800\",\"zone\":\"z1\",\"capacity\":{\"memory_mb\":31843,\"disk_mb\":62735,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"preloaded\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://b1045897-55c0-413f-91b7-1fa54bfadf21.cell.service.cf.internal:1801\"}","type":"presence","type_code":2},"request-uuid":"41c80825-506f-4143-4eae-3f2bc04f8dc4","session":"12","ttl_in_seconds":15}}

{"timestamp":"2022-11-21T17:09:55.375832924Z","level":"info","source":"rep","message":"rep.locket-lock.acquired-lock","data":{"lock":{"key":"b1045897-55c0-413f-91b7-1fa54bfadf21","owner":"ec2868e3-5bf9-4812-6ee3-3f2a9afefacc","value":"{\"cell_id\":\"b1045897-55c0-413f-91b7-1fa54bfadf21\",\"rep_address\":\"http://10.0.73.2:1800\",\"zone\":\"z1\",\"capacity\":{\"memory_mb\":31843,\"disk_mb\":62735,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"preloaded\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://b1045897-55c0-413f-91b7-1fa54bfadf21.cell.service.cf.internal:1801\"}","type":"presence","type_code":2},"session":"12","ttl_in_seconds":15}}

As you can see from the logs, with the timeout implemented the client sends a request to the server every 10s and waits 3 seconds for it to respond. If the server does not respond, the cell's locket client connects to another server to maintain its presence. In comparison, in the first case the cell loses its presence for almost a minute (the logs above show a 45s gap between lost-lock and acquired-lock) and does not connect to a healthy locket server. I will wait for your opinion on this, and in the meantime I will test the scenario of a whole zone being down and share the results.

klapkov commented 1 year ago

Hello @geofffranks,

I managed to collect some logs from tests where I introduced a large network delay in a whole zone. Here are the results:

Tests with the current locket client:

Test 1:
BBS in z3 - holds lock in z1
Auctioneer in z2 - holds lock in z3
Network delay in z1 for 60s with 5s delay

BBS in z3:
{"timestamp":"2022-12-08T05:01:44.205206678Z","level":"error","source":"bbs","message":"bbs.locket-lock.lost-lock","│ata":{"duration":15000565234,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"request-uuid":"6ef31452-a5e3-4cee-5325-0f952698c691","session":"4","ttl_in_seconds":15}}

{"timestamp":"2022-12-08T05:01:55.554451350Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

Result: BBS connected to another locket server after 11s.

Test 2:
BBS in z3 - holds lock in z3
Auctioneer in z2 - holds lock in z3
Network delay in z3 for 60s with 5s delay

BBS in z3:
{"timestamp":"2022-12-08T05:04:42.053826740Z","level":"error","source":"bbs","message":"bbs.locket-lock.lost-lock","data":{"duration":15000411885,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"request-uuid":"2ffc82f0-42c8-4ee2-702d-ff1d4fe56d07","session":"4","ttl_in_seconds":15}}

BBS in z1:
{"timestamp":"2022-12-08T05:04:58.055139163Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"e44d4434-71d6-4328-851d-f1cc3f1e334b","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}
{"timestamp":"2022-12-08T05:04:58.055803265Z","level":"info","source":"bbs","message":"bbs.set-lock-held-metron-notifier.started","data":{"session":"5"}} 

Auctioneer in z2: 
{"timestamp":"2022-12-08T05:04:42.055663684Z","level":"error","source":"auctioneer","message":"auctioneer.locket-lock.lost-lock","data":{"duration":15000852951,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"auctioneer","owner":"9bd37427-125c-40e7-90d1-de3133982085","type":"lock","type_code":1},"request-uuid":"904a3e47-f1aa-4242-43d1-ede0f121f525","session":"2","ttl_in_seconds":15}}

{"timestamp":"2022-12-08T05:04:57.083761429Z","level":"info","source":"auctioneer","message":"auctioneer.locket-lock.acquired-lock","data":{"lock":{"key":"auctioneer","owner":"9bd37427-125c-40e7-90d1-de3133982085","type":"lock","type_code":1},"session":"2","ttl_in_seconds":15}}

Result: BBS failed over to another instance, and Auctioneer managed to connect to another server. BBS failed over after 16s; Auctioneer switched to another server after 15s.

Test 3:
BBS in z3 - holds lock in z3
Auctioneer in z3 - holds lock in z3
Network delay in z3 for 120s with 10s delay

BBS in z3: 
{"timestamp":"2022-12-08T05:49:51.484693267Z","level":"error","source":"bbs","message":"bbs.locket-lock.lost-lock","data":{"duration":15000650622,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"request-uuid":"d9d029f2-a898-4946-5aca-7a98254aa446","session":"4","ttl_in_seconds":15}}

BBS in z2: 
{"timestamp":"2022-12-08T05:50:02.029613028Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"f5d334ee-8ca7-4cec-ae3b-b49627c947a6","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

Auctioneer in z3: 
{"timestamp":"2022-12-08T05:49:51.525670456Z","level":"error","source":"auctioneer","message":"auctioneer.locket-lock.lost-lock","data":{"duration":15000076967,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"auctioneer","owner":"33b28858-fdd1-4c79-93fc-851593cc2335","type":"lock","type_code":1},"request-uuid":"cced7bd0-8a11-4da1-4625-14a112625c35","session":"2","ttl_in_seconds":15}}

Auctioneer in z1: 
{"timestamp":"2022-12-08T05:50:03.873578995Z","level":"info","source":"auctioneer","message":"auctioneer.locket-lock.acquired-lock","data":{"lock":{"key":"auctioneer","owner":"1812e6dd-80cf-47b9-8d10-5c2cfa06a64a","type":"lock","type_code":1},"session":"2","ttl_in_seconds":15}}

Result: Both of them lost locks and new instances acquired them. BBS acquired after 70s, Auctioneer after 65s.

Tests after applying a 3s timeout to the locket client:

Test 1:
BBS in z1 - holds lock in z1
Auctioneer in z2 - holds lock in z1
Network delay in z1 for 60s with 5s delay

BBS in z1: 
{"timestamp":"2022-12-08T08:18:23.465212316Z","level":"error","source":"bbs","message":"bbs.locket-lock.lost-lock","data":{"duration":15000458079,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","lock":{"key":"bbs","owner":"e44d4434-71d6-4328-851d-f1cc3f1e334b","type":"lock","type_code":1},"request-uuid":"f8864760-b82c-41cd-4697-a7ba8c51bad7","session":"4","ttl_in_seconds":15}}

BBS in z3: 
{"timestamp":"2022-12-08T08:18:23.538421985Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

Auctioneer in z2: 
{"timestamp":"2022-12-08T08:18:11.450527693Z","level":"error","source":"auctioneer","message":"auctioneer.locket-lock.lost-lock","data":{"duration":3417569434,"error":"rpc error: code = Unavailable desc = transport is closing","lock":{"key":"auctioneer","owner":"9bd37427-125c-40e7-90d1-de3133982085","type":"lock","type_code":1},"request-uuid":"ca255057-e0b9-445d-5551-af4ae109c576","session":"2","ttl_in_seconds":15}}

{"timestamp":"2022-12-08T08:18:13.084719092Z","level":"info","source":"auctioneer","message":"auctioneer.locket-lock.acquired-lock","data":{"lock":{"key":"auctioneer","owner":"9bd37427-125c-40e7-90d1-de3133982085","type":"lock","type_code":1},"session":"2","ttl_in_seconds":15}}

Result: BBS in z3 acquired the lock almost immediately; Auctioneer connected to another locket server after 2s.

Test 2:
BBS in z3 - holds lock in z2
Auctioneer in z2 - holds lock in z3
Network delay in z2 for 60s with 5s delay

BBS in z3: 
{"timestamp":"2022-12-08T08:29:41.436702439Z","level":"error","source":"bbs","message":"bbs.locket-lock.lost-lock","data":{"duration":3417857072,"error":"rpc error: code = Unavailable desc = transport is closing","lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"request-uuid":"55a5b840-da38-4bdd-6458-7f8e25e8bfc7","session":"4","ttl_in_seconds":15}}

{"timestamp":"2022-12-08T08:29:42.622502076Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

Auctioneer in z2: 
{"timestamp":"2022-12-08T08:29:41.818504598Z","level":"error","source":"auctioneer","message":"auctioneer.locket-lock.lost-lock","data":{"duration":3009429384,"error":"rpc error: code = Unavailable desc = transport is closing","lock":{"key":"auctioneer","owner":"9bd37427-125c-40e7-90d1-de3133982085","type":"lock","type_code":1},"request-uuid":"4b66b6f6-e48c-4a3d-443c-ba6ab8ccea68","session":"2","ttl_in_seconds":15}}

Auctioneer in z3: 
{"timestamp":"2022-12-08T08:29:58.926958775Z","level":"info","source":"auctioneer","message":"auctioneer.locket-lock.acquired-lock","data":{"lock":{"key":"auctioneer","owner":"33b28858-fdd1-4c79-93fc-851593cc2335","type":"lock","type_code":1},"session":"2","ttl_in_seconds":15}}

Result: BBS connected to another locket server after 1s; Auctioneer failed over to z3 after 17s.

Test 3:
BBS in z3 - holds lock in z3
Auctioneer in z3 - holds lock in z2
Network delay in z3 for 120s with 10s delay

BBS in z3: 
{"timestamp":"2022-12-08T13:38:11.844973084Z","level":"error","source":"bbs","message":"bbs.locket-lock.lost-lock","data":{"duration":3025798967,"error":"rpc error: code = Unavailable desc = transport is closing","lock":{"key":"bbs","owner":"ef69d633-7e32-4db5-bdfb-d1deb65f8fc6","type":"lock","type_code":1},"request-uuid":"379cd7bb-3fb5-4a42-5a71-47849b46e332","session":"4","ttl_in_seconds":15}}

BBS in z1: 
{"timestamp":"2022-12-08T13:38:34.059274495Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"e44d4434-71d6-4328-851d-f1cc3f1e334b","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

Auctioneer in z3: 
{"timestamp":"2022-12-08T13:38:10.872863917Z","level":"error","source":"auctioneer","message":"auctioneer.locket-lock.lost-lock","data":{"duration":3011507228,"error":"rpc error: code = Unavailable desc = transport is closing","lock":{"key":"auctioneer","owner":"33b28858-fdd1-4c79-93fc-851593cc2335","type":"lock","type_code":1},"request-uuid":"d0f0f277-4909-456c-5ec7-7dc645723f0f","session":"2","ttl_in_seconds":15}}

Auctioneer in z2: 
{"timestamp":"2022-12-08T13:38:33.332800975Z","level":"info","source":"auctioneer","message":"auctioneer.locket-lock.acquired-lock","data":{"lock":{"key":"auctioneer","owner":"9bd37427-125c-40e7-90d1-de3133982085","type":"lock","type_code":1},"session":"2","ttl_in_seconds":15}}

Result: Both failed over to other instances. BBS after 23s, Auctioneer after 23s.

To recap these findings: I did not experience any weird behaviour or observe any negative effects after the change. Quite the opposite: I see a great improvement in the performance and speed of the locket client. Please share your thoughts.

Regards, Konstantin

geofffranks commented 1 year ago

Thanks for all the additional info @klapkov!

Looks like you had a locket PR associated with this, but that's closed now. Was that intentional? Were there any other changes as part of addressing this?

Your testing looks promising! I'm going to run this by some teammates with more historical knowledge on the locket/diego codebase to double check though.

geofffranks commented 1 year ago

Hey! We're aligned on fixing it with this timeout. Can you make it a configurable parameter though in both locket + bosh, so that operators can override this if needed? Defaulting to 10s seems reasonable.

klapkov commented 1 year ago

Hello @geofffranks

The previous PR I opened used a static timeout. As you said, it would be good to have these properties configurable so operators have flexibility. Thanks for the response; I will start preparing a PR.