cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0
201 stars 210 forks source link

Document expected locket related logs #770

Open ameowlia opened 1 year ago

ameowlia commented 1 year ago

I am currently debugging an issue around locket and I was trying to figure out which logs are part of normal operation. It would great to document this better in the future. For now I am going to document it here.

Setup for testing "lock" locks

{"timestamp":"2023-06-22T18:01:32.238648111Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock", "data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1}, "lock-owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"7ca981d9-9375-45a8-7ddc-8940dc662643","session":"4","ttl_in_seconds":15}}

locket.stdout.log

There is a matching log in locket from where the bbs on control/fe89652e-4e59-4efc-b75c-88cc029ed81 attempted to grab the lock and failed

{"timestamp":"2023-06-22T18:01:32.238648111Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock", "data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1}, "lock-owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"7ca981d9-9375-45a8-7ddc-8940dc662643","session":"4","ttl_in_seconds":15}}

In the locket.lock db table there is one lock for bbs. And the owner is the VM guid of the VM where the bbs that has the lock is running.

The 'modified_index' is constantly being sequentially updated.

mysql> select * from locket.locks where path = "bbs"; +------+--------------------------------------+-------+------+----------------+--------------------------------------+------+ | path | owner | value | type | modified_index | modified_id | ttl | +------+--------------------------------------+-------+------+----------------+--------------------------------------+------+ | bbs | 9e859b27-ac41-4e39-ba59-bb87d520d987 | | lock | 4352 | d6d8b4f2-8d1e-4ba0-71ca-fdbf6b0bad58 | 15 | +------+--------------------------------------+-------+------+----------------+--------------------------------------+------+


## Changing who has locks
If the bbs that currently has the lock restarts, then it will give up the lock and the other bbs will claim the lock. I did this by running `monit restart bbs` on `control/9e859b27-ac41-4e39-ba59-bb87d520d987`.

locket.stdout.log

The bbs on VM '9e859b27-ac41-4e39-ba59-bb87d520d987' gives up the lock

{"timestamp":"2023-06-22T19:16:05.620930708Z","level":"info","source":"locket","message":"locket.release.release-lock.released-lock","data":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","session":"1774728.1","type":"lock","type-code":1}}

the bbs on VM 'fe89652e-4e59-4efc-b75c-88cc029ed816' successfully claims the lock.

{"timestamp":"2023-06-22T19:16:06.089322785Z","level":"info","source":"locket","message":"locket.lock.lock.acquired-lock","data":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","request-uuid":"7e2a16d0-56a7-40dc-5d04-ca7275364cc2","session":"1774735.1","type":"lock","type-code":1}}

the bbs on VM '9e859b27-ac41-4e39-ba59-bb87d520d987' is restarted and attempts to claim the lock

{"timestamp":"2023-06-22T19:16:20.492246371Z","level":"info","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetched-lock","data":{"key":"bbs","modified-index":4431,"owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2.1","type":"lock","type-code":1}}

the bbs on VM '9e859b27-ac41-4e39-ba59-bb87d520d987' fails to claim the lock, because the bbs on VM 'fe89652e-4e59-4efc-b75c-88cc029ed816' has the lock

{"timestamp":"2023-06-22T19:16:20.493343559Z","level":"error","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetch-failed-owner-mismatch","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","fetched-owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","key":"bbs","modified-index":4431,"owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2.1","type":"lock","type-code":1}} {"timestamp":"2023-06-22T19:16:20.495265778Z","level":"error","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","key":"bbs","modified-index":4431,"request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2","type":"lock"}}

the lock in the db is now updated with the new owner.

mysql> select * from locket.locks where path = "bbs"; +------+--------------------------------------+-------+------+----------------+--------------------------------------+------+ | path | owner | value | type | modified_index | modified_id | ttl | +------+--------------------------------------+-------+------+----------------+--------------------------------------+------+ | bbs | fe89652e-4e59-4efc-b75c-88cc029ed816 | | lock | 98 | cba003ea-608f-4cdf-539e-0304896dc313 | 15 | +------+--------------------------------------+-------+------+----------------+--------------------------------------+------+ 1 row in set (0.00 sec)

control/9e859b27-ac41-4e39-ba59-bb87d520d987

bbs.stdout.log

this bbs has successfully restarted.

{"timestamp":"2023-06-22T19:16:07.128143147Z","level":"info","source":"bbs","message":"bbs.locket-lock.started","data":{"lock":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}

this bbs fails to claim the lock because the other bbs has the lock

{"timestamp":"2023-06-22T19:16:07.136452192Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","type":"lock","type_code":1},"lock-owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","request-uuid":"efb3d6c6-9585-4efb-6f74-126dcfccb99c","session":"4","ttl_in_seconds":15}}

control/fe89652e-4e59-4efc-b75c-88cc029ed816

bbs.stdout.log

this bbs grabs the lock as soon as it is up for grabs

{"timestamp":"2023-06-22T19:16:06.091053490Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1},"session":"4", "ttl_in_seconds":15}}

once it grabs the lock it gets to work!

{"timestamp":"2023-06-22T19:16:06.092710053Z","level":"info","source":"bbs","message":"bbs.set-lock-held-metron-notifier.started","data":{"session":"5"}} {"timestamp":"2023-06-22T19:16:06.093315811Z","level":"info","source":"bbs","message":"bbs.task-completion-workpool.starting","data":{"session":"1"}} ...

ameowlia commented 1 year ago

Setup for testing "presence" locks.

Presence locks are registered by the rep on each diego cell. It puts information about the diego cell into the lock db so BBS can know which diego cells are available to run workloads.

I have two diego cells with GUIDs: 0e776c46-fd44-48f4-83ea-06b1dd13ca4b and 8ba76ac6-4809-4255-b59b-2e3878bf23d1.

# locket.lock db table
# there is one presence lock per diego cell, where the path is the GUID of the diego cell VM.

mysql> select path, type from locks;
+--------------------------------------+----------+
| path                                 | type     |
+--------------------------------------+----------+
| 0e776c46-fd44-48f4-83ea-06b1dd13ca4b | presence |
| 8ba76ac6-4809-4255-b59b-2e3878bf23d1 | presence |
| auctioneer                           | lock     |
| bbs                                  | lock     |
| cc-deployment-updater                | lock     |
| policy-server-asg-syncer             | lock     |
| routing_api_lock                     | lock     |
| tps_watcher                          | lock     |
+--------------------------------------+----------+
8 rows in set (0.00 sec)

Changing the presence locks

  1. run monit retstart rep on diego cell 0e776c46-fd44-48f4-83ea-06b1dd13ca4b. This simulates an upgrade situation where the rep is restarted.
# rep.stdout.log

# rep exits
{"timestamp":"2023-06-22T20:19:38.920038544Z","level":"info","source":"rep","message":"rep.exited","data":{}}

# rep restarts 
{"timestamp":"2023-06-22T20:19:40.263986195Z","level":"info","source":"rep","message":"rep.wait-for-garden.ping-garden","data":{"initialTime:":"2023-06-22T20:19:40.263953792Z","session":"1","wait-time-ns:":30876}}

...
# rep gets the lock again
{"timestamp":"2023-06-22T20:19:42.926917888Z","level":"info","source":"rep","message":"rep.locket-lock.acquired-lock","data":{"lock":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","value":"{
\"cell_id\":\"0e776c46-fd44-48f4-83ea-06b1dd13ca4b\",\"rep_address\":\"http://10.0.4.8:1800\",\"zone\":\"us-central1-f\",\"capacity\":{\"memory_mb\":12977,\"disk_mb\":104349,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"prel
oaded\",\"properties\":[\"cflinuxfs3\",\"cflinuxfs4\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\",\"cflinuxfs4\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://0e776c46-fd44-48f4-83ea-06b1dd13ca4b.cell.service.cf.inte
rnal:1801\"}","type":"presence","type_code":2},"session":"12","ttl_in_seconds":15}}
# locket.stdout.log

# lock is released when rep is restarted
{"timestamp":"2023-06-22T20:19:38.914625536Z","level":"info","source":"locket","message":"locket.release.release-lock.released-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","session":"1815120.1","type":"presence","type-code":2}}

# lock is acquired again
{"timestamp":"2023-06-22T20:19:42.926185907Z","level":"info","source":"locket","message":"locket.lock.lock.acquired-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","request-uuid":"3293a3eb-3c86-4965-457b-3c54cd3c4199","session":"1815164.1","type":"presence","type-code":2}}

# for some reason it tries to grab the lock again (?) but can't (?). This confusing error seems to be a part of the normal process.
{"timestamp":"2023-06-22T20:19:53.862699986Z","level":"info","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetched-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2.1","type":"presence","type-code":2}}
{"timestamp":"2023-06-22T20:19:53.862867442Z","level":"error","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetch-failed-owner-mismatch","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","fetched-owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2.1","type":"presence","type-code":2}}
{"timestamp":"2023-06-22T20:19:53.863737078Z","level":"error","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2","type":"presence"}}
winkingturtle-vmw commented 5 months ago

@ameowlia Should we keep this issue open, close it, or move this into a markdown file?