ameowlia opened this issue 1 year ago
Presence locks are registered by the rep on each Diego cell. The rep writes information about its cell into the lock DB so that the BBS knows which Diego cells are available to run workloads.
I have two Diego cells with GUIDs 0e776c46-fd44-48f4-83ea-06b1dd13ca4b and 8ba76ac6-4809-4255-b59b-2e3878bf23d1.
# locket.locks db table
# there is one presence lock per diego cell, where the path is the GUID of the diego cell VM.
mysql> select path, type from locks;
+--------------------------------------+----------+
| path                                 | type     |
+--------------------------------------+----------+
| 0e776c46-fd44-48f4-83ea-06b1dd13ca4b | presence |
| 8ba76ac6-4809-4255-b59b-2e3878bf23d1 | presence |
| auctioneer | lock |
| bbs | lock |
| cc-deployment-updater | lock |
| policy-server-asg-syncer | lock |
| routing_api_lock | lock |
| tps_watcher | lock |
+--------------------------------------+----------+
8 rows in set (0.00 sec)
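The table above can be modeled with a small sketch. This is purely illustrative Python, not cloudfoundry/locket's actual Go implementation: a TTL-stamped lock table where "presence" rows are per-cell heartbeats and "lock" rows are singletons (bbs, auctioneer, and so on).

```python
import time

# Illustrative sketch only -- NOT cloudfoundry/locket's actual implementation.
# Models a locket-style lock table: "presence" rows are per-cell heartbeats,
# "lock" rows are singletons (bbs, auctioneer, ...). Every row carries a TTL.
class LockTable:
    def __init__(self):
        self.locks = {}  # path -> {owner, type, expires_at, modified_index}

    def acquire(self, path, owner, lock_type, ttl):
        now = time.monotonic()
        entry = self.locks.get(path)
        if entry and entry["owner"] != owner and entry["expires_at"] > now:
            # mirrors "rpc error: code = AlreadyExists desc = lock-collision"
            raise RuntimeError("lock-collision")
        index = entry["modified_index"] + 1 if entry else 0
        self.locks[path] = {"owner": owner, "type": lock_type,
                            "expires_at": now + ttl, "modified_index": index}

    def presences(self):
        # Roughly what the BBS asks locket: which cells are alive right now?
        now = time.monotonic()
        return sorted(path for path, e in self.locks.items()
                      if e["type"] == "presence" and e["expires_at"] > now)

table = LockTable()
table.acquire("0e776c46-fd44-48f4-83ea-06b1dd13ca4b", "rep-owner-1", "presence", ttl=15)
table.acquire("8ba76ac6-4809-4255-b59b-2e3878bf23d1", "rep-owner-2", "presence", ttl=15)
table.acquire("bbs", "9e859b27-ac41-4e39-ba59-bb87d520d987", "lock", ttl=15)
print(table.presences())  # only the two cell GUIDs; "bbs" is a lock, not a presence
```

In this model a cell simply disappears from `presences()` when its TTL lapses, which matches the idea that the BBS only schedules work onto cells with a live presence lock.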
I ran monit restart rep on Diego cell 0e776c46-fd44-48f4-83ea-06b1dd13ca4b. This simulates an upgrade situation where the rep is restarted.

# rep.stdout.log
# rep exits
{"timestamp":"2023-06-22T20:19:38.920038544Z","level":"info","source":"rep","message":"rep.exited","data":{}}
# rep restarts
{"timestamp":"2023-06-22T20:19:40.263986195Z","level":"info","source":"rep","message":"rep.wait-for-garden.ping-garden","data":{"initialTime:":"2023-06-22T20:19:40.263953792Z","session":"1","wait-time-ns:":30876}}
...
# rep gets the lock again
{"timestamp":"2023-06-22T20:19:42.926917888Z","level":"info","source":"rep","message":"rep.locket-lock.acquired-lock","data":{"lock":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","value":"{\"cell_id\":\"0e776c46-fd44-48f4-83ea-06b1dd13ca4b\",\"rep_address\":\"http://10.0.4.8:1800\",\"zone\":\"us-central1-f\",\"capacity\":{\"memory_mb\":12977,\"disk_mb\":104349,\"containers\":249},\"rootfs_provider_list\":[{\"name\":\"preloaded\",\"properties\":[\"cflinuxfs3\",\"cflinuxfs4\"]},{\"name\":\"preloaded+layer\",\"properties\":[\"cflinuxfs3\",\"cflinuxfs4\"]},{\"name\":\"docker\"}],\"rep_url\":\"https://0e776c46-fd44-48f4-83ea-06b1dd13ca4b.cell.service.cf.internal:1801\"}","type":"presence","type_code":2},"session":"12","ttl_in_seconds":15}}
# locket.stdout.log
# lock is released when rep is restarted
{"timestamp":"2023-06-22T20:19:38.914625536Z","level":"info","source":"locket","message":"locket.release.release-lock.released-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","session":"1815120.1","type":"presence","type-code":2}}
# lock is acquired again
{"timestamp":"2023-06-22T20:19:42.926185907Z","level":"info","source":"locket","message":"locket.lock.lock.acquired-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","request-uuid":"3293a3eb-3c86-4965-457b-3c54cd3c4199","session":"1815164.1","type":"presence","type-code":2}}
# when the old rep session's TTL expires, locket fetches the lock to release it, but the lock now belongs to the new owner from the restarted rep, so the compare-and-release fails with lock-collision. This confusing error appears to be part of the normal restart process.
{"timestamp":"2023-06-22T20:19:53.862699986Z","level":"info","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetched-lock","data":{"key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2.1","type":"presence","type-code":2}}
{"timestamp":"2023-06-22T20:19:53.862867442Z","level":"error","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetch-failed-owner-mismatch","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","fetched-owner":"c7ed4409-d33a-4b9b-644c-e5cdd873568f","key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"owner":"78e5a54a-cc4f-4f67-5efc-6c2309bd5d28","request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2.1","type":"presence","type-code":2}}
{"timestamp":"2023-06-22T20:19:53.863737078Z","level":"error","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","key":"0e776c46-fd44-48f4-83ea-06b1dd13ca4b","modified-index":91,"request-uuid":"733a2b67-f984-46ff-4f04-59348b6e47d7","session":"1815119.2","type":"presence"}}
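A hedged sketch of what the fetch-and-release step appears to be doing, judging from the log lines above (function and variable names are illustrative, not locket's real API): release is a compare-and-release keyed on the owner, so the stale session's release attempt must fail once the restarted rep has re-acquired the lock under a new owner GUID.

```python
# Hedged sketch of the fetch-and-release step behind the
# "fetch-failed-owner-mismatch" lines above; names are illustrative,
# not locket's real API.
def fetch_and_release(locks, key, expected_owner):
    entry = locks.get(key)
    if entry is None:
        return "not-found"
    if entry["owner"] != expected_owner:
        # The rep re-acquired the lock under a new owner GUID after restarting,
        # so releasing on behalf of the stale session must fail -- otherwise it
        # would delete the new, healthy lock out from under the new session.
        raise RuntimeError("rpc error: code = AlreadyExists desc = lock-collision")
    del locks[key]
    return "released"

# State at 20:19:53: the restarted rep owns the lock ...
locks = {"0e776c46-fd44-48f4-83ea-06b1dd13ca4b":
         {"owner": "c7ed4409-d33a-4b9b-644c-e5cdd873568f"}}
# ... while the pre-restart session's TTL bookkeeping still expects the old owner.
try:
    fetch_and_release(locks, "0e776c46-fd44-48f4-83ea-06b1dd13ca4b",
                      expected_owner="78e5a54a-cc4f-4f67-5efc-6c2309bd5d28")
except RuntimeError as err:
    print(err)  # benign: the stale session simply gives up
```

Under this reading, the "error" is the desired outcome: failing the release is what protects the new session's lock.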
@ameowlia Should we keep this issue open, close it, or move this into a markdown file?
I am currently debugging an issue around locket, and I was trying to figure out which logs are part of normal operation. It would be great to document this better in the future; for now I am going to document it here.
Setup for testing "lock" locks
I have two VMs where BBS is deployed:

control/9e859b27-ac41-4e39-ba59-bb87d520d987 👈 This VM started first, so the bbs on it has the lock.
control/fe89652e-4e59-4efc-b75c-88cc029ed816 👈 This VM started second, so the bbs on it does not have the lock.

bbs.stdout.log
{"timestamp":"2023-06-22T18:01:32.238648111Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1},"lock-owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"7ca981d9-9375-45a8-7ddc-8940dc662643","session":"4","ttl_in_seconds":15}}
locket.stdout.log
There is a matching log in locket from where the bbs on control/fe89652e-4e59-4efc-b75c-88cc029ed816 attempted to grab the lock and failed:
{"timestamp":"2023-06-22T18:01:32.238648111Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock", "data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1}, "lock-owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"7ca981d9-9375-45a8-7ddc-8940dc662643","session":"4","ttl_in_seconds":15}}
In the locket.locks db table there is one lock for bbs, and the owner is the GUID of the VM where the lock-holding bbs is running.
The modified_index is constantly being incremented, one step at a time.
mysql> select * from locket.locks where path = "bbs";
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| path | owner                                | value | type | modified_index | modified_id                          | ttl  |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| bbs  | 9e859b27-ac41-4e39-ba59-bb87d520d987 |       | lock |           4352 | d6d8b4f2-8d1e-4ba0-71ca-fdbf6b0bad58 |   15 |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
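One way to picture why modified_index climbs steadily (an illustrative sketch, not locket's actual schema code): each TTL heartbeat from the current holder is an owner-checked update that bumps the index, so a steadily increasing index just means the holder is healthy and refreshing on schedule.

```python
# Illustrative sketch only -- not locket's actual code. Each TTL heartbeat
# from the current holder bumps modified_index.
def refresh(entry, owner, ttl, now):
    if entry["owner"] != owner:
        raise RuntimeError("lock-collision")  # only the holder may refresh
    entry["modified_index"] += 1              # behaves like a fencing token
    entry["expires_at"] = now + ttl

entry = {"owner": "9e859b27-ac41-4e39-ba59-bb87d520d987",
         "modified_index": 4352, "expires_at": 0}
for tick in range(3):  # three heartbeats
    refresh(entry, "9e859b27-ac41-4e39-ba59-bb87d520d987", ttl=15, now=tick)
print(entry["modified_index"])  # 4355
```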
locket.stdout.log
The bbs on VM '9e859b27-ac41-4e39-ba59-bb87d520d987' gives up the lock
{"timestamp":"2023-06-22T19:16:05.620930708Z","level":"info","source":"locket","message":"locket.release.release-lock.released-lock","data":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","session":"1774728.1","type":"lock","type-code":1}}
The bbs on VM 'fe89652e-4e59-4efc-b75c-88cc029ed816' successfully claims the lock:
{"timestamp":"2023-06-22T19:16:06.089322785Z","level":"info","source":"locket","message":"locket.lock.lock.acquired-lock","data":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","request-uuid":"7e2a16d0-56a7-40dc-5d04-ca7275364cc2","session":"1774735.1","type":"lock","type-code":1}}
The bbs on VM '9e859b27-ac41-4e39-ba59-bb87d520d987' is restarted and attempts to claim the lock:
{"timestamp":"2023-06-22T19:16:20.492246371Z","level":"info","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetched-lock","data":{"key":"bbs","modified-index":4431,"owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2.1","type":"lock","type-code":1}}
The bbs on VM '9e859b27-ac41-4e39-ba59-bb87d520d987' fails to claim the lock, because the bbs on VM 'fe89652e-4e59-4efc-b75c-88cc029ed816' has the lock:
{"timestamp":"2023-06-22T19:16:20.493343559Z","level":"error","source":"locket","message":"locket.lock.register-ttl.fetch-and-release-lock.fetch-failed-owner-mismatch","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","fetched-owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","key":"bbs","modified-index":4431,"owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2.1","type":"lock","type-code":1}}
{"timestamp":"2023-06-22T19:16:20.495265778Z","level":"error","source":"locket","message":"locket.lock.register-ttl.failed-compare-and-release","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","key":"bbs","modified-index":4431,"request-uuid":"05c37d7c-4163-4b64-4d9f-30af73a0860d","session":"1774724.2","type":"lock"}}
The lock in the db is now updated with the new owner:
mysql> select * from locket.locks where path = "bbs";
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| path | owner                                | value | type | modified_index | modified_id                          | ttl  |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
| bbs  | fe89652e-4e59-4efc-b75c-88cc029ed816 |       | lock |             98 | cba003ea-608f-4cdf-539e-0304896dc313 |   15 |
+------+--------------------------------------+-------+------+----------------+--------------------------------------+------+
1 row in set (0.00 sec)
control/9e859b27-ac41-4e39-ba59-bb87d520d987
bbs.stdout.log
This bbs has successfully restarted:
{"timestamp":"2023-06-22T19:16:07.128143147Z","level":"info","source":"bbs","message":"bbs.locket-lock.started","data":{"lock":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","type":"lock","type_code":1},"session":"4","ttl_in_seconds":15}}
This bbs fails to claim the lock, because the other bbs now has it:
{"timestamp":"2023-06-22T19:16:07.136452192Z","level":"error","source":"bbs","message":"bbs.locket-lock.failed-to-acquire-lock","data":{"error":"rpc error: code = AlreadyExists desc = lock-collision","lock":{"key":"bbs","owner":"9e859b27-ac41-4e39-ba59-bb87d520d987","type":"lock","type_code":1},"lock-owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","request-uuid":"efb3d6c6-9585-4efb-6f74-126dcfccb99c","session":"4","ttl_in_seconds":15}}
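The repeated failed-to-acquire-lock lines imply a standby loop: the passive bbs keeps retrying until the holder lets go. A minimal sketch under that assumption (function names are illustrative, not bbs's real code):

```python
import itertools
import time

# Hedged sketch of the standby behavior implied by the repeated
# failed-to-acquire-lock lines: the passive bbs retries until the lock frees
# up. Names are illustrative, not bbs's real code.
def wait_for_lock(try_acquire, retry_interval=0.0, max_attempts=1000):
    for attempt in itertools.count(1):
        try:
            try_acquire()
            return attempt  # acquired: this bbs becomes the active one
        except RuntimeError:
            if attempt >= max_attempts:
                raise
            time.sleep(retry_interval)

# Simulate the holder releasing the lock after three failed attempts.
state = {"owner": "9e859b27-ac41-4e39-ba59-bb87d520d987", "tries": 0}

def try_acquire():
    state["tries"] += 1
    if state["tries"] > 3:
        state["owner"] = None  # holder released (e.g. it was restarted)
    if state["owner"] is not None:
        raise RuntimeError("rpc error: code = AlreadyExists desc = lock-collision")
    state["owner"] = "fe89652e-4e59-4efc-b75c-88cc029ed816"

print(wait_for_lock(try_acquire))  # succeeds on the 4th attempt
```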
control/fe89652e-4e59-4efc-b75c-88cc029ed816
bbs.stdout.log
This bbs claims the lock as soon as it becomes available:
{"timestamp":"2023-06-22T19:16:06.091053490Z","level":"info","source":"bbs","message":"bbs.locket-lock.acquired-lock","data":{"lock":{"key":"bbs","owner":"fe89652e-4e59-4efc-b75c-88cc029ed816","type":"lock","type_code":1},"session":"4", "ttl_in_seconds":15}}
Once it has the lock, it gets to work!
{"timestamp":"2023-06-22T19:16:06.092710053Z","level":"info","source":"bbs","message":"bbs.set-lock-held-metron-notifier.started","data":{"session":"5"}}
{"timestamp":"2023-06-22T19:16:06.093315811Z","level":"info","source":"bbs","message":"bbs.task-completion-workpool.starting","data":{"session":"1"}}
...