TritonDataCenter / smartos-live

For more information, please see http://smartos.org/ For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss
1.56k stars 244 forks source link

mdata-put randomly hangs #990

Open blackwood821 opened 3 years ago

blackwood821 commented 3 years ago

We have an upgrade routine where after we reprovision a VM to a new image our user-script runs an upgrade process on the zone. Every once in a while our script fails because it runs an mdata-put command that times out. Here is an example outside of our script that recreates the issue:

[root@ds (us-west-agc:bnbntl) ~]# mdata-put set:state success
plat_recv timeout
receive timeout, resetting protocol...
plat_recv timeout
receive timeout, resetting protocol...
plat_recv timeout
receive timeout, resetting protocol...
plat_recv timeout
receive timeout, resetting protocol...

Restarting the vminfod service via svcadm restart vminfod on the CN global always seems to fix the issue.

There is more info available at https://gist.github.com/blackwood821/d343ee38d76a44f5ebb1429828787199.

blackwood821 commented 3 years ago

Here is a recent error from the metadata service log in the global zone:

[2021-07-09T20:38:05.353Z]  INFO: metadata/752543 on cn01: handled GET setup:messages (brand=joyent, zonename=7d38ff30-de00-46af-9a8b-16c41c6220e7, response=SUCCESS)
    query: {
      "cmd": "GET",
      "arg": "setup:messages",
      "req_v": 2,
      "elapsed": 0.33317
    }
[2021-07-09T20:38:30.234Z] DEBUG: metadata/752543 on cn01: createServersOnExistingZones for 38 zones
[2021-07-09T20:38:30.234Z]  INFO: metadata/752543 on cn01: created zone metadata sockets on 0 / 38 zones took 0.000385944s (385.94us)
[2021-07-09T20:39:05.481Z] ERROR: metadata/752543 on cn01: could not set metadata (key "setup:messages") (brand=joyent, zonename=7d38ff30-de00-46af-9a8b-16c41c6220e7)
    MultiError: first of 1 error: first of 1 error: vminfod watchForEvent "VM.js writeAllMetadata (7d38ff30-de00-46af-9a8b-16c41c6220e7: customer_metadata,internal_metadata,tags)" timeout exceeded
        at /usr/vm/node_modules/vasync/lib/vasync.js:99:14
        at /usr/vm/node_modules/VM.js:16887:29
        at /usr/vm/node_modules/VM.js:17072:13
        at /usr/node/0.10/node_modules/async.js:240:13
        at /usr/node/0.10/node_modules/async.js:144:21
        at /usr/node/0.10/node_modules/async.js:237:17
        at /usr/node/0.10/node_modules/async.js:600:34
        at /usr/vm/node_modules/VM.js:17014:21
        at /usr/vm/node_modules/VM.js:3690:9
        at next (/usr/vm/node_modules/vasync/lib/vasync.js:843:6)
    Caused by: MultiError: first of 1 error: vminfod watchForEvent "VM.js writeAllMetadata (7d38ff30-de00-46af-9a8b-16c41c6220e7: customer_metadata,internal_metadata,tags)" timeout exceeded
        at /usr/vm/node_modules/vasync/lib/vasync.js:99:14
        at done (/usr/vm/node_modules/vminfod/client.js:697:9)
        at vminfodEventStreamWatchForEventTimeout (/usr/vm/node_modules/vminfod/client.js:635:13)
        at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)
    Caused by: Error: vminfod watchForEvent "VM.js writeAllMetadata (7d38ff30-de00-46af-9a8b-16c41c6220e7: customer_metadata,internal_metadata,tags)" timeout exceeded
        at vminfodEventStreamWatchForEventTimeout (/usr/vm/node_modules/vminfod/client.js:633:23)
        at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)
[2021-07-09T20:39:05.481Z] ERROR: metadata/752543 on cn01: first of 1 error: first of 1 error: vminfod watchForEvent "VM.js writeAllMetadata (7d38ff30-de00-46af-9a8b-16c41c6220e7: customer_metadata,internal_metadata,tags)" timeout exceeded (brand=joyent, zonename=7d38ff30-de00-46af-9a8b-16c41c6220e7)
[2021-07-09T20:39:05.481Z] ERROR: metadata/752543 on cn01: Socket for 7d38ff30-de00-46af-9a8b-16c41c6220e7 closed before we could write anything. (brand=joyent, zonename=7d38ff30-de00-46af-9a8b-16c41c6220e7)
[2021-07-09T20:39:05.481Z]  INFO: metadata/752543 on cn01: handled PUT setup:messages (brand=joyent, zonename=7d38ff30-de00-46af-9a8b-16c41c6220e7, response=FAILURE)
    MultiError: first of 1 error: first of 1 error: vminfod watchForEvent "VM.js writeAllMetadata (7d38ff30-de00-46af-9a8b-16c41c6220e7: customer_metadata,internal_metadata,tags)" timeout exceeded
        at /usr/vm/node_modules/vasync/lib/vasync.js:99:14
        at /usr/vm/node_modules/VM.js:16887:29
        at /usr/vm/node_modules/VM.js:17072:13
        at /usr/node/0.10/node_modules/async.js:240:13
        at /usr/node/0.10/node_modules/async.js:144:21
        at /usr/node/0.10/node_modules/async.js:237:17
        at /usr/node/0.10/node_modules/async.js:600:34
        at /usr/vm/node_modules/VM.js:17014:21
        at /usr/vm/node_modules/VM.js:3690:9
        at next (/usr/vm/node_modules/vasync/lib/vasync.js:843:6)
    Caused by: MultiError: first of 1 error: vminfod watchForEvent "VM.js writeAllMetadata (7d38ff30-de00-46af-9a8b-16c41c6220e7: customer_metadata,internal_metadata,tags)" timeout exceeded
        at /usr/vm/node_modules/vasync/lib/vasync.js:99:14
        at done (/usr/vm/node_modules/vminfod/client.js:697:9)
        at vminfodEventStreamWatchForEventTimeout (/usr/vm/node_modules/vminfod/client.js:635:13)
        at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)
    Caused by: Error: vminfod watchForEvent "VM.js writeAllMetadata (7d38ff30-de00-46af-9a8b-16c41c6220e7: customer_metadata,internal_metadata,tags)" timeout exceeded
        at vminfodEventStreamWatchForEventTimeout (/usr/vm/node_modules/vminfod/client.js:633:23)
        at Timer.listOnTimeout [as ontimeout] (timers.js:110:15)
    --
    query: {
      "cmd": "PUT",
      "arg": "setup:messages",
      "req_v": 2,
      "elapsed": 60120.467837
    }

I don't see any errors in the vminfod service but that seems to be the one that needs to be restarted each time.