fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

[fleetd] Unable to query fleetd tables #20397

Closed ksatter closed 1 month ago

ksatter commented 1 month ago

fleetd version: v1.27.0

Operating system: Ubuntu 20.04


💥  Actual behavior

On a fresh VM with a new install of fleetd, queries against fleetd tables intermittently fail with the following error:

vtable constructor failed: <table>

When Orbit initially launched, there was an immediate interrupt:

Jul 10 14:51:52 hostname orbit[11692]: 2024-07-10T14:51:52-07:00 INF orbit version: 1.27.0
Jul 10 14:51:52 hostname orbit[11692]: 2024-07-10T14:51:52-07:00 INF Found osquery version: 5.12.1
Jul 10 14:51:57 hostname orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt serviceChecker
Jul 10 14:51:57 hostname orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt updater
Jul 10 14:51:57 hostname orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt config receivers
Jul 10 14:51:57 hostname orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt osquery
Jul 10 14:51:57 hostname orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt osquery extension

Then osquery starts and the following shows up in the logs:

Jul 10 14:58:31 hostname orbit[12235]: I0710 14:58:31.508298 12378 extensions.cpp:348] Extension UUID 63345 has gone away

Following that, the only errors present in the logs are related to vtable construction.

While these errors are present, some fleetd tables will work fine, while others fail. Additional logs are available on request.
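
To see which tables are affected on a given host, the error lines can be tallied by table name. The following is a minimal sketch, assuming the log text has the same shape as the lines quoted in this issue; `failing_tables` and `sample` are hypothetical names used only for illustration.

```python
import re
from collections import Counter

# Matches the error text as it appears in the logs quoted in this issue.
VTABLE_ERROR = re.compile(r"vtable constructor failed: (\S+)")


def failing_tables(log_lines):
    """Count 'vtable constructor failed' errors per table name."""
    counts = Counter()
    for line in log_lines:
        match = VTABLE_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Shortened sample lines in the same format as the orbit journal output.
sample = [
    "E0710 15:25:40.106655 20259 distributed.cpp:187] Error executing distributed query: "
    "fleet_additional_query_chef_policy_name: vtable constructor failed: parse_json",
    "E0710 15:25:40.480124 20259 distributed.cpp:187] Error executing distributed query: "
    "fleet_label_query_19: vtable constructor failed: parse_json",
    "I0710 15:49:25.103096 20260 query.cpp:119] Storing initial results for new scheduled query",
]

for table, count in failing_tables(sample).items():
    print(f"{table}: {count} failure(s)")
```

On a real host, the same function could be fed the lines of `journalctl -u orbit` output instead of the sample list.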

🧑‍💻  Steps to reproduce

  1. TODO
  2. TODO

🕯️ More info (optional)

N/A

QA notes

Testing fleetd change

Conditions to reproduce this bug (it happens on Linux and macOS; I was not able to reproduce it on Windows, but Windows might also be affected):

The customer is not using Fleet Desktop, so make sure to build without --fleet-desktop. (This is not a condition for the bug; it just keeps the test closest to the customer's setup.)

After verifying the bugfix you will also need to verify the following on macOS/Linux/Windows (to check that we haven't broken existing functionality):

Testing fleet server fix #20620

Test with fleetd on Linux, macOS, and Windows.

First to reproduce:

  1. Fleet running with FLEET_OSQUERY_ENROLL_COOLDOWN=5m (and without the fix in https://github.com/fleetdm/fleet/pull/20620)
  2. Enroll/install fleetd
  3. Wait 5 minutes
  4. Uninstall fleetd
  5. Install fleetd again
  6. You will see the following errors (which are incorrect because the cooldown is already over):
    level=error ts=2024-07-19T19:36:02.053332Z component=http user=unauthenticated method=POST uri=/api/v1/osquery/enroll took=5.099692ms hostIdentifier=589966AE-074A-503B-B17B-54B05684A120 err="save enroll failed: host identified by 589966AE-074A-503B-B17B-54B05684A120 enrolling too often"

    And then after 5m of retrying it will eventually enroll (but this is far from ideal).

Now try the same steps again with the fix included: https://github.com/fleetdm/fleet/pull/20620. There should be no "enrolling too often" error, and fleetd should work as usual.
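
The behavior the steps above check for can be sketched as a simple time-window test. This is an illustrative sketch only, not Fleet's implementation or the change made in the linked PR; `enrolling_too_often` and the timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def enrolling_too_often(last_enrolled_at, now, cooldown):
    """Return True if a new enroll request falls inside the cooldown window."""
    if last_enrolled_at is None:
        # A host that has never enrolled is never in cooldown.
        return False
    return now - last_enrolled_at < cooldown


# Mirrors FLEET_OSQUERY_ENROLL_COOLDOWN=5m from the steps above.
cooldown = timedelta(minutes=5)
first_enroll = datetime(2024, 7, 19, 19, 30, 0)  # hypothetical timestamp

# Re-enrolling one minute later should be rejected.
print(enrolling_too_often(first_enroll, first_enroll + timedelta(minutes=1), cooldown))
# Re-enrolling after the cooldown has passed should be accepted.
print(enrolling_too_often(first_enroll, first_enroll + timedelta(minutes=6), cooldown))
```

With the fix in place, the reinstall in step 5 corresponds to the second case: the cooldown has elapsed, so the enroll request should go through on the first attempt.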

yoderme commented 1 month ago

Example lightly redacted log file:

# journalctl -u orbit | head -n 35
Jul 10 14:51:50 redacted systemd[1]: Started Orbit osquery.
Jul 10 14:51:51 redacted orbit[11692]: 2024-07-10T14:51:51-07:00 INF hash(orbit)=01aa7541840930e715cf6f01c182925d15edd05c98fd5049dfe19c45a6e1dd32fb48f288e75295b643da10c279b32c1326e56b9d82b7661fa276b3f4ee165d90
Jul 10 14:51:52 redacted orbit[11692]: 2024-07-10T14:51:52-07:00 INF hash(osqueryd)=b60cb453cc14e30590eea0e49aae37307df3a5b1ed72185af8d0b904a319e59ca3ab0b9e8b8fcd03de66c4f2953895016ff1224e5745a46bb672270ec9904467
Jul 10 14:51:52 redacted orbit[11692]: 2024-07-10T14:51:52-07:00 INF orbit version: 1.27.0
Jul 10 14:51:52 redacted orbit[11692]: 2024-07-10T14:51:52-07:00 INF Found osquery version: 5.12.1
Jul 10 14:51:57 redacted orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt serviceChecker
Jul 10 14:51:57 redacted orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt updater
Jul 10 14:51:57 redacted orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt config receivers
Jul 10 14:51:57 redacted orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt osquery
Jul 10 14:51:57 redacted orbit[11692]: 2024-07-10T14:51:57-07:00 ERR interrupt osquery extension
Jul 10 14:51:57 redacted orbit[11692]: 2024-07-10T14:51:57-07:00 INF start osqueryd cmd="/opt/orbit/bin/osqueryd/linux/stable/osqueryd --pidfile=/opt/orbit/osquery.pid --extensions_socket=/opt/orbit/orbit-osquery.em --logger_path=/opt/orbit/osquery_log --enroll_secret_env ENROLL_SECRET --tls_hostname=redacted.cloud.fleetdm.com --enroll_tls_endpoint=/api/v1/osquery/enroll --config_plugin=tls --config_tls_endpoint=/api/v1/osquery/config --config_refresh=60 --disable_distributed=false --distributed_plugin=tls --distributed_tls_max_attempts=10 --distributed_tls_read_endpoint=/api/v1/osquery/distributed/read --distributed_tls_write_endpoint=/api/v1/osquery/distributed/write --logger_plugin=tls,filesystem --logger_tls_endpoint=/api/v1/osquery/log --disable_carver=false --carver_disable_function=false --carver_start_endpoint=/api/v1/osquery/carve/begin --carver_continue_endpoint=/api/v1/osquery/carve/block --carver_block_size=8000000 --tls_server_certs /opt/orbit/certs.pem --augeas_lenses /opt/orbit/lenses --force --flagfile /opt/orbit/osquery.flags --host-identifier uuid --database_path /opt/orbit/osquery.db"
Jul 10 14:51:57 redacted osqueryd[12233]: osqueryd started [version=5.12.1]
Jul 10 14:51:57 redacted orbit[12235]: W0710 14:51:57.452004 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:51:58 redacted orbit[12235]: W0710 14:51:58.772310 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:52:03 redacted orbit[12235]: W0710 14:52:03.084205 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:52:12 redacted orbit[12235]: W0710 14:52:12.386912 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:52:28 redacted orbit[12235]: W0710 14:52:28.695900 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:52:53 redacted orbit[12235]: W0710 14:52:53.995360 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:53:30 redacted orbit[12235]: W0710 14:53:30.297829 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:54:19 redacted orbit[12235]: W0710 14:54:19.594986 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:55:23 redacted orbit[12235]: W0710 14:55:23.884869 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:56:45 redacted orbit[12235]: W0710 14:56:45.190604 12235 tls_enroll.cpp:101] Failed enrollment request to https://redacted.cloud.fleetdm.com/api/v1/osquery/enroll (No node key returned from TLS enroll plugin) retrying...
Jul 10 14:58:25 redacted orbit[12235]: I0710 14:58:25.492630 12381 interface.cpp:137] Registering extension (com.fleetdm.orbit.osquery_extension.v1, 63345, version=, sdk=)
Jul 10 14:58:25 redacted orbit[12235]: I0710 14:58:25.783989 12235 eventfactory.cpp:352] The minimum events expiration timeout for user_events has been adjusted: 259260
Jul 10 14:58:31 redacted orbit[12235]: I0710 14:58:31.508298 12378 extensions.cpp:348] Extension UUID 63345 has gone away
Jul 10 15:25:40 redacted orbit[12235]: E0710 15:25:40.106655 20259 distributed.cpp:187] Error executing distributed query: fleet_additional_query_chef_policy_name: vtable constructor failed: parse_json
Jul 10 15:25:40 redacted orbit[12235]: E0710 15:25:40.480124 20259 distributed.cpp:187] Error executing distributed query: fleet_label_query_19: vtable constructor failed: parse_json
Jul 10 15:25:40 redacted orbit[12235]: E0710 15:25:40.480571 20259 distributed.cpp:187] Error executing distributed query: fleet_label_query_20: vtable constructor failed: parse_json
Jul 10 15:25:40 redacted orbit[12235]: E0710 15:25:40.481002 20259 distributed.cpp:187] Error executing distributed query: fleet_label_query_21: vtable constructor failed: parse_json
Jul 10 15:49:25 redacted orbit[12235]: I0710 15:49:25.103096 20260 query.cpp:119] Storing initial results for new scheduled query: pack/Global/Docker Image Information - 2024 Feb 5 - 01
Jul 10 16:26:56 redacted orbit[12235]: E0710 16:26:56.458920 20259 distributed.cpp:187] Error executing distributed query: fleet_additional_query_chef_policy_name: vtable constructor failed: parse_json
Jul 10 16:26:56 redacted orbit[12235]: E0710 16:26:56.848855 20259 distributed.cpp:187] Error executing distributed query: fleet_label_query_19: vtable constructor failed: parse_json
Jul 10 16:26:56 redacted orbit[12235]: E0710 16:26:56.849300 20259 distributed.cpp:187] Error executing distributed query: fleet_label_query_20: vtable constructor failed: parse_json
Jul 10 16:26:56 redacted orbit[12235]: E0710 16:26:56.849849 20259 distributed.cpp:187] Error executing distributed query: fleet_label_query_21: vtable constructor failed: parse_json
Jul 10 17:28:37 redacted orbit[12235]: E0710 17:28:37.250033 20259 distributed.cpp:187] Error executing distributed query: fleet_additional_query_chef_policy_name: vtable constructor failed: parse_json
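
In the log above, the gaps between the failed enrollment attempts grow roughly as the square of the attempt number (about 1s, 4s, 9s, and so on up to 81s), which explains why osquery took several minutes to enroll. That is a reading of these timestamps only, not a confirmed description of osquery's retry policy. A minimal sketch that computes the gaps, with the timestamps copied from the log and `gaps_in_seconds` as a hypothetical helper:

```python
from datetime import datetime

# Timestamps copied from the "Failed enrollment request" lines in the log above.
retry_times = [
    "14:51:57.452004",
    "14:51:58.772310",
    "14:52:03.084205",
    "14:52:12.386912",
    "14:52:28.695900",
    "14:52:53.995360",
    "14:53:30.297829",
    "14:54:19.594986",
    "14:55:23.884869",
    "14:56:45.190604",
]


def gaps_in_seconds(stamps):
    """Return the delay in seconds between each pair of consecutive timestamps."""
    parsed = [datetime.strptime(s, "%H:%M:%S.%f") for s in stamps]
    return [(b - a).total_seconds() for a, b in zip(parsed, parsed[1:])]


gaps = gaps_in_seconds(retry_times)
for attempt, gap in enumerate(gaps, start=1):
    print(f"retry {attempt}: waited {gap:6.2f}s (attempt squared = {attempt ** 2})")
```
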
yoderme commented 1 month ago

Other random info

lucasmrod commented 1 month ago

Hi @yoderme! Thanks for the info. A few questions:

Machines are re-imaged frequently. In some environments a third of the machines will fail like this and the others will be fine. Machines that show this failure have later not shown this failure after re-imaging.

sharon-fdm commented 1 month ago

Hey team! Please add your planning poker estimate with Zenhub @getvictor @jacobshandling @lucasmrod @mostlikelee @RachelElysia

lucasmrod commented 1 month ago

@xpkoala @PezHub I've added QA notes to the description.

lucasmrod commented 1 month ago

@ksatter Once the fix is released to stable (sometime next week), orbit on the affected hosts will still need to be restarted, because the bug prevented the auto-update subsystem from running. Once restarted, they will auto-update.

lukeheath commented 1 month ago

@ksatter Reminder to use the fast track for Fleeties when reporting bugs. Thanks!

JoStableford commented 1 month ago

Related to a Slack conversation

fleet-release commented 1 month ago

Fleetd queries fixed,
Smooth as a cloud city's glass,
No more errors missed.