fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

[osquery 5.13] Linux: When caching users, all LDAP users are returned, causing performance issues in LDAP directory. #18343

Open ksatter opened 2 months ago

ksatter commented 2 months ago

Fleet version: v4.48.2

Web browser and operating system:


💥  Actual behavior

When caching users, all users are cached, including LDAP users on Linux hosts:

WITH cached_groups AS (SELECT * FROM groups)
SELECT uid, username, type, groupname, shell
FROM users LEFT JOIN cached_groups USING (gid)
WHERE type <> 'special'
  AND shell NOT LIKE '%/false'
  AND shell NOT LIKE '%/nologin'
  AND shell NOT LIKE '%/shutdown'
  AND shell NOT LIKE '%/halt'
  AND username NOT LIKE '%$'
  AND username NOT LIKE '\_%' ESCAPE '\'
  AND NOT (username = 'sync' AND shell = '/bin/sync' AND directory <> '')

As a result, a query that uses cached users (like vscode_extensions) retrieves all LDAP users, generating a large amount of traffic to the LDAP directory.

This was noticed when osquery generated a large number of INFO logs related to vscode_extensions, one for each user in the LDAP directory.

TODO

🧑‍💻  Steps to reproduce

  1. TODO
  2. TODO

🕯️ More info (optional)

N/A

nonpunctual commented 2 months ago

note: in order for this to happen I believe it must be the case that the computers in this environment are bound to the LDAP in such a way that the users table looks up all users in the org.

sharon-fdm commented 2 months ago

TODO: @lucasmrod will set a tech discussion/slack thread to agree on solution and estimation.

lucasmrod commented 2 months ago

We'd need to configure an LDAP server on a linux host and check if there's a way to filter out these non-local accounts. If there's no way to do it from the query, we might need osquery changes to allow filtering them out.

nonpunctual commented 2 months ago

This is exactly what I was thinking @lucasmrod - there is some range of UIDs that is for local vs. service vs. ??? accounts (e.g., over 1000), or maybe there is some way to whitelist actual local accounts (like a UID at the directory level vs. at the local level). But it would be logical to also leave in a way to get "all", with some kind of warning that IF you're bound to a directory service, all MEANS ALL.

sharon-fdm commented 2 months ago

Estimation is for reproduction only. (Build whatever is needed)

lucasmrod commented 2 months ago

@noahtalerman @nonpunctual

To summarize the discussions with the customer, their issue with the "users" and "software" queries is:

To gather "users" and "software" on linux devices we SELECT the users table.

Usually when querying SELECT * FROM users on a linux device, the machine reads the users from /etc/passwd, which is not an expensive operation.

But the customer has configured their linux hosts to use an LDAP directory for authentication. This means that when querying SELECT * FROM users the hosts query the whole directory (not just local users, but ALL users). This led to the following issue, quoted from the customer:

[...] bad event happened because 160,000 hosts each requested 21,000 users every X minutes in a query_pack.

To fix the "users" and "software" query on hosts that use an LDAP directory for authentication I can see the following options so far:

A. Change "users" and "software" queries for all linux hosts to return only users that have logged in (by filtering with the osquery last table). Good: Solution should be somewhat simple. Bad: This would cause all linux devices to not return users that have never logged in (even if they exist in /etc/passwd).

B. Detect in some way when linux hosts use an LDAP directory for authentication. I haven't found anything that vanilla osquery can provide to detect this, maybe we can do something via fleetd tables. But obviously, if we use fleetd tables then this solution won't work for vanilla osquery deployments. Good: We don't change the behavior of linux hosts that do not use LDAP directory for authentication. Bad: I still have to find a good way to determine if a host uses LDAP for authentication. E.g. use file_lines to process some specific files. (Not simple solution AFAICS.)

C. Add a limit, e.g. 100 to reduce the number of users that are processed every time SELECT * FROM users is executed. Would not solve the issue but would reduce the load to the customer's LDAP directory ~200-fold.
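For a sense of scale, here is a back-of-the-envelope check of the numbers above in Python. The host and user counts are the customer's figures quoted earlier in this comment, and the cap of 100 is the example from option C. It assumes one directory entry fetched per user per host per run, and that a limit would actually stop the enumeration early. Neither assumption has been verified here.

```python
# Customer figures quoted in this comment.
HOSTS = 160_000
DIRECTORY_USERS = 21_000
# Example cap from option C.
LIMIT = 100

# Assumption: each host fetches every directory user once per query run.
entries_per_run = HOSTS * DIRECTORY_USERS
# Assumption: with a cap, each host fetches at most LIMIT users per run.
entries_per_run_capped = HOSTS * min(LIMIT, DIRECTORY_USERS)
reduction_factor = entries_per_run // entries_per_run_capped

print(f"uncapped: {entries_per_run:,} entries per run")
print(f"capped:   {entries_per_run_capped:,} entries per run")
print(f"reduction: {reduction_factor}x")
```

Under these assumptions the cap cuts the load by a factor of 210, which is in line with the "~200-fold" estimate.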

--

I'm more on the side of (A), but given it has some impact on non-LDAP linux hosts we may want to discuss it.
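A minimal sketch of what option (A) could look like, run against an in-memory SQLite database rather than real osquery. The two tables are stand-ins with only the columns needed here (the real osquery `users` and `last` tables have more), the sample rows are made up, and the query is an illustration, not the query Fleet ships. Whether osquery would resolve such a join without enumerating the whole directory is a separate question this sketch does not answer.

```python
import sqlite3


def users_with_logins(conn: sqlite3.Connection) -> list:
    """Return (uid, username) for users that appear at least once in "last"."""
    return conn.execute(
        """
        SELECT DISTINCT u.uid, u.username
        FROM users AS u
        JOIN "last" AS l ON l.username = u.username
        ORDER BY u.uid
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-in tables with made-up rows (assumption, not real osquery data).
    CREATE TABLE users (uid INTEGER, username TEXT);
    CREATE TABLE "last" (username TEXT, time INTEGER);
    INSERT INTO users VALUES (1000, 'alice'), (1001, 'bob'), (1002, 'carol');
    INSERT INTO "last" VALUES
        ('alice', 1715000000), ('alice', 1715100000), ('carol', 1715200000);
    """
)

rows = users_with_logins(conn)
print(rows)
```

Here `bob` exists in `users` but never logged in, so he is dropped from the result. That is the "Bad" side of option (A) described above.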

--

Other things I've tried/considered:

nonpunctual commented 2 months ago

good analysis @lucasmrod!

https://www.baeldung.com/linux/user-ids-reserved-values Have you looked into this? Basically 3 UID ranges:

system users = 0-99
application users = 100-999
regular users = 1000-9999

It will not be totally reliable, because an admin user can reassign their own UID, but, for the most part, I think we can rely on users having UID of +1000 but in the lower ranges (i.e., 1st local user gets 1000, 2nd gets 1001, etc.) Anything outside of these practical ranges could be ignored?
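As a rough sketch of the idea, the three ranges quoted above can be expressed as a small Python classifier. The boundaries are taken as stated in this comment. Actual ranges vary by distribution and site policy, and the function name and labels are made up for illustration.

```python
def classify_uid(uid: int) -> str:
    """Classify a UID using the ranges quoted in this comment (an assumption)."""
    if uid < 0:
        raise ValueError("UID must be non-negative")
    if uid <= 99:
        return "system"
    if uid <= 999:
        return "application"
    if uid <= 9999:
        return "regular"
    # Anything above the quoted ranges, e.g. directory-assigned UIDs.
    return "out-of-range"


for sample in (0, 500, 1000, 1001, 20000):
    print(sample, classify_uid(sample))
```

A filter built on this would keep only the "regular" bucket, with the caveat already noted: UIDs can be reassigned, so the result is a heuristic.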

Some combination of this reality & solution C seems like the way to me.

For solution B I think we would almost have to be looking at log events or tcp / packet which seems expensive. https://www.baeldung.com/linux/ldap-command-line-authentication this article is kind of interesting & might point you towards some logged events (auth) that could be monitored.

noahtalerman commented 2 months ago

Thanks @lucasmrod! Looking forward to digging into this during our call tomorrow.

lucasmrod commented 2 months ago

It will not be totally reliable, because an admin user can reassign their own UID, but, for the most part, I think we can rely on users having UID of +1000 but in the lower ranges (i.e., 1st local user gets 1000, 2nd gets 1001, etc.) Anything outside of these practical ranges could be ignored?

The problem is that to get the uids in the first place you have to list the users (which means ldap requests).

C. Add a limit, e.g. 100 to reduce the number of users that are processed every time SELECT * FROM users is executed. Would not solve the issue but would reduce the load to the customer's LDAP directory ~200-fold.

A problem I'm seeing now with solution C and the UID-based one is that, say, SELECT * FROM users LIMIT 100; will return 100 users that may not be local users on the host we are querying...

For solution B I think we would almost have to be looking at log events or tcp / packet which seems expensive. https://www.baeldung.com/linux/ldap-command-line-authentication this article is kind of interesting & might point you towards some logged events (auth) that could be monitored.

I'll keep looking but there doesn't seem to be a clean way to find out.

lucasmrod commented 1 month ago

Dummy diagram of what's going on with the users table with Linux hosts that use LDAP for authentication:

graph TB;
    fleet_server["Fleet server"];
    ldap_directory["LDAP directory server 🔥<br>20k users"];
    hostA["Linux Host A<br>(used by 4 users)"];
    hostB["Linux Host B<br>(used by 2 users)"];
    hostC["Linux Host C<br>(used by no users)"];

    fleet_server -- "SELECT * FROM users; (twice every 1h)" --> hostA;
    hostA -- 20k LDAP requests --> ldap_directory;
    fleet_server -- "SELECT * FROM users; (twice every 1h)" --> hostB;
    hostB -- 20k LDAP requests --> ldap_directory;
    fleet_server -- "SELECT * FROM users; (twice every 1h)" --> hostC;
    hostC -- 20k LDAP requests --> ldap_directory;

lucasmrod commented 1 month ago

@nonpunctual @noahtalerman As discussed:

noahtalerman commented 1 month ago

Hey @lucasmrod, I think the plan you, Brock, and I decided to move forward with was slightly different? (from Google doc here)


Thinking was we can move faster for our customers by adding a new local_users table to fleetd and/or osquery. Use this table for host vitals instead.

I'm on board with your plan to update the users table itself. Maybe @zwass can help us move the change through the steering committee?

But if this change is slow moving in osquery, then I think we should fall back to what you, Brock, and I decided.

lucasmrod commented 1 month ago

Yes, I forgot to add the option of a separate table local_users with the behavior. (Adding that to the osquery issue now.)

The one problem with the local_users table (that I found after our discussion) is that it will not have the JOIN semantics needed/used in other tables (there's a lot of osquery code that uses the users table for the JOIN capabilities). Such semantics could be added to the local_users table, but it would be no small change in the osquery code base AFAICS. Maybe @zwass can find a workaround like aliasing? Something like SELECT * FROM (SELECT * FROM local_users) users JOIN chrome_extensions [...]
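To illustrate only the SQL side of the aliasing idea, here is a small Python sketch using an in-memory SQLite database. `local_users` is the proposed (hypothetical) table, both tables are plain stand-ins with made-up rows, and nothing here reproduces how osquery passes JOIN constraints to its virtual tables, which is the hard part mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical stand-ins; the real tables are osquery virtual tables.
    CREATE TABLE local_users (uid INTEGER, username TEXT);
    CREATE TABLE chrome_extensions (uid INTEGER, name TEXT);
    INSERT INTO local_users VALUES (1000, 'alice'), (1001, 'bob');
    INSERT INTO chrome_extensions VALUES
        (1000, 'ext-a'), (1000, 'ext-b'), (1001, 'ext-c'), (2000, 'ext-d');
    """
)

# Alias the subquery as "users" so the rest of the query reads as before.
aliased_rows = conn.execute(
    """
    SELECT users.username, chrome_extensions.name
    FROM (SELECT * FROM local_users) AS users
    JOIN chrome_extensions USING (uid)
    ORDER BY users.username, chrome_extensions.name
    """
).fetchall()
print(aliased_rows)
```

The row for uid 2000 (standing in for a directory-only account) does not appear, because only accounts present in `local_users` take part in the join.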

noahtalerman commented 1 month ago

Such semantics could be added to the local_users table but it will be no small change in the osquery code base AFAICS.

Ah, I see. Let's see what other folks in the osquery community think.

@lucasmrod maybe you could hop into the osquery office hours next week? To help drive us towards a decision. If you're free.

lucasmrod commented 1 month ago

maybe you could hop into the osquery office hours next week? To help drive us towards a decision. If you're free.

Will do.

noahtalerman commented 1 month ago

Hey @lucasmrod, did you make it to osquery office hours?

lucasmrod commented 1 month ago

Sorry for not posting the update here: the decision in the osquery office hours was to make the users table implementation match its documentation, "Local user accounts" (https://osquery.io/schema/5.12.1/#users). This means the table won't return LDAP users, only local users defined in /etc/passwd (and therefore querying it won't hammer LDAP directories with requests).

So I'm currently working on making these changes to hopefully include them as part of 5.13.0. /cc @zwass
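For illustration, this is roughly what "only local users defined in /etc/passwd" means in practice: parse the file's colon-separated lines directly instead of asking the system account database, which may be backed by LDAP. The Python sketch below is not the osquery implementation, and the names (`PasswdEntry`, `parse_passwd`) and sample content are made up.

```python
from typing import List, NamedTuple


class PasswdEntry(NamedTuple):
    username: str
    uid: int
    gid: int
    directory: str
    shell: str


def parse_passwd(text: str) -> List[PasswdEntry]:
    """Parse passwd-format text: name:password:uid:gid:gecos:dir:shell."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 7:
            continue  # Skip malformed lines.
        name, _password, uid, gid, _gecos, directory, shell = fields
        if not (uid.isdigit() and gid.isdigit()):
            continue  # Skip entries without numeric IDs.
        entries.append(PasswdEntry(name, int(uid), int(gid), directory, shell))
    return entries


# Made-up sample content; on a real host this would be read from /etc/passwd.
SAMPLE = """root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin
alice:x:1000:1000:Alice,,,:/home/alice:/bin/bash
"""

local_entries = parse_passwd(SAMPLE)
for entry in local_entries:
    print(entry.uid, entry.username, entry.shell)
```

Reading the file this way never contacts a directory server, so the cost stays proportional to the number of local accounts.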

lucasmrod commented 1 week ago

Hi folks! This change (https://github.com/osquery/osquery/issues/8337) will be released in osquery 5.13.0 (sometime ~July).