Open ksatter opened 2 months ago
note: in order for this to happen I believe it must be the case that the computers in this environment are bound to the LDAP in such a way that the users table looks up all users in the org.
TODO: @lucasmrod will set a tech discussion/slack thread to agree on solution and estimation.
We'd need to configure an LDAP server on a linux host and check if there's a way to filter out these non-local accounts. If there's no way to do it from the query, we might need osquery changes to allow filtering them out.
This is exactly what I was thinking @lucasmrod - there is some range of UIDs that is for local vs. service vs. ??? (e.g., over 1000), or maybe there is some way to whitelist actual local accounts (like a UID at the directory level vs. at the local level). But it would be logical to also leave in a way to get "all", with some kind of warning that IF you're bound to a directory service, all MEANS ALL.
Estimation is for reproduction only. (Build whatever is needed)
@noahtalerman @nonpunctual
To summarize the discussions with the customer, their issue with the "users" and "software" queries is:
To gather "users" and "software" on linux devices we SELECT the users table.
Usually when querying SELECT * FROM users on a linux device, the machine reads the users from /etc/passwd, which is not an expensive operation.
But the customer has configured their linux hosts to use an LDAP directory for authentication. This means that when querying SELECT * FROM users the hosts are querying the whole directory (not local users, but ALL users). Which led to the following issue, quote from the customer:
[...] bad event happened because 160,000 hosts each requested 21,000 users every X minutes in a query_pack.
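To put the customer's numbers in perspective, here is a back-of-the-envelope sketch. It assumes every host enumerates the full directory on each run; the interval ("every X minutes") is unknown, so only the per-round total is computed.

```python
# Rough per-round load estimate from the customer's figures.
# Assumption: each host enumerates every directory user on each run.
hosts = 160_000
directory_users = 21_000

entries_per_round = hosts * directory_users
print(f"{entries_per_round:,} user entries requested per round")
```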
To fix the "users" and "software" query on hosts that use an LDAP directory for authentication I can see the following options so far:
A. Change "users" and "software" queries for all linux hosts to return only users that have logged in (by filtering with the osquery last table).
Good: Solution should be somewhat simple.
Bad: This would cause all linux devices to not return users that have never logged in (even if they exist in /etc/passwd).
B. Detect in some way when linux hosts use an LDAP directory for authentication. I haven't found anything that vanilla osquery can provide to detect this, maybe we can do something via fleetd tables. But obviously, if we use fleetd tables then this solution won't work for vanilla osquery deployments.
Good: We don't change the behavior of linux hosts that do not use LDAP directory for authentication.
Bad: I still have to find a good way to determine if a host uses LDAP for authentication. E.g. use file_lines to process some specific files. (Not simple solution AFAICS.)
C. Add a limit, e.g. 100, to reduce the number of users that are processed every time SELECT * FROM users is executed. Would not solve the issue but would reduce the load to the customer's LDAP directory ~200-fold.
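A minimal sketch of the trade-off in option (A), using Python's sqlite3 with stand-in tables rather than real osquery. The table contents and the reduced column sets are made up for illustration, and how osquery's own virtual tables would generate rows for such a join is not shown here.

```python
import sqlite3

# Stand-in tables; real osquery `users` and `last` tables have more columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (uid INTEGER, username TEXT)")
conn.execute('CREATE TABLE "last" (username TEXT, time INTEGER)')

# Hypothetical data: root and bob exist but have never logged in.
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(0, "root"), (1000, "alice"), (1001, "bob"), (50001, "ldap_carol")],
)
conn.executemany(
    'INSERT INTO "last" VALUES (?, ?)',
    [("alice", 1700000000), ("alice", 1700003600), ("ldap_carol", 1700007200)],
)

# Keep only users that appear in the login history.
logged_in = conn.execute(
    "SELECT DISTINCT u.uid, u.username FROM users u "
    'JOIN "last" l ON l.username = u.username ORDER BY u.uid'
).fetchall()
print(logged_in)
```

In this toy data, bob is dropped even though he is a local account, which is the "Bad" listed under (A).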
--
I'm more on the side of (A), but given it has some impact on non-LDAP linux hosts we may want to discuss it.
--
Other things I've tried/considered:
shell_history requires JOINing with users to work, so suffers from the same issue (the hosts would be querying ALL users).
/home/% directory. It could work but users can change their /home directory to other locations by setting $HOME
...good analysis @lucasmrod!
https://www.baeldung.com/linux/user-ids-reserved-values Have you looked into this? Basically 3 UID ranges:
system users = 0-99
application users = 100-999
regular users = 1000-9999
It will not be totally reliable, because an admin user can reassign their own UID, but, for the most part, I think we can rely on users having UID of +1000 but in the lower ranges (i.e., 1st local user gets 1000, 2nd gets 1001, etc.) Anything outside of these practical ranges could be ignored?
Some combination of this reality & solution C seems like the way to me.
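For illustration, the three ranges quoted above could be expressed as a small classifier. This is only a sketch: `classify_uid` is a made-up helper, the boundaries are the ones from the linked article (distributions may differ), and it only labels a UID that is already known.

```python
def classify_uid(uid: int) -> str:
    """Label a UID using the ranges quoted in this thread (not universal)."""
    if 0 <= uid <= 99:
        return "system"
    if 100 <= uid <= 999:
        return "application"
    if 1000 <= uid <= 9999:
        return "regular"
    # Anything else falls outside the "practical" ranges discussed here.
    return "outside"


for uid in (0, 150, 1000, 1001, 50001):
    print(uid, classify_uid(uid))
```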
For solution B I think we would almost have to be looking at log events or tcp / packet which seems expensive. https://www.baeldung.com/linux/ldap-command-line-authentication this article is kind of interesting & might point you towards some logged events (auth) that could be monitored.
Thanks @lucasmrod! Looking forward to digging into this during our call tomorrow.
It will not be totally reliable, because an admin user can reassign their own UID, but, for the most part, I think we can rely on users having UID of +1000 but in the lower ranges (i.e., 1st local user gets 1000, 2nd gets 1001, etc.) Anything outside of these practical ranges could be ignored?
The problem is that to get the uids in the first place you have to list the users (which means ldap requests).
C. Add a limit, e.g. 100 to reduce the number of users that are processed every time SELECT * FROM users is executed. Would not solve the issue but would reduce the load to the customer's LDAP directory ~200-fold.
A problem I'm seeing now with solution C, and the uid-based one is that, say, SELECT * FROM users LIMIT 100; will return 100 users that may not be local users in the host we are querying...
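A small sketch of that LIMIT problem, again with sqlite3 and a stand-in table. It assumes the directory users happen to be enumerated before the local ones, which is an assumption for illustration only; the `source` column is hypothetical and does not exist in osquery's users table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# `source` is a hypothetical column, used only to tell the rows apart here.
conn.execute("CREATE TABLE users (uid INTEGER, username TEXT, source TEXT)")

# Assumption: 21,000 directory users are listed before the 2 local users.
directory = [(50_000 + i, f"ldap_user_{i}", "ldap") for i in range(21_000)]
local = [(1000, "alice", "local"), (1001, "bob", "local")]
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", directory + local)

limited = conn.execute(
    "SELECT uid, username, source FROM users ORDER BY rowid LIMIT 100"
).fetchall()
local_in_limited = [row for row in limited if row[2] == "local"]

# Load reduction from capping at 100 rows, using the customer's figure.
reduction = 21_000 / 100

print(len(limited), "rows returned,", len(local_in_limited), "of them local")
print(f"~{reduction:.0f}-fold fewer users processed")
```

Under that ordering assumption the capped result contains no local users at all, even though the load drops by roughly the ~200-fold mentioned for (C).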
For solution B I think we would almost have to be looking at log events or tcp / packet which seems expensive. https://www.baeldung.com/linux/ldap-command-line-authentication this article is kind of interesting & might point you towards some logged events (auth) that could be monitored.
I'll keep looking but there doesn't seem to be a clean way to find out.
Dummy diagram of what's going on with the users table with Linux hosts that use LDAP for authentication:
graph TB;
fleet_server["Fleet server"];
ldap_directory["LDAP directory server 🔥<br>20k users"];
hostA["Linux Host A<br>(used by 4 users)"];
hostB["Linux Host B<br>(used by 2 users)"];
hostC["Linux Host C<br>(used by no users)"];
fleet_server -- "SELECT * FROM users; (twice every 1h)" --> hostA;
hostA -- 20k LDAP requests --> ldap_directory;
fleet_server -- "SELECT * FROM users; (twice every 1h)" --> hostB;
hostB -- 20k LDAP requests --> ldap_directory;
fleet_server -- "SELECT * FROM users; (twice every 1h)" --> hostC;
hostC -- 20k LDAP requests --> ldap_directory;
@nonpunctual @noahtalerman As discussed:
[...] features.detail_query_overrides).
Hey @lucasmrod, I think the plan you, Brock, and I decided to move forward with was slightly different? (from Google doc here)
Thinking was we can move faster for our customers by adding a new local_users table to fleetd and/or osquery. Use this table for host vitals instead.
I'm onboard with your plan to update the users table itself. Maybe @zwass can help us move the change through the steering committee?
But, if this change is slow moving in osquery then I think we should fall back to what you Brock and I decided.
Yes, I forgot to add the option of a separate table local_users with the behavior. (Adding that to the osquery issue now.)
The one problem with the local_users table (that I found after our discussion) is that it will not have the JOIN semantics needed/used in other tables (there's a lot of osquery code that uses the users table for the JOIN capabilities). Such semantics could be added to the local_users table but it will be no small change in the osquery code base AFAICS.
Maybe @zwass can find a workaround like aliasing? Something like SELECT * FROM (SELECT * FROM local_users) users JOIN chrome_extensions [...]
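The aliasing idea is at least valid SQL, as this sqlite3 sketch with stand-in tables shows. Note that local_users is the proposed (not existing) table, the data is made up, and the sketch says nothing about whether osquery's virtual tables would receive the constraints they need through such an alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-ins with reduced columns; local_users is the proposed table.
conn.execute("CREATE TABLE local_users (uid INTEGER, username TEXT)")
conn.execute("CREATE TABLE chrome_extensions (uid INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO local_users VALUES (?, ?)",
    [(1000, "alice"), (1001, "bob")],
)
conn.executemany(
    "INSERT INTO chrome_extensions VALUES (?, ?)",
    [(1000, "ext-a"), (1000, "ext-b"), (50001, "ext-c")],
)

# Alias a subquery over local_users as `users`, then JOIN as usual.
rows = conn.execute(
    "SELECT users.username, chrome_extensions.name "
    "FROM (SELECT * FROM local_users) users "
    "JOIN chrome_extensions USING (uid) "
    "ORDER BY users.username, chrome_extensions.name"
).fetchall()
print(rows)
```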
Such semantics could be added to the local_users table but it will be no small change in the osquery code base AFAICS.
Ah, I see. Let's see what other folks in the osquery community think.
@lucasmrod maybe you could hop into the osquery office hours next week? To help drive us towards a decision. If you're free.
maybe you could hop into the osquery office hours next week? To help drive us towards a decision. If you're free.
Will do.
Hey @lucasmrod, did you make it to osquery office hours?
Sorry for not posting the update here:
The decision in the osquery office hours was to make the users table implementation match the documentation "Local user accounts" (https://osquery.io/schema/5.12.1/#users). This means the table won't return "LDAP" users, only local users defined in /etc/passwd (and therefore querying this table won't cause issues with LDAP requests hammering LDAP directories).
So I'm currently working on making these changes to hopefully include them as part of 5.13.0. /cc @zwass
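To illustrate what "only local users defined in /etc/passwd" means in practice, here is a rough Python sketch that parses passwd-format text directly instead of going through the system's user lookup. This is not the osquery implementation; `parse_passwd`, `LocalUser`, and the sample content are made up for the example.

```python
from typing import List, NamedTuple


class LocalUser(NamedTuple):
    username: str
    uid: int
    gid: int
    directory: str
    shell: str


def parse_passwd(text: str) -> List[LocalUser]:
    """Parse passwd-format text (name:passwd:uid:gid:gecos:home:shell)."""
    users = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        # Skip malformed lines and entries without numeric ids.
        if len(fields) != 7 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue
        name, _passwd, uid, gid, _gecos, home, shell = fields
        users.append(LocalUser(name, int(uid), int(gid), home, shell))
    return users


# Made-up sample content; on a real host you would read /etc/passwd.
SAMPLE = """root:x:0:0:root:/root:/bin/bash
alice:x:1000:1000:Alice:/home/alice:/bin/bash
+::::::
"""

local_users = parse_passwd(SAMPLE)
print([u.username for u in local_users])
```

Reading the file this way never contacts the directory, so the cost stays proportional to the number of local accounts.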
Hi folks! This change (https://github.com/osquery/osquery/issues/8337) will be released in osquery 5.13.0 (sometime ~July).
Fleet version: v4.48.2
Web browser and operating system:
💥 Actual behavior
When users are gathered for caching, all users are cached, including LDAP users on Linux hosts:
As a result, a query that utilizes cached users (like vscode_extensions) retrieves all LDAP users, resulting in a large amount of traffic to the LDAP directory.
This was noticed when osquery generated a large number of INFO logs related to vscode_extensions, one for each user in the LDAP directory.
TODO
🧑💻 Steps to reproduce
🕯️ More info (optional)
N/A