ksatter closed this issue 8 months ago
@xpkoala Is this something you could help reproducing?
@dherder Is also attempting to reproduce.
Summary: I don't believe this is something that needs QA's help at the moment. This appears to be a reliably reproducible issue that can occur when using the certificates and keychain_items tables via osquery. I imagine @lucasmrod or @zwass would be the best people to continue the research and work with the osquery team on a fix.
A few notes after reading the linked documents:
- keychain_items and certificates are most likely both affected by a change in how the Mac keychain now operates.
- SELECT * from certificates on a 1 minute frequency against 1,000 hosts: "we see ~0.3% of assets experience Login Keychain corruption every week"
- Apple's current response on the reported issue: "an abnormal amount of reads may lead to corruption of the keychain" (link)
Links pulled from the above tickets:
I have the query SELECT * from certificates; executing every 60 seconds on a test VM running macOS 13.4.1. I haven't noticed any keychain corruption yet, but there could be another variable at play. Perhaps the login keychain also needs to be accessed by some other process (for example, a user certificate being used for Wi-Fi access).
One potential solution, mentioned here: https://github.com/osquery/osquery/issues/7780#issuecomment-1276415202, is to implement a caching layer. This might significantly reduce the number of corruptions, which might be good enough.
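To make the caching idea concrete, here is a minimal sketch of a time-based cache placed in front of an expensive read. It is not the approach from the linked osquery comment or from any osquery code; the class name `TTLCache`, the `fetch` callable, and the 300-second lifetime are all assumptions chosen for illustration.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Reuse one expensive result until it is older than ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical stand-in for the keychain read
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested without sleeping
        self._value: Any = None
        self._fetched_at: Optional[float] = None

    def get(self) -> Any:
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at >= self._ttl
        if expired:
            # Only this branch touches the underlying source.
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


if __name__ == "__main__":
    reads = []
    cache = TTLCache(lambda: reads.append(1) or ["cert-a", "cert-b"], ttl_seconds=300)
    for _ in range(5):
        cache.get()
    print(f"5 queries, {len(reads)} underlying read(s)")
```

With a cache like this, a query scheduled every 60 seconds would reach the underlying source at most once per cache lifetime, at the cost of results that can be stale by up to that lifetime.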
Task: 1 - Assign someone to investigate whether we can do a midterm implementation in fleetd. The new table will use newer macOS APIs that will have "less power" (similar, but not including all of the original info). 2 - When we have the initial investigation results, consider whether we actually want the new table.
@sharon-fdm Any status update on getting this into next sprint maybe?
@zayhanlon It's in our bugs backlog. @zhumo please advise if it's prioritized properly.
Okay - no problem! I thought we had it marked critical and I just remembered we downgraded.
Hi @zayhanlon, we believe this is caused by an Apple API. They acknowledge in their documentation that rapid usage of that API endpoint will cause this issue. As such, we recommend that the customer use this table sparingly. I'd recommend not using this table to check the certs once every minute.
That said, the work that this issue reflects will be to switch this table to use Apple's new API which does not have this corruption issue. However, we know that the new APIs have less information, so it may not be useful to the customer.
If the customer uses Fleetd we can do it as an extension. Otherwise, we should go with an osquery core fix.
Otherwise, we should go with an osquery core fix.
Let's go straight to the source and fix the root cause in osquery. That doesn't mean we have to "fix" everything-- just that we need to fix the broken user experience. For example, a valid solution could be to make this fail if you try to query it more than once per minute.
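As an illustration of the "fail if you query it more than once per minute" idea, the sketch below rejects a call that arrives too soon after the last accepted one. It is written in Python for brevity and is not how osquery implements anything; `MinIntervalGuard` and `QueryTooFrequentError` are hypothetical names.

```python
import time
from typing import Callable, Optional


class QueryTooFrequentError(RuntimeError):
    """Raised when a query arrives before the minimum interval has elapsed."""


class MinIntervalGuard:
    """Allow at most one accepted call per min_interval_seconds."""

    def __init__(
        self,
        min_interval_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval = min_interval_seconds
        self._clock = clock
        self._last_accepted: Optional[float] = None

    def check(self) -> None:
        now = self._clock()
        if self._last_accepted is not None:
            elapsed = now - self._last_accepted
            if elapsed < self._min_interval:
                wait = self._min_interval - elapsed
                raise QueryTooFrequentError(
                    f"queried {elapsed:.0f}s after the previous query; retry in {wait:.0f}s"
                )
        # Rejected calls do not reset the window; only accepted ones do.
        self._last_accepted = now


if __name__ == "__main__":
    guard = MinIntervalGuard(min_interval_seconds=60.0)
    guard.check()  # first query is accepted
    try:
        guard.check()  # immediate second query is rejected
    except QueryTooFrequentError as err:
        print(f"rejected: {err}")
```

Failing loudly, rather than silently returning cached or empty rows, makes the limit visible to whoever scheduled the query.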
Thanks @zhumo for the mitigation of adding this to the docs in the meantime: https://github.com/fleetdm/fleet/pull/13975
Mo: Instead of fixing the table, for this issue, establish with confidence the rate of corruption and the conditions under which it occurs (number of requests).
1 host, 1 request per second and see at what number of requests the corruption occurs. TODO
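A test loop along these lines could look like the sketch below. It is not the script from #14126. It assumes `osqueryi` is on the PATH and is run with its `--json` flag, and it leaves the corruption check to the caller: `keychain_ok` is a hypothetical callable, since this thread does not say how corruption is detected.

```python
import json
import subprocess
import time
from typing import Callable, List, Optional

QUERY = "SELECT * FROM certificates;"


def run_osqueryi(query: str) -> List[dict]:
    """Run one query through the osqueryi shell and return its rows."""
    result = subprocess.run(
        ["osqueryi", "--json", query],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


def hammer(
    run_query: Callable[[str], object],
    keychain_ok: Callable[[], bool],
    max_requests: int,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Return the request count at which keychain_ok() first fails, or None."""
    for request_number in range(1, max_requests + 1):
        run_query(QUERY)
        if not keychain_ok():  # hypothetical health check supplied by the caller
            return request_number
        sleep(interval_seconds)
    return None


if __name__ == "__main__":
    # Placeholder check that always passes; replace with a real keychain check.
    failed_at = hammer(run_osqueryi, keychain_ok=lambda: True, max_requests=10)
    print("no corruption observed" if failed_at is None else f"corrupted at request {failed_at}")
```

Passing `run_query`, `keychain_ok`, and `sleep` in as arguments keeps the loop testable without osquery installed.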
@georgekarrv can you please file a new testing issue to come to confidence about the rate of corruption? Please @ Isabell that this is a CX bug that the MDM team is helping out with when you file the new ticket. Thanks!
Removing the bug label. We are testing the system to determine the true rate of corruption and adding that to the table documentation. Researching & implementing new tables as described here should be considered a new feature. @ksatter
@noahtalerman I am moving this bug back to product drafting. In #14126 we wrote a script that sent 30k requests to this endpoint and were unable to cause a corruption. It sounds like we should move forward with the approach Mike outlined above:
Let's go straight to the source and fix the root cause in osquery. That doesn't mean we have to "fix" everything-- just that we need to fix the broken user experience. For example, a valid solution could be to make this fail if you try to query it more than once per minute.
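For context on the 30k-request result, a back-of-the-envelope calculation can combine it with the rate reported earlier in this thread (about 0.3% of hosts per week at one query per minute). The sketch assumes every query carries the same small, independent risk. That is a simplification the thread does not confirm, and another comment suggests other keychain activity may be a factor.

```python
MINUTES_PER_WEEK = 60 * 24 * 7  # queries per host per week at one query per minute

weekly_corruption_fraction = 0.003  # "~0.3% of assets" per week, from the customer report
test_requests = 30_000              # size of the test run described in #14126

# Assumption: each query independently carries the same small corruption risk.
per_query_probability = weekly_corruption_fraction / MINUTES_PER_WEEK

expected_corruptions = test_requests * per_query_probability
probability_of_any = 1 - (1 - per_query_probability) ** test_requests

print(f"per-query probability:      {per_query_probability:.2e}")
print(f"expected corruptions:       {expected_corruptions:.4f}")
print(f"chance of at least one hit: {probability_of_any:.2%}")
```

Under that assumption, a single 30k-request run has a little under a 1% chance of producing a corruption, so a clean run is consistent with the reported rate rather than evidence against it.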
@lukeheath thanks!
Please add the :product label + either the g-cx or g-mdm label and assign me when you move a bug back to product drafting. This way, the bug ends up on the product drafting board and doesn't get lost.
@ksatter The PR for the speculative osquery fix is here: https://github.com/osquery/osquery/pull/8192
How are the customers waiting for this fix deploying osquery: osqueryd or fleetd? Do they want to use/test the fix, or wait for the next osquery release, which may not be until next year?
Also, this fix is speculative. There is a chance it might not work, and we may need to do a different fix.
@getvictor There's a mix, but primarily using fleetd.
@zayhanlon @noahtalerman This was fixed in osquery core and is in review now. Once it is approved, we will take the new osquery version and close this issue.
API shifts in the breeze, Keychains safe, data at ease, Queries find their peace.
@sharon-fdm @noahtalerman When should we close this ticket? Do we want to wait until this version of osquery is deployed to our TUF server?
Hmm, I think it would be best for our customers/community if we wait until osquery 5.11 is on the stable channel to close this.
I don't think we do this in the current process: wait to close bugs that require osquery changes until we ship osquery to stable.
@lukeheath maybe we do want to use milestones for osquery versions (in addition to Orbit).
That way, you could send bugs that require osquery changes back to the confirm and celebrate column in the drafting board so they could go through the confirm and celebrate process: wait to close until we ship to stable.
What do you think?
Hmm, I think it would be best for our customers/community if we wait until osquery 5.11 is on the stable channel to close this.
osquery 5.11.0 is on the stable channel (as of Friday the 2nd).
@noahtalerman Sounds like this is ready to close. I'm game for creating milestones. To be clear, you want osquery bugs to go back to drafting for confirm and celebrate?
New API in sight, Mac's keychain glows bright, Fleet ensures peace at night.
To be clear, you want osquery bugs to go back to drafting for confirm and celebrate?
@lukeheath yes, I think that makes sense. Added a note to our scrum process call to discuss.
UPDATE: In #14126 the Fleet team wrote a script that sent 30k requests to this endpoint and were unable to cause/reproduce a corruption. We're going to move forward w/ this solution:
Part 1
Part 2
If we decide new API is useful, then:
certificates table in osquery to use the new macOS API to get this data
Fleet version: 4.34.1
osquery version: 5.8.2
🧑💻 Expected behavior
When running a query against the certificates table, I expect the macOS keychain not to be corrupted.
💥 Actual behavior
Querying the certificates table on macOS in some cases causes the user's keychain to become corrupted.
👣 Reproduction steps
This is not a consistent issue and we have not been able to personally reproduce it, but we have two reports from customers.
Running a query using the certificates table causes this issue on a small percentage of hosts, and removing that query resolves this issue.
More info
This appears to be related to macOS changes implemented on the path to moving away from file-based keychains.
Related bug tickets in osquery
https://github.com/osquery/osquery/issues/7780
https://github.com/osquery/osquery/issues/7800
Other potentially affected tables:
keychain_acls
keychain_items
SecKeychainOpen and SecKeychainGetPath