[osquery 5.11] macOS keychain corrupted after querying the `certificates` table

ksatter commented 1 year ago

UPDATE: In #14126 the Fleet team wrote a script that sent 30k requests to this endpoint and were unable to cause/reproduce a corruption. We're going to move forward w/ this solution:

Part 1

Identify the forbidden and allowed APIs and what data would be available if we used a new API
Once we've identified the new API, align with the team on what the limitations of the new API are

Part 2

If we decide new API is useful, then:

Implement a fix to the certificates table in osquery to use the new macOS API to get this data

Fleet version: 4.34.1 osquery version: 5.8.2

🧑‍💻 Expected behavior

When running a query against the certificates table, I do expect the macOS keychain to not be corrupted.

💥 Actual behavior

Querying the certificates table on macOS in some cases causes the user's keychain to become corrupted.

👣 Reproduction steps

This is not a consistent issue and we have not been able to personally reproduce it, but we have two reports from customers.

Running a query using the certificates table causes this issue on a small percentage of hosts, and removing that query resolves this issue.

More info

This appears to be related to macOS changes implemented on the path to moving away from filebased keychains
Related bug tickets in osquery
- https://github.com/osquery/osquery/issues/7780
- My take away from this, is that at some point, our keychain tables (keychain_items and certificates) are on the way out. Our usage is inherently file-based, and that's all deprecated.
- https://github.com/osquery/osquery/issues/7800
- We are seeing occasional corruption of Login Keychains across a fleet of 1000s of Mac OS devices. We have not observed any correlation (or significant occurence) of System Keychain corruption.
- Our baseline was to be running SELECT * from certificates every 60 seconds as part of a device health compliance system. With this in-place as a query_pack we see ~0.3% of assets experience Login Keychain corruption every week, with the systems >experiencing the bad event tending to experience it on an ongoing recurrent basis. There is a correlation with older hardware >experiencing the brunt of the impact; newer systems are less likely to be affected.
Other potentially affected tables:
- keychain_acls
- keychain_items
- These were the tables found when searching the osquery repo for instances of the deprecated api calls that a user reported on in the osquery bug tickets. Those calls are, SecKeychainOpen and SecKeychainGetPath

sharon-fdm commented 1 year ago

@xpkoala Is this something you could help reproducing?

ksatter commented 1 year ago

@dherder Is also attempting to reproduce.

xpkoala commented 1 year ago

Summary: I don't believe this is something that needs QA's help at the moment. This appears to be a reliably reproducible issue that can occur when using the certificates and keychain_items tables via osquery. I imagine @lucasmrod or @zwass would be the best people to continue research and working with the osquery team on a fix.

A few notes after reading the linked documents:

This is an osquery issue and not something that is confined to Fleet.
keychain_items and certificates are most likely both affected by a change in how the mac keychain now operates.
This issue is still being researched and there is no determination on how to fix the issue.
A data point on this occurring from one user when running a query that includes SELECT * from certificates on a 1 minute frequency against 1,000 hosts: "we see ~0.3% of assets experience Login Keychain corruption every week"
When this occurs the keychain becomes corrupted and a new keychain is created upon restart causing a loss of all saved passwords and credentials.

Apple's current response on the reported issue, "an abnormal amount of reads may lead to corruption of the keychain" link

Links pulled from the above tickets:

dherder commented 1 year ago

I have the query SELECT * from certificates; executing every 60 seconds on a test vm macos 13.4.1. I haven't noticed any keychain corruption as of yet, but there could be one other variable at play. Perhaps the login keychain would need to be accessed with some other process (like a user certificate being leveraged for wifi access, etc).

zhumo commented 1 year ago

one potential solution mentioned here: https://github.com/osquery/osquery/issues/7780#issuecomment-1276415202, is to implement a caching layer. This might significantly reduce the number of corruptions, which might be good enough.

sharon-fdm commented 1 year ago

Task: 1 - Assign someone to investigate whether we can do a midterm implementation in fleetd. The new table will use newer MAC APIs that will have "less power" (Similar but not include all original info) 2 - when we hjave the initial investigation results, consider whether we actually want the new table.

zayhanlon commented 1 year ago

@sharon-fdm Any status update on getting this into next sprint maybe?

sharon-fdm commented 1 year ago

@zayhanlon It's in our bugs backlog. @zhumo please advise if it's prioritized properly.

zayhanlon commented 1 year ago

Okay - no problem! I thought we had it marked critical and I just remembered we downgraded.

zhumo commented 1 year ago

Hi @zayhanlon, we believe this is caused by an Apple API. They acknowledge in their documentation that rapid usage of that API endpoint will cause this issue. As such, we recommend that the customer use this table sparingly. I'd recommend not using this table to check the certs once every minute.

That said, the work that this issue reflects will be to switch this table to use Apple's new API which does not have this corruption issue. However, we know that the new APIs have less information, so it may not be useful to the customer.

sharon-fdm commented 1 year ago

If the customer uses Fleetd we can do it as an extension. Otherwise, we should go with an osquery core fix.

mikermcneil commented 1 year ago

Otherwise, we should go with an osquery core fix.

Let's go straight to the source and fix the root cause in osquery. That doesn't mean we have to "fix" everything-- just that we need to fix the broken user experience. For example, a valid solution could be to make this fail if you try to query it more than once per minute.

Thanks @zhumo for the mitigation of adding this to the docs in the meantime: https://github.com/fleetdm/fleet/pull/13975

noahtalerman commented 1 year ago

Mo: Instead of fixing the table, for this issue, come to confidence about a rate of corruption and under what conditions (number of requests).

1 host, 1 request per second and see at what number of requests the corruption occurs. TODO

noahtalerman commented 1 year ago

@georgekarrv can you please file a new testing issue to come to confidence about the rate of corruption? Please @ Isabell that this is a CX bug that the MDM team is helping out with when you file the new ticket. Thanks!

zhumo commented 1 year ago

Removing the bug label. We are testing the system to determine the true rate of corruption and adding that to the table documentation. Researching & implementing new tables as described here should be considered a new feature. @ksatter

lukeheath commented 11 months ago

@noahtalerman I am moving this bug back to product drafting. In #14126 we wrote a script that sent 30k requests to this endpoint and were unable to cause a corruption. It sounds like we should move forward with the approach Mike outlined above:

Let's go straight to the source and fix the root cause in osquery. That doesn't mean we have to "fix" everything-- just that we need to fix the broken user experience. For example, a valid solution could be to make this fail if you try to query it more than once per minute.

noahtalerman commented 11 months ago

@lukeheath thank!

Please add the :product label + either g-cx or g-mdm label and assign me when you move a bug back to product drafting. This way, the bug ends up on the product drafting board and doesn't get lost.

getvictor commented 10 months ago

@ksatter The PR for osquery speculative fix is here: https://github.com/osquery/osquery/pull/8192

How are the customers waiting for this fix deploying osquery -- osqueryd or fleetd? Do they want to use/test the fix, or wait for the next osquery release, which may not be until next year.

Also, this fix is speculative. There is a chance it might not work, and we may need to do a different fix.

ksatter commented 10 months ago

@getvictor There's a mix, but primarily using fleetd.

sharon-fdm commented 10 months ago

@zayhanlon @noahtalerman This was fix in osquery core and in review now. Once approved we will take the new osq and close this issue

fleet-release commented 9 months ago

API shifts in the breeze, Keychains safe, data at ease, Queries find their peace.

lukeheath commented 8 months ago

@sharon-fdm @noahtalerman When should we close this ticket? Do we want to wait until this version of osquery is deployed to our TUF server?

noahtalerman commented 8 months ago

Hmm, I think it would be best for our customers/community if we wait until osquery 5.11 is on the stable channel to close this.

I don't think we do this in the current process: wait to close bugs that require osquery changes until we ship osquery to stable.

@lukeheath maybe we do want to use milestones for osquery versions (in addition to Orbit).

That way, you could send bugs that require osquery changes back to the confirm and celebrate column in the drafting board so they could go through the confirm and celebrate process: wait to close until we ship to stable.

What do you thunk?

lucasmrod commented 8 months ago

Hmm, I think it would be best for our customers/community if we wait until osquery 5.11 is on the stable channel to close this.

osquery 5.11.0 is on the stable channel (as of Friday the 2nd).

lukeheath commented 8 months ago

@noahtalerman Sounds like this is ready to close. I'm game for creating milestones. To be clear, you want osquery bugs to go back to drafting for confirm and celebrate?

fleet-release commented 8 months ago

New API in sight, Mac's keychain glows bright, Fleet ensures peace at night.

noahtalerman commented 8 months ago

To be clear, you want osquery bugs to go back to drafting for confirm and celebrate?

@lukeheath yes, I think that makes sense. Added a note to our scrum process call to discuss.

fleetdm / fleet