nieuwsma opened 2 years ago
Note: we should find out if there is a limitation on the number of accounts that can be created on the Redfish device... it's possible it's a few dozen.
Clarification: SNMP access to switches applies only to management switches. The HSN switches (Rosetta, Cassini) use Redfish.
I really like this proposal @nieuwsma! However, I do think there is a piece of the puzzle missing for it to be considered the 'ultimate end goal' for device credential management.

For example, given `x3000c0s26b0n0` and `x3000c0s26e0`, the discovery services created `x3000c0s26b0`:
```
ncn-m001:~ # kubectl exec -it -n vault -c vault cray-vault-0 -- sh -c "export VAULT_ADDR=http://localhost:8200; vault kv list secret/hms-creds" | grep x3000c0s26
x3000c0s26b0
x3000c0s26b0n0
x3000c0s26e0
```
The `admn` user creds are stored under `pdu-creds`, outside the purview of HMS/SCSD, and managed by RTS. The RTS Redfish front-end creds are stored under `hms-creds`.
```
ncn-m001:~ # kubectl exec -it -n vault -c vault cray-vault-0 -- sh -c "export VAULT_ADDR=http://localhost:8200; vault kv get secret/hms-creds/x3000c0s26b0n0"
======== Data ========
Key             Value
---             -----
Password        BMC_ROOT_USER_PASSWORD_HERE
SNMPAuthPass    n/a
SNMPPrivPass    n/a
URL             x3000c0s26b0/redfish/v1/Systems/BQWF72600597
Username        root
Xname           x3000c0s26b0n0
```
The URL present in Vault is available via HSM under componentEndpoints:
```
ncn-m001:~ # cray hsm inventory componentEndpoints describe x3000c0s26b0n0 --format json
{
  "ID": "x3000c0s26b0n0",
  "Type": "Node",
  "RedfishType": "ComputerSystem",
  "RedfishSubtype": "Physical",
  "UUID": "1663ee00-0991-11e7-906e-00163566263e",
  "OdataID": "/redfish/v1/Systems/BQWF72600597",
  "RedfishEndpointID": "x3000c0s26b0",
  "Enabled": true,
  "RedfishEndpointFQDN": "x3000c0s26b0",
  "RedfishURL": "x3000c0s26b0/redfish/v1/Systems/BQWF72600597",
  "ComponentEndpointType": "ComponentEndpointComputerSystem",
  "RedfishSystemInfo": {
    <truncated>
  }
}
```
Does this requirement include the management of default device credentials? Or is this proposal strictly for managing device credentials after they are determined during hardware discovery?
The Implementation Considerations -> Fidelity section states the following:

> Under no circumstances should the system lose the ability to manage the devices
For context, we currently keep track of the following default credentials within the system in Vault:
- `root` user
- `root` user
- `admn` user

Here are scenarios that would prevent us from managing hardware. Granted, there are a few scenarios here where we didn't "lose the ability" to manage hardware, as we never had it for hardware that is new to the system.
Types of defaults: `root` credentials to the BMC. In the case of HPE PDUs and ServerTech PDUs with newer firmware, an admin has to log in using the well-known default credentials and then change the password away from the defaults for the user to become functional.
Having a place to get default credential information would be valuable: the discovery services would have a higher degree of success determining the correct login information for a BMC, and it would reduce the admin and triage time required to bring a new piece of hardware into the system.
Great write-up, @nieuwsma.
Before I provide feedback in earnest, would you please entertain a few questions (realizing there is some overlap with threads from other reviewers)?
- Is there a way to constrain the number of human or machine principals that must have direct access to credentials? i.e., there are many services in this model that still must directly access the credential to do their job. This is exposure that we should try to minimize. Consider the patterns that Hashicorp Boundary espouses here, but this could be something as simple as an API layer that provides session tokens vs. raw passwords as an iteration.
No, not really. All the HMS services need this type of access, and groups like slingshot need Redfish access as well. The session idea is interesting, but presents a scaling and IPC issue.
- Can you speak succinctly to the use case of zero-trust provisioning of devices in scope? Specifically, how does your proposal intersect or improve upon our capability to 'not ship default credentials' on impacted devices (including a sparing or default creds situation)?
I'm not sure I understand the first part of your question. By centralizing this and creating the tool that effectively owns the creds, we can make sure that any 'baked' creds get changed ASAP. This proposal would allow the system to get off of default credentials quickly... maybe not before initial 'power on', but quickly thereafter.
- Do you see a future state where access to devices in scope is managed via cryptographic identities vs. passwords?
Not from the HMS perspective honestly. We are still quite password bound. But if WE are the only spot that needs direct BMC access (except Slingshot) we reduce that exposure.
Proposal approved as it currently exists. Suggestions and feedback follow.
--
First, orthogonal to your proposal, I'm not a fan of design reviews in GitHub Issues. Perhaps we can switch to algol60-based Google Docs to keep the conversations public while moving to a more conducive forum (issues in GitHub for workload, design docs linked and published upon approval, ...)?
I suggest the proposal be renamed to reflect the focus on RedFish (or other suitable sub-domain) credential management strategy, for disambiguation.
I also suggest you pull someone in from the PET Team to reason about performance impacts to Vault. My guess is that this feature set could actually reduce calls to Vault, but it might be good to cover.
As follow up to our running threads:
- Do you see a future state where access to devices in scope is managed via cryptographic identities vs. passwords?
I know very little about Redfish, or the DMTF. Looking at the DMTF Security Protocol and Data Model (SPDM) Specification at https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.2.0.pdf, it appears mutual authentication via cryptographic identities is in spec (see section 7.5). Put another way, a method for the requestor (e.g., a service in HMS) to prove to a Redfish-managed device that it should be allowed access (authentication). I also acknowledge that everything you seek to manage may not 'speak' Redfish.
For multiple reasons, including those I'll touch upon in the 'zero trust' thread, I think we should be moving away from password-based authentication models. I'll otherwise digress in commentary here.
Can you speak succinctly to the use case of zero-trust provisioning of devices in scope? Specifically, how does your proposal intersect or improve upon our capability to 'not ship default credentials' on impacted devices (including a sparing or default creds situation)?
Thanks. So it doesn't really speak to zero trust from my perspective (as overloaded as I find this concept), but it does speak to an ability to a) quickly reset default passwords and b) make sure default passwords don't sneak back in?
"Towards" zero trust, a representative architecture could include OEM/ODM trusted provisioning of cryptographic hardware identities -- this to prove to a platform (e.g., CSM) that the device has known provenance. Then, for the platform to prove itself to the device, the platform would need to present a cryptographic identity of its own, issued by a source (signed) that the hardware device trusts. There is a similar pattern in play for ~ TPM hardware.
Is there a way to constrain the number of human or machine principals that must have direct access to credentials? i.e., there are many services in this model that still must directly access the credential to do their job. This is exposure that we should try to minimize. Consider the patterns that Hashicorp Boundary espouses here, but this could be something as simple as an API layer that provides session tokens vs. raw passwords as an iteration.
If I understand your proposal and our 1:1 discussion earlier this week, you're moving credential access into a single API. Notably as credentials could change at any time, I don't see how use of sessions vs. password distribution (to other services) differs in terms of scale? As discussed, there is a very compelling security reason not to broadly distribute creds that are not easily revoked to N services. I realize moving to anything other than passwords for all of these services is currently a huge lift, but please consider the use of sessions in your implementation, notably as this seems like a fair refactoring effort.
As you are moving to a single service for credential management, the implementation should include limiting access to Vault (for hardware credentials) to this service, and auditing functionality that expresses what services are requesting which credentials from this API, at what time, etc. This in your service, as Vault will only see access from it.
And some specific feedback towards your requirements table:
FR4 - This may be true for certain secret stores in Hashicorp Vault, but not all, and the use of Vault transcends the KV engine. Your point stands though: service principal access needs to be more granular, both from a security and a reliability perspective. This is true of workloads 'running in' Kubernetes, and current patterns in use around system management orchestration that powers our control and data planes.
FR5 - Auditing, and perhaps some capability to do credential escrow (save last N credentials), would be good points to speak to. Also, with respect to FR3, Vault should ideally also be leveraged for PRNG. This as applications should really not look to roll PRNG/crypto primitives/key generation on their own. This speaks to your 'high entropy' response to FR3 as part of objective 1, et al.
FR7 - As part of a future vision, I suggest adding that our implementation of Vault needs a hardware root of trust story. Today, the security of Vault generalizes to the security of CSM's Kubernetes configuration. For this proposal and as a general security primitive, our secret management solution needs to evolve along these lines.
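On the PRNG point above: wherever the randomness ultimately comes from (Vault, or the language's CSPRNG), applications should not invent their own generation scheme. A minimal sketch in Go using the standard `crypto/rand` package; the charset and length here are illustrative assumptions, not a stated policy:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// charset for generated credentials; adjust to each device's password policy.
const charset = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// generatePassword returns an n-character password. crypto/rand.Int performs
// uniform sampling, so there is no modulo bias over the charset.
func generatePassword(n int) (string, error) {
	out := make([]byte, n)
	max := big.NewInt(int64(len(charset)))
	for i := range out {
		idx, err := rand.Int(rand.Reader, max)
		if err != nil {
			return "", err
		}
		out[i] = charset[idx.Int64()]
	}
	return string(out), nil
}

func main() {
	pw, err := generatePassword(16)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(pw) == 16) // high-entropy credential of the requested length
}
```

Note that Vault itself can also serve as the entropy source (it exposes a random-bytes endpoint under `sys/tools`), which keeps key generation inside the secure enclave.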
Great write up.
Question:
Providing answers to some questions:
In terms of Vault, I believe we are adequately protected in terms of backups, but a user create/update function should kick off an immediate backup. This should be relatively easy to build if we implement the process via Argo or build it into the update function itself.
In terms of additional load on Vault, I believe we should be fine. Vault has been very solid since we moved to Raft vs. the etcd provided by our etcd operator. We also have plans to provide a faster storage tier if needed, but that will rely on our Milan hardware spec or on moving Vault to master nodes. Either way, I believe we are covered here.
Concerns:
- This may be my lack of knowledge of our complete set of hardware-specific BMCs, but are we ruling out LDAP authentication as a future evolution? This question is more to further my understanding. I know you stated "We are still quite password bound" but I am not sure if we have devices that will just never support LDAP.
Yes, to my knowledge LDAP is not supported on BMCs. It's a basic-auth and SSH-keys-only type of environment.
Concerns:
- Something that we need to call out is GB vs. HPE nodes. HPE, I believe, has some credentials on a sticker on the system. I am not sure what we are doing with those credentials, but normally that would be what an on-site tech would use when diagnosing hardware issues. Are we building in the ability to toggle that on/off?
Yes, we plan on 'burning' any default credential on the system. The only legitimate password will be the passwords we set.
Excellent!
To extend:
This is excellent!!! 👍
My only suggestion regarding the API would be to evaluate whether it'd make sense to add the capability to list all stored credentials in a way that permits the synchronisation with external password managers. This can be very useful for sites having personnel doing on-call interventions in the event that SCSD is unavailable.
I would like to see some detail around the expected handling of edge/corner cases (replaced/moved/reworked endpoints, etc.). Steady state doesn't bother me; it is the less-travelled paths that keep me awake at night for work like this.
Solutions are not required, but documenting minimal expected behaviour, ramifications for service events and catastrophic data loss, other gotchas, etc. would help frame this proposal better.
Great callout Andy. This document is a high-level design and I would ultimately expect the implementation team to really flesh that out, but I'll provide some comments.
I agree with your concern. What 'concerns' me the most is the blade swap or 'unexpected' rediscovery scenario. I would expect that we would need to come up with a process, and wherever possible automation, to move credentials around. This is especially tricky if the CECs haven't been pre-programmed.
The negative effect of these processes going wrong is mostly that debugging gets harder, because it may take some time to realize that the expected credential is not accurate. The ramification is that a manual, physical hardware recovery process might need to be undertaken to rectify the device credential. One-offs would be annoying but manageable, but losing credentials for a larger number of nodes would probably be exponentially more frustrating.
I think ultimately, as we go towards implementation, understanding the direct manufacturing and field impacts would be good.
Andrew Nieuwsma | 2022-05-04 | Initial Approval CASM Complete. Needs customer review and other internal review.
Abstract
This proposal outlines several strategic objectives to increase our security posture for 'management' hardware: BMCs, switches, and PDUs. The major focus of this proposal will impact BMCs most directly, as they are the most numerous and are accessed most frequently. From this point on I will refer to all devices (PDUs, switches, BMCs) simply as 'devices'; I will clarify where necessary.
There is a large spectrum of possibilities for how we can manage device credentials. The options have different costs to implement and maintain, and they fit into a 'defense in depth' paradigm. This proposal suggests an 'ultimate end goal' for device credential management and offers directional options to maximize return on investment.
The following is a comparison between a desired end goal and current system maturity.
There is undoubtedly a huge work package for all of these items to progress to a 'desirable' future state.
Proposed Solutions
Objectives
I propose the following evolutionary progression to increase our security posture:
*Note: I refer to SCSD in this document; SCSD is the current home of the 'change device credentials' logic. SCSD may not remain the home of the credentials API, but for now I will assume it shall.
OBJECTIVE 1:
CSM manages all device credentials.
To implement this, we would require the ability to
We should create automation around this functionality such that a password rotation utility could be run periodically or on-demand. This utility would be run during the installation process (and in short order after a factory reset of a device) to change the default passwords baked into the firmware. This limits the exposure time of default well known credentials. This functionality might be a separate credential daemon, or might be built into SCSD.
If this objective was met, the following functional requirements would be complete*:
OBJECTIVE 2:
Implement least privileges and RBAC
This is a very large lift and would require phases of development and deployment:
If this objective was met, the following functional requirements would be complete*:
Implementation Considerations
Fidelity
Under no circumstances should the system lose the ability to manage the devices. Credentials must always be retained in a manner where system management is possible. The risk of being locked out of the devices would entail a very manual and costly process that would degrade customer operations until complete. Mountain devices would have to be manually logged into and have their eMMC wiped (`emmcnuke`); then new firmware has to be flashed onto the device.

We trust Vault as our secure enclave for device credentials. Vault must be backed up at a high frequency. I suggest that it should be continually backed up, as password changes can happen at any time. Furthermore, there must be strong processes around data egress related to system update and fresh install. If a customer completes a fresh install we should not force them to manually reset all devices in their system.
Single source of truth
Vault is the single operational source of truth for all credentials. However, there are many services (FAS, CAPMC, SLS, REDS, MEDS, HSM, PCS, ConMan, etc.) that all directly access Vault for BMC credentials. While Vault has an API, its K/V implementation is more akin to a data storage layer. The weak schema of the K/V store is a liability: there is no robust concept of schema management in Vault, so the K/V store schema cannot be updated without forcing all clients to the latest schema, else data loss is imminent.
To illustrate this point, consider the following hypothetical schemas:
If the schema is expanded to:
Any clients that marshal a v2 data structure into a v1 object will drop the `role` field. If an older client writes the record back into the data store, the `role` field will not be included in the payload and will therefore be deleted for all clients.

To this end, the K/V store should track a 'schema' version. While this does not prohibit old clients from consuming data in a manner that is destructive, it would help clients have a 'hint' on whether or not they are compatible.
Furthermore, I propose that the number of clients allowed direct access to the Vault K/V store be limited to one: SCSD (or whatever credentials service we create).
This would allow for several optimizations:
There are several trade-offs:
- Data must move from its current location (`hms-creds`). Some level of data migration must exist; this introduces risk of failure.
- Services accessing `hms-creds` are using the HMS Go pkgs `hms-securestorage` and `hms-compcredentials`. Perhaps there is a way we can update the library to effectively keep the same interface, although all clients must still upgrade to the latest pkg.

Questions
Appendix
System Diagram
(golang) to provide a cache invalidation layer that keeps local credentials up to date.

Vault Schema
Current Schema
This is the current layout for the credentials that HMS depends on.
Proposed Schema for Objective 1
This is a proposed schema layout to support Objective 1. It does not have any role information, which would probably be an extension implemented as part of Objective 2. I'm not sure how to store 'schema' versioning in Vault; perhaps it's a record-by-record datum.
Thoughts about APIs and Libraries
I've put some considerable thought into what the API should look like for this service. We want to balance flexibility of extension with what we want or need immediately. A few concerns come to mind about the nature of the data.
Today there is only one stored account per xname in Vault. If we allow multiple accounts in Vault per xname, how would a service or an admin distinguish which account to use? It's possible that there could be unique account names and unique permissions or roles for each account (on each xname). Furthermore, the account 'root' or 'foo' could have different permissions in actuality between two different xnames; roles are not universal across all hardware vendor implementations. This only gets more complicated when we consider different types of credentials: OS, IPMI, Redfish, etc.
The bare minimum need is the ability to create some type of administrator service account (perhaps named `administrator-service-account`) that should have `root` privileges via Redfish and be the expected 'well known' user account the services should use.

I'm thinking that a library that uses the credentials API should have the ability to create a new user with those permissions if they don't exist.
I've been doing reading about GraphQL APIs. They serve to close the gap, common in straight REST APIs, of over-fetching (getting too much data we don't care about) and under-fetching (having to make many serial calls to get all the information we do care about). I think this is ripe for GraphQL consideration. An admin might fetch a single credential, but most of our micro-services work on batches of xnames.
Suggested Reviewers
- [ ] @atifsyedali
- [x] @jeremy-duckworth
- [ ] @alexanderkingh
- [x] @jsollom-hpe
- [x] @rsjostrand-hpe
Comment Period
Comment period for this proposal shall close on June 8, 2022.