nieuwsma opened 2 years ago
Note: we should find out if there is a limitation on the number of accounts that can be created on the Redfish device... it's possible it's a few dozen.
Clarification: SNMP access to switches applies only to management switches. The HSN switches (Rosetta, Cassini) use Redfish.
I really like this proposal @nieuwsma! However, I do think there is a piece of the puzzle missing for it to be considered the 'ultimate end goal' for device credential management.

For example, given `x3000c0s26b0n0` and `x3000c0s26e0`, the discovery services created `x3000c0s26b0`:
```
ncn-m001:~ # kubectl exec -it -n vault -c vault cray-vault-0 -- sh -c "export VAULT_ADDR=http://localhost:8200; vault kv list secret/hms-creds" | grep x3000c0s26
x3000c0s26b0
x3000c0s26b0n0
x3000c0s26e0
```
The `admn` user creds are stored under `pdu-creds`, outside the purview of HMS/SCSD, and managed by RTS. The RTS Redfish front-end creds are stored under `hms-creds`.
```
ncn-m001:~ # kubectl exec -it -n vault -c vault cray-vault-0 -- sh -c "export VAULT_ADDR=http://localhost:8200; vault kv get secret/hms-creds/x3000c0s26b0n0"
======== Data ========
Key             Value
---             -----
Password        BMC_ROOT_USER_PASSWORD_HERE
SNMPAuthPass    n/a
SNMPPrivPass    n/a
URL             x3000c0s26b0/redfish/v1/Systems/BQWF72600597
Username        root
Xname           x3000c0s26b0n0
```
The URL present in Vault is available via HSM under componentEndpoints:
```
ncn-m001:~ # cray hsm inventory componentEndpoints describe x3000c0s26b0n0 --format json
{
  "ID": "x3000c0s26b0n0",
  "Type": "Node",
  "RedfishType": "ComputerSystem",
  "RedfishSubtype": "Physical",
  "UUID": "1663ee00-0991-11e7-906e-00163566263e",
  "OdataID": "/redfish/v1/Systems/BQWF72600597",
  "RedfishEndpointID": "x3000c0s26b0",
  "Enabled": true,
  "RedfishEndpointFQDN": "x3000c0s26b0",
  "RedfishURL": "x3000c0s26b0/redfish/v1/Systems/BQWF72600597",
  "ComponentEndpointType": "ComponentEndpointComputerSystem",
  "RedfishSystemInfo": {
    <truncated>
  }
}
```
Does this requirement include the management of default device credentials? Or is this proposal strictly for managing device credentials after they are determined during hardware discovery?
The Implementation Considerations -> Fidelity section states the following:

> Under no circumstances should the system lose the ability to manage the devices
For context, we currently keep track of the following default credentials within the system in Vault:
- `root` user
- `root` user
- `admn` user

Here are scenarios that would prevent us from managing hardware. Granted, there are a few scenarios here where we didn't "lose the ability" to manage hardware, as we never had it for hardware that is new to the system.
Types of defaults: `root` credentials to the BMC. In the case of HPE PDUs and ServerTech PDUs with newer firmware, an admin has to log in using the well-known default credentials and then change the password away from the defaults for the user to become functional.
Having a place to get default credential information would be valuable: the discovery services would have a higher degree of success determining the correct login information for a BMC, and it would reduce the admin and triage time required to bring a new piece of hardware into the system.
Great write-up, @nieuwsma.
Before I provide feedback in earnest, would you please entertain a few questions (realizing there is some overlap with threads from other reviewers)?
- Is there a way to constrain the number of human or machine principals that must have direct access to credentials? i.e., there are many services in this model that still must directly access the credential to do their job. This is exposure that we should try to minimize. Consider the patterns that Hashicorp Boundary espouses here, but this could be something as simple as an API layer that provides session tokens vs. raw passwords as an iteration.
No, not really. All the HMS services need this type of access, and groups like slingshot need Redfish access as well. The session idea is interesting, but presents a scaling and IPC issue.
- Can you speak succinctly to the use case of zero-trust provisioning of devices in scope? Specifically, how does your proposal intersect or improve upon our capability to 'not ship default credentials' on impacted devices (including a sparing or default creds situation)?
I'm not sure I understand the first part of your question. By centralizing this and creating the tool that effectively owns the creds, we can make sure that any 'baked' creds get changed ASAP. This proposal would allow the system to get off of default credentials quickly... maybe not before initial 'power on', but quickly thereafter.
- Do you see a future state where access to devices in scope is managed via cryptographic identities vs. passwords?
Not from the HMS perspective honestly. We are still quite password bound. But if WE are the only spot that needs direct BMC access (except Slingshot) we reduce that exposure.
Proposal approved as it currently exists. Suggestions and feedback follow.
--
First, orthogonal to your proposal, I'm not a fan of design reviews in GitHub Issues. Perhaps we can switch to algol60-based Google Docs to keep the conversations public while moving to a more conducive forum (issues in GitHub for workload, design docs linked and published upon approval, ...)?
I suggest the proposal be renamed to reflect the focus on RedFish (or other suitable sub-domain) credential management strategy, for disambiguation.
I also suggest you pull someone in from the PET Team to reason about performance impacts to Vault. My guess is that this feature set could actually reduce calls to Vault, but it might be good to cover.
As follow up to our running threads:
- Do you see a future state where access to devices in scope is managed via cryptographic identities vs. passwords?
I know very little about Redfish, or the DMTF. Looking at the DMTF Security Protocol and Data Model (SPDM) Specification at https://www.dmtf.org/sites/default/files/standards/documents/DSP0274_1.2.0.pdf, it appears mutual authentication via cryptographic identities is in spec (see section 7.5). Put another way, a method for the requestor (e.g., a service in HMS) to prove to a Redfish-managed device that it should be allowed access (authentication). I also acknowledge that everything you seek to manage may not 'speak' Redfish.
For multiple reasons, including those I'll touch upon in the 'zero trust' thread, I think we should be moving away from password-based authentication models. I'll otherwise digress in commentary here.
Can you speak succinctly to the use case of zero-trust provisioning of devices in scope? Specifically, how does your proposal intersect or improve upon our capability to 'not ship default credentials' on impacted devices (including a sparing or default creds situation)?
Thanks. So it doesn't really speak to zero trust from my perspective (as overloaded as I find this concept), but it does speak to an ability to a) quickly reset default passwords and b) make sure default passwords don't sneak back in?
"Towards" zero trust, a representative architecture could include OEM/ODM trusted provisioning of cryptographic hardware identities -- this to prove to a platform (e.g., CSM) that the device has known provenance. Then, for the platform to prove itself to the device, the platform would need to present a cryptographic identity of its own, issued by a source (signed) that the hardware device trusts. There is a similar pattern in play for ~ TPM hardware.
Is there a way to constrain the number of human or machine principals that must have direct access to credentials? i.e., there are many services in this model that still must directly access the credential to do their job. This is exposure that we should try to minimize. Consider the patterns that Hashicorp Boundary espouses here, but this could be something as simple as an API layer that provides session tokens vs. raw passwords as an iteration.
If I understand your proposal and our 1:1 discussion earlier this week, you're moving credential access into a single API. Notably as credentials could change at any time, I don't see how use of sessions vs. password distribution (to other services) differs in terms of scale? As discussed, there is a very compelling security reason not to broadly distribute creds that are not easily revoked to N services. I realize moving to anything other than passwords for all of these services is currently a huge lift, but please consider the use of sessions in your implementation, notably as this seems like a fair refactoring effort.
As you are moving to a single service for credential management, the implementation should include limiting access to Vault (for hardware credentials) to this service, and auditing functionality that expresses what services are requesting which credentials from this API, at what time, etc. This in your service, as Vault will only see access from it.
And some specific feedback towards your requirements table:
FR4 - This may be true for certain secret stores in Hashicorp Vault, but not all, and the use of Vault transcends the KV engine. Your point stands though: service principal access needs to be more granular, both from a security and a reliability perspective. This is true of workloads 'running in' Kubernetes, and current patterns in use around system management orchestration that powers our control and data planes.
FR5 - Auditing, and perhaps some capability to do credential escrow (save last N credentials), would be good points to speak to. Also, with respect to FR3, Vault should ideally also be leveraged for PRNG. This as applications should really not look to roll PRNG/crypto primitives/key generation on their own. This speaks to your 'high entropy' response to FR3 as part of objective 1, et al.
FR7 - As part of a future vision, I suggest adding that our implementation of Vault needs a hardware root of trust story. Today, the security of Vault generalizes to the security of CSM's Kubernetes configuration. For this proposal and as a general security primitive, our secret management solution needs to evolve along these lines.
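On the PRNG point above: wherever the randomness ultimately comes from (Vault, or the language's CSPRNG), applications should not invent their own generation scheme. A minimal sketch in Go using the standard `crypto/rand` package; the charset and length here are illustrative assumptions, not a stated policy:

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

// charset for generated credentials; adjust to each device's password policy.
const charset = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

// generatePassword returns an n-character password. crypto/rand.Int performs
// uniform sampling, so there is no modulo bias over the charset.
func generatePassword(n int) (string, error) {
	out := make([]byte, n)
	max := big.NewInt(int64(len(charset)))
	for i := range out {
		idx, err := rand.Int(rand.Reader, max)
		if err != nil {
			return "", err
		}
		out[i] = charset[idx.Int64()]
	}
	return string(out), nil
}

func main() {
	pw, err := generatePassword(16)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(pw) == 16) // high-entropy credential of the requested length
}
```

Note that Vault itself can also serve as the entropy source (it exposes a random-bytes endpoint under `sys/tools`), which keeps key generation inside the secure enclave.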
Great write up.
Question:
Providing answers to some questions:
In terms of Vault, I believe we are adequately protected in terms of backups, but a user create/update function should kick off an immediate backup. This should be relatively easy to build if we implement the process via Argo or build it into the update function itself.
In terms of additional load on Vault, I believe we should be fine. Vault has been very solid since we moved to Raft vs. the etcd provided by our etcd operator. We also have plans to provide a faster storage tier if needed, but that will rely on our Milan hardware spec or on moving Vault to master nodes. Either way, I believe we are covered here.
Concerns:
- This may be my lack of knowledge of our complete set of hardware-specific BMCs, but are we ruling out LDAP authentication as a future evolution? This question is more to further my understanding. I know you stated "We are still quite password bound" but I am not sure if we have devices that will just never support LDAP.
Yes, to my knowledge LDAP is not supported on BMCs. It's a basic-auth and SSH-keys-only type of environment.
Concerns:
- Something that we need to call out is GB vs. HPE nodes. HPE, I believe, has some credentials on a sticker on the system. I am not sure what we are doing with those credentials, but normally that would be what an on-site tech would use when diagnosing hardware issues. Are we building in the ability to toggle that on/off?
Yes, we plan on 'burning' any default credential on the system. The only legitimate password will be the passwords we set.
Excellent!
To extend:
This is excellent!!! 👍
My only suggestion regarding the API would be to evaluate whether it'd make sense to add the capability to list all stored credentials in a way that permits the synchronisation with external password managers. This can be very useful for sites having personnel doing on-call interventions in the event that SCSD is unavailable.
I would like to see some detail around the expected handling of edge/corner cases (replaced/moved/reworked endpoints, etc.). Steady state doesn't bother me; it is the less-travelled paths that keep me awake at night for work like this.
Solutions are not required, but documenting minimal expected behaviour, ramifications for service events and catastrophic data loss, other gotchas, etc. would help frame this proposal better.
Great callout Andy. This document is a high-level design and I would ultimately expect the implementation team to really flesh that out, but I'll provide some comments.
I agree with your concern. What 'concerns' me the most is the blade swap or 'unexpected' rediscovery scenario. I would expect that we would need to come up with a process, and wherever possible automation, to move credentials around. This is especially tricky if the CECs haven't been pre-programmed.
The negative effect of these processes going wrong is mostly that debugging gets harder, because it may take some time to realize that the expected credential is not accurate. The ramification is that a manual, physical hardware recovery process might need to be undertaken to rectify the device credential. One-offs would be annoying but manageable, but losing credentials for a larger number of nodes would probably be exponentially more frustrating.
I think ultimately, as we go towards implementation, understanding the direct manufacturing and field impacts would be good.
Andrew Nieuwsma | 2022-05-04 | Initial Approval CASM Complete. Needs customer review and other internal review.
Abstract
This proposal outlines several strategic objectives to increase our security posture for 'management' hardware: BMCs, switches, and PDUs. The major focus of this proposal will impact BMCs most directly, as they are the most numerous and are accessed most frequently. From this point on I will refer to all devices (PDUs, switches, BMCs) simply as 'devices'; I will clarify where necessary.
There is a large spectrum of possibilities for how we can manage device credentials. The options have different costs to implement and maintain, and they fit into a 'defense in depth' paradigm. This proposal suggests an 'ultimate end goal' for device credential management and offers directional options to maximize return on investment.
The following is a comparison between a desired end goal and current system maturity.
There is undoubtedly a huge work package for all of these items to progress to a 'desirable' future state.
Proposed Solutions
Objectives
I propose the following evolutionary progression to increase our security posture:
*Note: I refer to SCSD in this document; SCSD is the current home of the 'change device credentials' logic. SCSD may not remain the home of the credentials API, but for now I will assume it shall.
OBJECTIVE 1:
CSM manages all device credentials.
To implement this, we would require the ability to
We should create automation around this functionality such that a password rotation utility could be run periodically or on-demand. This utility would be run during the installation process (and in short order after a factory reset of a device) to change the default passwords baked into the firmware. This limits the exposure time of default well known credentials. This functionality might be a separate credential daemon, or might be built into SCSD.
If this objective was met, the following functional requirements would be complete*:
OBJECTIVE 2:
Implement least privileges and RBAC
This is a very large lift and would require phases of development and deployment:
If this objective was met, the following functional requirements would be complete*:
Implementation Considerations
Fidelity
Under no circumstances should the system lose the ability to manage the devices. Credentials must always be retained in a manner where system management is possible. The risk of being locked out of the devices would entail a very manual and costly process that would degrade customer operations until complete. Mountain devices would have to be manually logged into and have their eMMC wiped (`emmcnuke`); then new firmware has to be flashed onto the device.

We trust Vault as our secure enclave for device credentials. Vault must be backed up at a high frequency. I suggest that it should be continually backed up, as password changes can happen at any time. Furthermore, there must be strong processes around data egress related to system update and fresh install. If a customer completes a fresh install we should not force them to manually reset all devices in their system.
Single source of truth
Vault is the single operational source of truth for all credentials. However, there are many services (FAS, CAPMC, SLS, REDS, MEDS, HSM, PCS, ConMan, etc.) that all directly access Vault for BMC credentials. While Vault has an API, its K/V implementation is more akin to a data storage layer. The weak schema of the K/V store is a liability: there is no robust concept of schema management in Vault, so the K/V store schema cannot be updated without forcing all clients to the latest schema, else data loss is imminent.
To illustrate this point, consider the following hypothetical schemas:
If the schema is expanded to:
Any clients that marshal a v2 data structure into a v1 object will drop the `role` field. If an older client writes the record back into the data store, the `role` field will not be included in the payload and will therefore be deleted for all clients.

To this end, the K/V store should track a 'schema' version. While this does not prohibit old clients from consuming data in a manner that is destructive, it would help clients have a 'hint' on whether or not they are compatible.
Furthermore, I propose that the number of clients allowed direct access to the Vault K/V store be limited to one: SCSD (or whatever credentials service we create).
This would allow for several optimizations:
There are several trade-offs:
- Data must move from its current location (`hms-creds`). Some level of data migration must exist; this introduces risk of failure.
- Services accessing `hms-creds` are using the HMS Go pkgs `hms-securestorage` and `hms-compcredentials`. Perhaps there is a way we can update the library to effectively keep the same interface, although all clients must still upgrade to the latest pkg.

Questions
Appendix
System Diagram
(golang) to provide a cache invalidation layer that keeps local credentials up to date.

Vault Schema
Current Schema
This is the current layout for the credentials that HMS depends on.
Proposed Schema for Objective 1
This is a proposed schema layout to support Objective 1. It does not have any role information, which would probably be an extension implemented as part of Objective 2. I'm not sure how to store 'schema' versioning in Vault; perhaps it's a record-by-record datum.
Thoughts about APIs and Libraries
I've put some considerable thought into what the API should look like for this service. We want to balance flexibility of extension with what we want or need immediately. A few concerns come to mind about the nature of the data.
Today there is only one stored account per xname in Vault. If we allow multiple accounts in Vault per xname, how would a service or an admin distinguish which account to use? It's possible that there could be unique account names and unique permissions or roles for each account (on each xname). Furthermore, the account 'root' or 'foo' could have different permissions in actuality between two different xnames; roles are not universal across all hardware vendor implementations. This only gets more complicated when we consider different types of credentials: OS, IPMI, Redfish, etc.
The bare minimum need is the ability to create some type of administrator service account (perhaps named `administrator-service-account`) that should have `root` privileges via Redfish and be the expected 'well known' user account the services should use.

I'm thinking that a library that uses the credentials API should have the ability to create a new user with those permissions if they don't exist.
I've been doing reading about GraphQL APIs. They serve to close the gap, common in straight REST APIs, of over-fetching (getting too much data we don't care about) and under-fetching (having to make many serial calls to get all the information we do care about). I think this is ripe for GraphQL consideration. An admin might fetch a single credential, but most of our micro-services work on batches of xnames.
Suggested Reviewers
- [ ] @atifsyedali
- [x] @jeremy-duckworth
- [ ] @alexanderkingh
- [x] @jsollom-hpe
- [x] @rsjostrand-hpe
Comment Period
Comment period for this proposal shall close on June 8, 2022.