jacobsalmela opened this issue 3 years ago.
@trad511 (Sean Lynn) will coordinate on this proposal. We also need the comment period for this proposal extended to at least Thursday, Sept. 2.
I totally support this idea. There isn't much here to review other than the idea statement and a bunch of background material.
There are many tickets that cover the headaches of not having a standard format, but since you already support the idea, there's no need to go down that hole... 😄
@jacobsalmela I support the idea. The CANU utility, which will be fully available in CSM 1.2, begins to audit and enforce standards. I believe the win here is making sure that all CSM teams and tooling use and enforce the same standards. I've seen some of the CSI changes you have made and agree with that direction.
We need to be careful, because the SHCD is essentially owned by no one but used by everyone. It's easy to make changes that have real downstream effects on other teams' tooling and jobs, DCHW labeling and rack and site layouts being two examples. There is a larger CSM process underway to (hopefully) make fundamental changes to the meta-process around this, along with executive support for changes in this area.
Out of the larger effort, we want "the SHCD" to be used during the design phase, but once installation is ready, the information should be converted to a machine-readable format (JSON, or ingested into a database) where it can be referenced, updated, and versioned from then on. This is obviously a messy changeset over a longer time period.
I believe we can have actionable outcomes for this proposal today within CSM. If we view the SHCD as the system's initial data source, and, broadly, CSI and CANU as the systems and networking tools respectively, then I do think we can make things less complicated today within our sphere of influence. Concretely, within one or two releases we can:
- `canu validate shcd` (available today), which enforces spreadsheet format, device naming, slot naming, port numbering, network architecture by version, hardware used, cabling, etc. The output of this is a "CSM valid" SHCD. This can be used internally today and will be in the CSM 1.2 release. This feature could eliminate the need for https://github.com/Cray-HPE/cray-site-init/issues/104
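To make the kinds of checks `canu validate shcd` performs more concrete, here is a minimal Python sketch of rule-based validation over a machine-readable SHCD. It is not CANU's implementation: the `shcd.json` layout (`devices`, `connections`, `source_port`, `destination_port`) and the naming regex are hypothetical, chosen only to illustrate device-naming and cabling checks.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical naming rule for illustration only; the real CANU rules depend
# on the network architecture and CSM version.
DEVICE_NAME = re.compile(r"^(sw-(spine|leaf|leaf-bmc|cdu)-\d{3}|ncn-[msw]\d{3}|uan\d{3})$")


def validate_shcd(shcd: Dict) -> List[str]:
    """Return human-readable errors for a hypothetical shcd.json dict.

    An empty list means every check passed.
    """
    errors: List[str] = []
    names: Set[str] = set()
    for device in shcd.get("devices", []):
        name = device.get("name", "")
        if not DEVICE_NAME.match(name):
            errors.append(f"invalid device name: {name!r}")
        if name in names:
            errors.append(f"duplicate device: {name!r}")
        names.add(name)

    # Each (device, port) pair may appear in at most one cable.
    used_ports: Set[Tuple[str, int]] = set()
    for cable in shcd.get("connections", []):
        for end in ("source", "destination"):
            device, port = cable[end], cable[f"{end}_port"]
            if device not in names:
                errors.append(f"cable references unknown device: {device!r}")
            if not isinstance(port, int) or port < 1:
                errors.append(f"invalid port {port!r} on {device!r}")
            elif (device, port) in used_ports:
                errors.append(f"port {port} on {device!r} is cabled twice")
            else:
                used_ports.add((device, port))
    return errors


if __name__ == "__main__":
    sample = {
        "devices": [{"name": "sw-leaf-001"}, {"name": "ncn-m001"}, {"name": "Leaf Switch 2"}],
        "connections": [
            {"source": "ncn-m001", "source_port": 1, "destination": "sw-leaf-001", "destination_port": 1},
            {"source": "ncn-m001", "source_port": 1, "destination": "sw-leaf-001", "destination_port": 2},
        ],
    }
    for error in validate_shcd(sample):
        print(error)
```

Returning a list of errors instead of stopping at the first one lets a single run report every problem in the spreadsheet-derived data at once.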
@rkleinman-hpe In its present state, we can create an `shcd.json` file using an `shcd.xlsx` as input. This was not my original intent with this ticket, but it's a step in the right direction. The machine-readable SHCD should be the first thing that is created and modified, but using the existing SHCD is a good stepping stone.
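The conversion from `shcd.xlsx` to `shcd.json` can be sketched with Python's standard library, assuming the cabling tab has first been exported to CSV (reading `.xlsx` directly would need a third-party library). The column headers and the output layout below are hypothetical and do not reflect what `canu` actually reads or emits.

```python
import csv
import io
import json
from typing import Dict, List, TextIO

# Hypothetical column headers for a cabling tab exported from shcd.xlsx to CSV.
COLUMNS = ("Source", "Source Port", "Destination", "Destination Port")


def cabling_csv_to_shcd(stream: TextIO) -> Dict:
    """Convert exported cabling rows into a hypothetical shcd.json structure."""
    devices: List[str] = []  # keeps first-seen order for readable output
    connections = []
    for row in csv.DictReader(stream):
        # Short rows yield None for missing cells, so treat None as empty.
        missing = [c for c in COLUMNS if not (row.get(c) or "").strip()]
        if missing:
            raise ValueError(f"row {row!r} is missing {missing}")
        # Normalize names so "NCN-M001" and "ncn-m001" refer to the same device.
        src = row["Source"].strip().lower()
        dst = row["Destination"].strip().lower()
        for name in (src, dst):
            if name not in devices:
                devices.append(name)
        connections.append({
            "source": src,
            "source_port": int(row["Source Port"]),
            "destination": dst,
            "destination_port": int(row["Destination Port"]),
        })
    return {"version": 1, "devices": [{"name": n} for n in devices], "connections": connections}


if __name__ == "__main__":
    exported = io.StringIO(
        "Source,Source Port,Destination,Destination Port\n"
        "ncn-m001,1,sw-leaf-001,1\n"
        "ncn-m002,1,sw-leaf-001,2\n"
    )
    print(json.dumps(cabling_csv_to_shcd(exported), indent=2))
```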
Here is a workflow that we can currently execute:
1. Create `shcd.json` using `canu`.
2. Generate `switch_metadata.csv`, `hmn_connections.json`, `application_node_config.yaml`, and `ncn_metadata.csv` with `csi`, using `shcd.json` as input.
3. Run `csi config init`.
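The second step of this workflow (generating seed files with `csi` from `shcd.json`) can be sketched in Python for one of the seed files. The sketch renders `switch_metadata.csv` from a hypothetical `shcd.json` layout in which each device carries `kind`, `xname`, `type`, and `brand` fields; those field names are assumptions, and the `Switch Xname,Type,Brand` header reflects our understanding of the CSM seed-file format, so check it against the CSM release in use. It is not how `csi` implements this step.

```python
import csv
import io
from typing import Dict


def switch_metadata_csv(shcd: Dict) -> str:
    """Render switch_metadata.csv content from a hypothetical shcd.json dict."""
    out = io.StringIO()
    # lineterminator="\n" avoids the csv module's default "\r\n" line endings.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Switch Xname", "Type", "Brand"])
    switches = [d for d in shcd.get("devices", []) if d.get("kind") == "switch"]
    # Sort by xname so regenerating the file gives a stable, diff-friendly result.
    for device in sorted(switches, key=lambda d: d["xname"]):
        writer.writerow([device["xname"], device["type"], device["brand"]])
    return out.getvalue()


if __name__ == "__main__":
    example = {
        "devices": [
            {"name": "sw-leaf-001", "kind": "switch", "xname": "x3000c0w14", "type": "Leaf", "brand": "Aruba"},
            {"name": "sw-spine-001", "kind": "switch", "xname": "x3000c0h33s1", "type": "Spine", "brand": "Aruba"},
            {"name": "ncn-m001", "kind": "node"},
        ]
    }
    print(switch_metadata_csv(example), end="")
```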
The second step is to eliminate these "seed files" entirely and instead generate runtime files such as `sls_input_file.json` directly. The source of truth (`shcd.json`) then feeds data directly into the files needed at runtime, eliminating much of the tedious and error-prone process of creating them manually.
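The idea of moving data directly from the source of truth into runtime files can be sketched as a table of generator functions, each producing one file from the same `shcd.json` dict. Everything here is hypothetical: the field names, the `GENERATORS` registry, and especially the output, which is a heavily simplified stand-in and not the real `sls_input_file.json` (SLS) schema.

```python
import json
from typing import Callable, Dict

# Each generator turns the single source of truth into one runtime file.
Generator = Callable[[Dict], str]


def hardware_json(shcd: Dict) -> str:
    """Hypothetical, heavily simplified stand-in for sls_input_file.json.

    This is NOT the real SLS schema; it only shows data flowing straight
    from shcd.json into a runtime file with no intermediate seed files.
    """
    hardware = {
        d["xname"]: {"name": d["name"], "type": d["type"]}
        for d in shcd.get("devices", [])
        if "xname" in d
    }
    # sort_keys makes the output byte-for-byte reproducible.
    return json.dumps({"Hardware": hardware}, indent=2, sort_keys=True)


GENERATORS: Dict[str, Generator] = {
    "sls_input_file.json": hardware_json,
}


def render_runtime_files(shcd: Dict) -> Dict[str, str]:
    """Regenerate every runtime file from shcd.json in one pass."""
    return {filename: generate(shcd) for filename, generate in GENERATORS.items()}
```

With this shape, supporting another runtime file would mean registering one more generator rather than introducing another hand-maintained seed file.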
Finally, `shcd.json` should completely replace the `.xlsx` files, and people should only update the JSON files going forward.
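Once the JSON file is the only thing people edit, and it is versioned as described earlier in this thread, changes can be reviewed structurally instead of by comparing spreadsheet cells. Here is a minimal Python sketch that compares the cabling in two versions of a hypothetical `shcd.json` layout; the `connections` field names are assumptions.

```python
from typing import Dict, List, Set, Tuple

# A cable as (source, source_port, destination, destination_port).
Cable = Tuple[str, int, str, int]


def cables(shcd: Dict) -> Set[Cable]:
    """Normalize connections into hashable tuples for set comparison."""
    return {
        (c["source"], c["source_port"], c["destination"], c["destination_port"])
        for c in shcd.get("connections", [])
    }


def diff_cabling(old: Dict, new: Dict) -> Dict[str, List[Cable]]:
    """Report cables added and removed between two versions of shcd.json."""
    before, after = cables(old), cables(new)
    # Sorting keeps the report stable across runs.
    return {"added": sorted(after - before), "removed": sorted(before - after)}
```

A re-cabled port shows up as one removal plus one addition, which is usually what a reviewer wants to see.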
Abstract
Implement a standardized SHCD format that is automation-friendly.
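As a rough illustration of what "automation-friendly" could mean, the following Python sketch models a standardized SHCD as typed records with a schema version and a lossless JSON round trip. The class names, fields, and the `version` check are hypothetical, not a proposed final schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass(frozen=True)
class Device:
    name: str   # e.g. "sw-leaf-001"
    xname: str  # e.g. "x3000c0w14"
    type: str   # e.g. "Leaf"


@dataclass(frozen=True)
class Connection:
    source: str
    source_port: int
    destination: str
    destination_port: int


@dataclass
class Shcd:
    version: int = 1
    devices: List[Device] = field(default_factory=list)
    connections: List[Connection] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into the nested dataclasses inside the lists.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "Shcd":
        raw = json.loads(text)
        if raw.get("version") != 1:
            raise ValueError(f"unsupported SHCD schema version: {raw.get('version')!r}")
        # Device(**d) raises TypeError on missing or unexpected fields,
        # so malformed input fails loudly instead of being half-read.
        return cls(
            version=raw["version"],
            devices=[Device(**d) for d in raw.get("devices", [])],
            connections=[Connection(**c) for c in raw.get("connections", [])],
        )
```

Rejecting unknown schema versions and unexpected fields up front is one way tools such as `csi` and `canu` could agree on a single input format.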
Problem Statement
To start an install of Shasta, we require a minimum set of information from manufacturing. This information is either non-existent or spread across several tabs of the SHCD in a non-standard format. It must be manually interpreted by a human and hand-crafted into a computer-friendly format (JSON, CSV, etc.).
Without standardized, automation-friendly input to start with, we end up struggling for weeks or months to get the right information in place or to interpret the SHCD.
Use Cases
- `hmn_connections.json`
- `ncn_metadata.csv`
- `switch_metadata.csv`
- `cabinets.yaml`
- `application_node_config.yaml`
- `csi`'s interpretation of the data to be the new source of truth
Internal References
External References
Proposed Solution(s)
- Update `csi` and/or `canu` to accept this new standardized input and use it to assemble the pieces we need for the CSM install.
Impact of Action/Inaction
What if we don't solve this problem at this point?
We will continue to hand-craft computer-friendly config files and run into config issues, which slow down the install process considerably.
What impact is there beyond the problem statement if we fix the problem now?
Other teams and processes will need to adjust their code to account for the new standardized format, which could take considerable effort, but it would give us confidence moving forward that the configs are all correct.
Further Information
- `application_node_config.yaml`
- `cabinets.yaml`
- `hmn_connections.json`
- `ncn_metadata.csv`
- `switch_metadata.csv`
Suggested Reviewers
Comment Period
Comment period for this proposal shall close on August 25, 2021.