Cray-HPE / community

Implement a standard format for SHCD files #20

Open jacobsalmela opened 3 years ago

jacobsalmela commented 3 years ago

Abstract

Implement a standardized SHCD format that is automation-friendly.

Problem Statement

To start an install of Shasta, we require a minimum set of information from manufacturing. This information is either non-existent or spread across several tabs in a non-standard format in the SHCD. It has to be manually interpreted by a human and hand-crafted into a computer-friendly format (JSON, CSV, etc.).

                                                                                                        
 ┌───────────────────┐    Human interpretation /     ┌──────────────┐                                   
 │     SHCD file     │─────────manual tasks─────────▶│Multiple seed │                                   
 └───────────────────┘                               │files created │                                   
                                                     └──────────────┘                                   
                                                             │                                          
                            ┌────────────────────────────────┼──────────────────────────────┐           
                            │                                │                              │           
                            ▼                                │                              ▼           
                ┌──────────────────────┐    ┌────────────────┼───────────────┐  ┌──────────────────────┐
                │ hmn_connections.json │    │                ▼               │  │ switch_metadata.csv  │
                └──────────────────────┘    │    ┌──────────────────────┐    │  └──────────────────────┘
                           │                │    │   ncn_metadata.csv   │    │              │           
                           │                │    └──────────────────────┘    │              │           
                           │                ▼                │               ▼              │           
                           │ ┌─────────────────────────────┐ │   ┌──────────────────────┐   │           
                           │ │application_node_config.yaml │ │   │    cabinets.yaml     │   │           
                           │ └─────────────────────────────┘ │   └──────────────────────┘   │           
                           │                │                │               │              │           
                           │                │                │               │              │           
                           └────────────────┼────────────────┼───────────────┼──────────────┘           
                                            │                │               │                          
                                            │                │               │                          
                                            └────────────────┼───────────────┘                          
                                                             │                                          
                                                             │                                          
                                                             │                                          
                                                             │                                          
                                                             ▼                                          
                                                    ┌────────────────┐                                  
                                                    │  CSM install   │                                  
                                                    └────────────────┘                                  

Without standardized and automation-friendly input to start with, we end up struggling for weeks or months trying to get the right information in place or interpreting the SHCD.

Use Cases

Internal References

External References

Proposed Solution(s)

  1. Create a standardized format for the SHCD that allows for csi and/or canu to accept this new standardized input and use it for assembling pieces we need for the CSM install.
┌───────────────────┐                               ┌──────────────┐ 
│   Standardized    │──────────csi ingest──────────▶│   csm.yml    │ 
│     SHCD file     │                               └──────────────┘ 
└───────────────────┘                                       │        
                                                            │        
                                                            ▼        
                                                   ┌────────────────┐
                                                   │  CSM install   │
                                                   └────────────────┘
  2. Create a discovery image for manufacturing that would allow them to boot and discover hardware to generate an inventory. This same image would be used during PIT-mode to generate an inventory and if it matches the one from manufacturing, the install can proceed.
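
The inventory check in the second option could be sketched roughly as follows. This is a minimal illustration, not an existing tool: the inventory layout (a mapping from serial number to attributes), the field names, and both function names are assumptions made up for the example.

```python
# Hypothetical sketch: compare the inventory generated at manufacturing with
# the one generated on site during PIT-mode. The data layout is assumed.

def inventory_diff(expected, discovered):
    """Return what is missing, unexpected, or different between inventories."""
    expected_ids, discovered_ids = set(expected), set(discovered)
    return {
        # Shipped by manufacturing but not found on site.
        "missing": sorted(expected_ids - discovered_ids),
        # Found on site but never recorded by manufacturing.
        "unexpected": sorted(discovered_ids - expected_ids),
        # Present in both, but the recorded attributes differ.
        "changed": sorted(
            i for i in expected_ids & discovered_ids
            if expected[i] != discovered[i]
        ),
    }


def install_may_proceed(diff):
    """The install proceeds only when every category of the diff is empty."""
    return not any(diff.values())


# Illustrative data only; serial numbers and attributes are invented.
manufacturing = {
    "SN001": {"kind": "ncn", "model": "model-a"},
    "SN002": {"kind": "switch", "model": "model-b"},
    "SN003": {"kind": "ncn", "model": "model-a"},
}
site = {
    "SN001": {"kind": "ncn", "model": "model-a"},
    "SN002": {"kind": "switch", "model": "model-c"},
    "SN004": {"kind": "ncn", "model": "model-a"},
}

diff = inventory_diff(manufacturing, site)
print(diff)
print("proceed" if install_may_proceed(diff) else "stop and reconcile")
```

Reporting the three categories separately, rather than a single yes/no, would let whoever is on site see which hardware needs attention before retrying.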

Impact of Action/Inaction

What if we don't solve this problem at this point?

We will continue to hand-edit computer-friendly config files and run into config issues, which slow down the install process considerably.

What impact is there beyond the problem statement if we fix the problem now?

Other teams and processes will need to adjust their code to account for the new standardized format, which could take considerable effort, but it would give us confidence moving forward that the configs are all correct.

Further Information

Suggested Reviewers

Comment Period

Comment period for this proposal shall close on [[August 25, 2021]].

dborman-hpe commented 3 years ago

@trad511 (Sean Lynn) will coordinate on this proposal. We also need the period for this proposal extended until at least Thursday, Sept. 2.

jsollom-hpe commented 3 years ago

I totally support this idea. There isn't much here to review other than the idea statement and a bunch of background material.

jacobsalmela commented 3 years ago

There are many tickets that cover the headaches of not having a standard format, but since you already support the idea, there's no need to go down that hole... 😄

trad511 commented 3 years ago

@jacobsalmela I support the idea. The CANU utility, which will come fully in CSM 1.2, begins to audit and enforce standards. I believe the win we could have here is to make sure that all CSM teams and tooling use and enforce the same standards. I've seen some of the CSI changes you have made and agree with that direction.

We need to be careful in that the SHCD is essentially owned by no one, but used by everyone. It's easy to make changes that have real downstream effects on other teams' tooling and jobs - DCHW labeling and rack and site layouts are two examples. There is a larger CSM process underway to (hopefully) make fundamental changes to the meta-process around this, and executive support around changes in this area.

Out of the larger effort, we want "the SHCD" to be used during the design phase, but once installation is ready, the information is converted to a machine-readable format - JSON or ingested into a database - where it can be referenced, updated and versioned from then on. This is obviously a messy changeset over a longer time period.

I believe we can have actionable outcomes from this proposal today within CSM.

If we view the SHCD as the system's initial data source, and generally CSI and CANU as the systems and networking tools, then I do think we can make things less complicated today within our sphere of influence. Concretely, within one or two releases we can:

  1. Take an arbitrary SHCD.
  2. Use CANU to validate the SHCD with canu validate shcd (available today), which enforces spreadsheet format, device naming, slot naming, port numbering, network architecture by version, hardware used, cabling, etc. The output of this is a "CSM valid" SHCD. This can be used today internally and will be in the CSM 1.2 release.
  3. By the end of Sept 2021, the CANU SHCD validator will also produce a machine-readable (JSON) version of the SHCD. This, I believe, is one area where the schema should focus.
  4. The new machine-readable SHCD should - with a new process and the required code - be ingested into CSI. This will obviate the need for hmn_connections.json and its current codebase, as well as switch_metadata.csv.
  5. CSI would also begin using the schema internally.
  6. A decision-point should be reached whether SLS and SMD should perform data validation via the schema.
  7. Longer term, we look at processes to possibly have CANU manage cabinets.yaml and generate a system and its management network to PoR standards based on a few input parameters (in the works). We also should really look at a process to automate the generation of ncn_metadata.csv.
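
To make the schema discussion in steps 3 through 6 more concrete, here is a minimal sketch of the kind of checks a shared schema could enforce on a machine-readable SHCD. The JSON layout used here (a "topology" list of nodes with "id", "common_name", "type", and "ports") is an assumption for illustration and is not claimed to be CANU's actual output format; the function name is hypothetical as well.

```python
# Hypothetical sketch: structural checks on an assumed machine-readable SHCD.
# The key names below are illustrative, not a published schema.

REQUIRED_NODE_KEYS = {"id", "common_name", "type", "ports"}


def validate_shcd(shcd):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    nodes = shcd.get("topology", [])
    seen_ids = set()

    # Pass 1: every node has the required keys and a unique id.
    for index, node in enumerate(nodes):
        missing = REQUIRED_NODE_KEYS - set(node)
        if missing:
            problems.append(f"node {index}: missing keys {sorted(missing)}")
            continue
        if node["id"] in seen_ids:
            problems.append(f"node {index}: duplicate id {node['id']}")
        seen_ids.add(node["id"])

    # Pass 2: every port must point at a node that exists in the file.
    known_ids = {node["id"] for node in nodes if "id" in node}
    for node in nodes:
        for port in node.get("ports", []):
            dest = port.get("destination_node_id")
            if dest not in known_ids:
                name = node.get("common_name", "?")
                problems.append(
                    f"{name}: port {port.get('port')} points at unknown node id {dest}"
                )
    return problems


# Illustrative documents only; names and ids are invented.
good = {"topology": [
    {"id": 0, "common_name": "sw-spine-001", "type": "switch",
     "ports": [{"port": 1, "destination_node_id": 1}]},
    {"id": 1, "common_name": "ncn-m001", "type": "server",
     "ports": [{"port": 1, "destination_node_id": 0}]},
]}
bad = {"topology": [
    {"id": 0, "common_name": "sw-spine-001", "type": "switch",
     "ports": [{"port": 1, "destination_node_id": 7}]},
    {"id": 0, "common_name": "ncn-m001", "ports": []},
]}

print(validate_shcd(good))
for problem in validate_shcd(bad):
    print(problem)
```

If CSI, CANU, SLS and SMD all ran the same checks against the same schema, a file accepted by one tool would not be rejected by the next.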

jacobsalmela commented 3 years ago

Please see https://github.com/Cray-HPE/cray-site-init/pull/34

jacobsalmela commented 2 years ago

This feature could eliminate the need for https://github.com/Cray-HPE/cray-site-init/issues/104

jacobsalmela commented 2 years ago

@rkleinman-hpe

In its present state, we can create an shcd.json file using an shcd.xlsx as input. This was not my original intent with this ticket, but it's a step in the right direction. The machine-readable SHCD should be the first thing that is created and modified, but using the existing SHCD is a good stepping stone.

Here is a workflow that we can currently execute:

The second step here is to eliminate all of these "seed files" and instead generate runtime files such as sls_input_file.json directly. The source of truth (the shcd.json) then moves data straight into the files needed at runtime, eliminating much of the tedious and error-prone process of creating them manually.
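
As a rough illustration of that second step, the sketch below derives runtime-style data (a switch list and a de-duplicated cable list) straight from an in-memory shcd.json, with no intermediate seed file. Both the input layout and the output layout are assumptions for the example; in particular, the output is not the real sls_input_file.json format, and the function name is hypothetical.

```python
# Hypothetical sketch: go from an assumed shcd.json layout directly to
# runtime-style data, skipping hand-made seed files.
import json


def runtime_view(shcd):
    """Build a switch list and a cable list from the assumed SHCD layout."""
    names = {node["id"]: node["common_name"] for node in shcd["topology"]}
    switches = sorted(
        node["common_name"] for node in shcd["topology"]
        if node["type"] == "switch"
    )

    connections = []
    seen = set()
    for node in shcd["topology"]:
        for port in node["ports"]:
            # Order the two cable ends so the same cable, recorded once on
            # each device, collapses into a single entry.
            ends = tuple(sorted([
                (node["common_name"], port["port"]),
                (names[port["destination_node_id"]], port["destination_port"]),
            ]))
            if ends in seen:
                continue
            seen.add(ends)
            (a, a_port), (b, b_port) = ends
            connections.append(
                {"a": a, "a_port": a_port, "b": b, "b_port": b_port}
            )
    return {"switches": switches, "connections": connections}


# Illustrative data only; device names and ports are invented.
shcd = {"topology": [
    {"id": 0, "common_name": "sw-leaf-bmc-001", "type": "switch",
     "ports": [{"port": 1, "destination_node_id": 1, "destination_port": 1}]},
    {"id": 1, "common_name": "ncn-m001", "type": "server",
     "ports": [{"port": 1, "destination_node_id": 0, "destination_port": 1}]},
]}

runtime = runtime_view(shcd)
print(json.dumps(runtime, indent=2))
```

Because the output is computed rather than typed by hand, regenerating it after an SHCD change would be a rerun instead of a manual edit.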

Finally, the shcd.json should completely replace the .xlsx files and people should only update the JSON files going forward.