OpenCHAMI / roadmap

Public Roadmap Project for Ochami
MIT License

[RFD] Introduce a simple node orchestration API to simplify interaction with Nodes for the Supercomputer Institute #23

Open alexlovelltroy opened 5 months ago

alexlovelltroy commented 5 months ago

With the current API spec, adding nodes is a bit cumbersome for our students to navigate.

Our microservices are optimized for adding and removing nodes on the fly through a continuous discovery system based on Redfish. The appearance of a new Redfish Endpoint in SMD triggers a discovery process that walks the Redfish tree of whatever device is connected. That applies as easily to a chilled water system as it does to a compute node. This flexibility is great when operating a large heterogeneous system, but it is confusing when students need to bootstrap a small cluster and operate it for a short lifetime.

With so many concepts to teach, I don't think it's reasonable to also teach this extended and flexible discovery system. Our students will want to add nodes directly and query one place to understand which compute nodes make up the system.

I am proposing a thin node orchestration API that presents a Node-centric view of the world to students and converts their intent to SMD/BSS calls on the backend. This Go microservice will support filtered CRUD operations on a Node schema that looks like this:

// BootProfile describes how a node boots and how it is accessed afterward
type BootProfile struct {
    // The internal UUID of the node
    ID uuid.UUID `json:"id,omitempty"`
    // Public SSH Key for access to the node
    SSHKey string `json:"ssh_key,omitempty"`
    // AdminUsername is the username for the admin account on the node (non-root users will need properly configured sudo access)
    AdminUsername string `json:"admin_username,omitempty"`
    // UserData is the Cloud-Init user-data for the node
    UserData string `json:"user_data,omitempty"`
    // The url for the root filesystem image for the node which will be used by BSS to create the boot script
    ImageURL string `json:"image_url,omitempty"`
    // The url for the kernel image for the node
    KernelURL string `json:"kernel_url,omitempty"`
    // The url for the initrd image for the node
    InitrdURL string `json:"initrd_url,omitempty"`
    // Kernel parameters for the node
    KernelParams string `json:"kernel_params,omitempty"`
}

// ComputeNode represents an individual node, physical or virtual, that we can boot and run jobs on
type ComputeNode struct {
    // The internal UUID of the node
    ID uuid.UUID `json:"id,omitempty"`
    // The XName of the node following xname standard from github.com/Cray-HPE/hms-xname
    XName string `json:"xname"`
    // ManagementIP is the IP address of the node on the management network
    ManagementIP string `json:"management_ip,omitempty"`
    // BMCIP is the IP address of the node's BMC
    BMCIP string `json:"bmc_ip,omitempty"`
    // BMCMAC is the MAC address of the node's BMC
    BMCMAC string `json:"bmc_mac,omitempty"`
    // BootMAC is the MAC address of the node's boot interface
    BootMAC string `json:"boot_mac,omitempty"`
    // BootProfile describes the image, kernel, and access settings used to boot the node
    BootProfile BootProfile `json:"boot_profile,omitempty"`
    // Architecture of the node
    Arch string `json:"arch,omitempty"`
    // LastUpdated is the last time the ComputeNode was updated
    LastUpdated time.Time `json:"last_updated,omitempty"`
}

// NodeGroup represents a group of compute nodes
type NodeGroup struct {
    // The internal UUID of the group
    ID uuid.UUID `json:"id,omitempty"`
    // Description of the nodegroup. Intention and Usage Information
    Description string `json:"description"`
    // List of nodes in the group
    Nodes []ComputeNode `json:"nodes"`
    // LastUpdated is the last time the Group was updated
    LastUpdated time.Time `json:"last_updated,omitempty"`
}

Through these three structures, I believe we can capture the needs of a node-centric student and convert those into upstream API calls.
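
To make "filtered CRUD" a bit more concrete, here is a minimal in-memory sketch of what the service could expose internally. Everything in it is an assumption for illustration: `Node` is a trimmed-down stand-in for `ComputeNode`, and `NodeStore`, `Filter`, and `ByArch` are hypothetical names, not part of any existing OpenCHAMI service. It does not talk to SMD or BSS at all.

```go
package main

import (
	"errors"
	"sort"
	"sync"
)

// Node is a trimmed-down stand-in for ComputeNode (assumption: only
// the fields needed to show filtering are kept).
type Node struct {
	XName string
	Arch  string
}

// ErrNotFound is returned when no node has the requested xname.
var ErrNotFound = errors.New("node not found")

// Filter reports whether a node should be included in a List result.
type Filter func(Node) bool

// NodeStore is a hypothetical in-memory store keyed by xname.
type NodeStore struct {
	mu    sync.RWMutex
	nodes map[string]Node
}

// NewNodeStore returns an empty store.
func NewNodeStore() *NodeStore {
	return &NodeStore{nodes: make(map[string]Node)}
}

// Put creates the node or replaces an existing node with the same xname.
func (s *NodeStore) Put(n Node) error {
	if n.XName == "" {
		return errors.New("xname is required")
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nodes[n.XName] = n
	return nil
}

// Get returns the node with the given xname, or ErrNotFound.
func (s *NodeStore) Get(xname string) (Node, error) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	n, ok := s.nodes[xname]
	if !ok {
		return Node{}, ErrNotFound
	}
	return n, nil
}

// Delete removes the node with the given xname, or returns ErrNotFound.
func (s *NodeStore) Delete(xname string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.nodes[xname]; !ok {
		return ErrNotFound
	}
	delete(s.nodes, xname)
	return nil
}

// List returns every node that passes all filters, sorted by xname.
func (s *NodeStore) List(filters ...Filter) []Node {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]Node, 0, len(s.nodes))
	for _, n := range s.nodes {
		if matchesAll(n, filters) {
			out = append(out, n)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].XName < out[j].XName })
	return out
}

// matchesAll reports whether n passes every filter (true when there are none).
func matchesAll(n Node, filters []Filter) bool {
	for _, f := range filters {
		if !f(n) {
			return false
		}
	}
	return true
}

// ByArch keeps only nodes with the given architecture.
func ByArch(arch string) Filter {
	return func(n Node) bool { return n.Arch == arch }
}
```

A caller would do something like `store.List(ByArch("aarch64"))` to get only the ARM nodes. Keying on xname rather than the UUID is a choice made for this sketch, since xname is the one field in `ComputeNode` without `omitempty`.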

What am I missing? Should we proceed?

njones-lanl commented 5 months ago

I think it'd be nice to have an input wrapper as well, so the students can write in a standard data format (I'm a yaml enjoyer), and run something against it to create their resources.

It'd also replicate how other microservice architectures do their resource ingestion (e.g. kubectl apply -f yourcoolpodshere.yaml).
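
A rough sketch of the ingestion side, under some loud assumptions: `Manifest`, `ManifestNode`, and `ParseManifest` are hypothetical names, and the manifest is JSON here only because Go's standard library has no YAML parser. A YAML library (gopkg.in/yaml.v3 is one option) could be swapped in for the decoding step.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// ManifestNode is a hypothetical subset of ComputeNode used for ingestion.
type ManifestNode struct {
	XName   string `json:"xname"`
	BootMAC string `json:"boot_mac,omitempty"`
	Arch    string `json:"arch,omitempty"`
}

// Manifest is a hypothetical top-level document listing nodes to create.
type Manifest struct {
	Nodes []ManifestNode `json:"nodes"`
}

// ParseManifest decodes a manifest, rejecting unknown fields and any
// node that has no xname.
func ParseManifest(data []byte) (*Manifest, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	// Fail on typos such as "bootmac" instead of silently ignoring them.
	dec.DisallowUnknownFields()

	var m Manifest
	if err := dec.Decode(&m); err != nil {
		return nil, fmt.Errorf("decode manifest: %w", err)
	}
	for i, n := range m.Nodes {
		if n.XName == "" {
			return nil, fmt.Errorf("nodes[%d]: xname is required", i)
		}
	}
	return &m, nil
}
```

Rejecting unknown fields is a deliberate choice in this sketch: a misspelled key in a student's file produces an error up front rather than a node that is quietly missing its boot MAC.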

alexlovelltroy commented 5 months ago

As a fellow yaml enjoyer, I was thrilled to see that Apple has open-sourced Pkl, which is even awesomer.

https://pkl-lang.org/