black-marlin / power-redfish

Review the Redfish schema and provide comments that can be incorporated into the standard.

Log Entry Schema: Suggested Changes #2

Open thalerj opened 7 years ago

thalerj commented 7 years ago

Log Entry Scope Issues

The changes listed below are meant to help an end user process logs quickly by giving each entry a human-readable, discernible categorization. Explicit flags are added to help log analyzers run more efficiently.

Sensor Types: These mix what is happening with where it is happening. Recommend splitting the what and the where for clarity.

Impacted area of the system (Recommend to be required)
• Canister/Appliance – contains system components not expected to be serviced by a customer
• System Board – main system board, associated risers, system planar, mid-planes, back planes, interconnects
• Processing – involves the processor, processor cards and system board, configuration settings, microcode, cache, Trusted Computing Module, processor interconnect (QPI cables)
• Memory – includes DIMMs, memory card, configuration settings, memory controller, redundant modes (mirroring, spare, etc.), RAID memory, NVRAM, EPROM
• Power – can be power supplies, VRMs, VRDs, voltage levels, system power state, policies, batteries, AT power width, TPMD, power controllers, external power, Battery Backup Unit (UPS), PDUs
• Cooling – fans, blowers, mux cards, policies, chillers/refrigeration, water management units, water pumps, water filtration, air flow sensors, thermal monitors
• I/O connectivity – PCI/USB hub, bridge, bus, risers, configuration settings, interconnect, keyboard, mouse, KVM
• Storage RAID – adapters, configuration, settings, interconnect, arrays, drive enclosures
• Client Data Storage Device – flash storage adapters, drives, CD/DVD drives, SSD, SAS, DASD, flash storage, tape, volumes, remoteCopy, flashCopy, managed Storage Systems
• Display – graphics adapters, op panel, monitor/console
• VPD – configuration settings, EPROMs, communication
• Systems Management – FSM, PSM, HMC, FDMC UEFI, CMM, IOMC, CCE, PMC, DPSM, SVC, management of storage, services, IMM, FSP, systems management networking
  o Systems Management - Data Management
  o Systems Management - Events / Monitoring
  o Systems Management - Core / Virtual Appliance
  o Systems Management - Console
  o Systems Management - Security
  o Systems Management - Service & Support
  o Systems Management - Config Patterns
  o Systems Management - Updates
  o Systems Management - Backup/Restore & Failover (HA)
  o Systems Management - FlexCat OS/Config deployment
  o Systems Management - Remote Control
  o Systems Management - Network Management
• Time Reference – RTC, master clock, drawer clocks, NTP
• Hypervisor – virtual components, boots, crashes, SRIOV, LPARs
• OS/Hypervisor Interface – passing of error logs, partition management, services (time, etc.)
• OS – Power Linux, AIX IPL, AIX, crash and dump codes, IBM i kernel code, IBM i OS, management of storage
• Device Driver – AIX, IBM i, Subsystem Device Driver (SDD), IPMI Service
• Interconnect - Utilities / Infrastructure
• Interconnect - Fabric
• Interconnect - Networking – data network, network settings, ports, security, adapters, switches, fiber channel, optical ports, Ethernet
• Interconnect - PCI Manager

General Category of what is happening (Recommend to be required)
• Administrative – Audit messages, users logging in and out, powering on and off, virtual reseat, system restarts, Settings update by a user, add or remove hardware and firmware
• Security – security breaches, settings changed, policy updates (not related to failures)
• Unrecoverable Hardware Failure – Component failures
• Correctable Hardware Failure – Loss of redundant component, lane widths, single bit errors
• PFA – Predictive failures
• Status – update complete, occurring, paused, hung, Temp/power system improvements, RAID rebuilds
• Firmware/Software Incompatible – incompatibility with hardware, incompatible with other firmware/software
• Firmware/Software Not Valid – unrecognizable image, bad signature, entitlement, bad install/flash
• Firmware/Software Failure – halts, exceptions, invalid input, timeouts, failed updates
• Communication Failure/Timeout – I2C, Ethernet, SPI, etc. bad parity, no response, missing packet
• Monitoring Agent – watchdog timer failures, resets, etc
• User Defined Alerts – custom thresholds that are not normally PFAs
• Environmental – High/low temperature/voltage warnings and errors, policies, thresholds, air flow, water flow, humidity, etc
• Recovered – The system has returned to normal operation, deasserts
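One way to read the what/where split is as two independent enumerations carried on each log entry, so that an analyzer can filter on either axis without parsing message text. The Python sketch below is only an illustration of that idea, not part of the Redfish schema: the names `ImpactedArea`, `EventCategory`, `LogRecord`, and `select` are hypothetical, the sample records are made up, and only a subset of the values listed above is included.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class ImpactedArea(Enum):
    """Where in the system the event occurred (subset of the proposed list)."""
    SYSTEM_BOARD = "System Board"
    PROCESSING = "Processing"
    MEMORY = "Memory"
    POWER = "Power"
    COOLING = "Cooling"


class EventCategory(Enum):
    """What happened (subset of the proposed list)."""
    ADMINISTRATIVE = "Administrative"
    UNRECOVERABLE_HW_FAILURE = "Unrecoverable Hardware Failure"
    CORRECTABLE_HW_FAILURE = "Correctable Hardware Failure"
    PFA = "PFA"
    ENVIRONMENTAL = "Environmental"
    RECOVERED = "Recovered"


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical log record carrying both required classifications."""
    entry_id: str
    message: str
    area: ImpactedArea
    category: EventCategory


def select(records: Iterable[LogRecord],
           area: Optional[ImpactedArea] = None,
           category: Optional[EventCategory] = None) -> List[LogRecord]:
    """Return records matching the given area and/or category (None = any)."""
    return [
        r for r in records
        if (area is None or r.area is area)
        and (category is None or r.category is category)
    ]


# Illustrative, made-up records.
records = [
    LogRecord("1", "Fan 3 speed below threshold",
              ImpactedArea.COOLING, EventCategory.PFA),
    LogRecord("2", "DIMM 4 single-bit error corrected",
              ImpactedArea.MEMORY, EventCategory.CORRECTABLE_HW_FAILURE),
    LogRecord("3", "Power supply 1 failed",
              ImpactedArea.POWER, EventCategory.UNRECOVERABLE_HW_FAILURE),
]

memory_events = select(records, area=ImpactedArea.MEMORY)
predictive = select(records, category=EventCategory.PFA)
```

Keeping the two axes separate means a query such as "all predictive failures, regardless of component" needs no knowledge of the component list, which is the efficiency gain for log analyzers that the proposal is after.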

Some type of indicator of whether action needs to be taken. Severity really indicates urgency. The IBM internal term is "serviceable"; however, "actionable" may be a better standardized term.

An indicator of whether support is going to be automatically notified or should be notified. The IBM internal term is "call home"; recommend using "Notify support" as the term.

Severities: CIMOM previously supported 7 severity levels. How is this accounted for in the new model? Also, Info was dropped in favor of "OK".

Associated LEDs: Is the alert tied to an LED that's on? Examples: Fault roll-up or Check log LEDs. A flexible string field would probably be best due to the wide variety. Not required for every instance.

Maintenance Notes: These are highly helpful for admins working across several products over several days, because they keep all the information in a single place. Free-form string. Not required for every instance.

Associated Problem Maintenance Record ID: Populated if the log entry was sent to support automatically.

Virtual Machine Migration: Should a higher-level virtual machine manager consider vacating the system and/or partitions?

Related Event IDs: After processing is done, or when other known behaviors cause additional log entries, include those entries here. Not required.

Component Instance: Explicitly calls out a particular slot number/module for non-sensor-based monitoring. Can be null when not needed.

Associated Component Callout: For example, Power supply part #: ABCD1234
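Taken together, the additional fields suggested above could travel alongside a standard entry as an extension object. The Python sketch below shows one possible shape. `Id`, `Severity`, `Message`, and `Oem` are existing Redfish LogEntry properties; everything else, including the class `LogEntryExtension`, the property names (`Actionable`, `NotifySupport`, `AssociatedLEDs`, and so on), and the `Example` vendor key under `Oem`, is an assumption made for illustration and does not come from the schema or from the proposal's files.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LogEntryExtension:
    """Hypothetical container for the additional fields suggested above."""
    actionable: bool                              # does someone need to act?
    notify_support: bool                          # "call home" indicator
    associated_leds: Optional[str] = None         # free-form, e.g. "Check log"
    maintenance_notes: Optional[str] = None       # free-form admin notes
    problem_record_id: Optional[str] = None       # set if sent to support
    consider_vm_migration: Optional[bool] = None  # hint for a VM manager
    related_event_ids: List[str] = field(default_factory=list)
    component_instance: Optional[str] = None      # slot/module; may be null
    component_callout: Optional[str] = None       # e.g. a part number

    def to_dict(self) -> Dict[str, Any]:
        """Serialize, omitting 'not required' fields that were never set."""
        optional = {
            "AssociatedLEDs": self.associated_leds,
            "MaintenanceNotes": self.maintenance_notes,
            "ProblemRecordId": self.problem_record_id,
            "ConsiderVMMigration": self.consider_vm_migration,
            "RelatedEventIDs": self.related_event_ids,
            "ComponentCallout": self.component_callout,
        }
        out: Dict[str, Any] = {
            "Actionable": self.actionable,
            "NotifySupport": self.notify_support,
            # Kept even when unset, since the proposal allows an explicit null.
            "ComponentInstance": self.component_instance,
        }
        out.update({k: v for k, v in optional.items() if v not in (None, [])})
        return out


ext = LogEntryExtension(
    actionable=True,
    notify_support=True,
    associated_leds="Fault roll-up",
    related_event_ids=["41"],
    component_callout="Power supply part #: ABCD1234",
)

# A made-up entry; the extension rides under the standard "Oem" property.
entry = {
    "Id": "42",
    "Severity": "Critical",
    "Message": "Power supply 1 failed",
    "Oem": {"Example": ext.to_dict()},
}

print(json.dumps(entry, indent=2))
```

Placing the fields under `Oem` is just one option that works without changing the standard; if the suggestions were adopted into the schema itself, they would presumably become top-level properties instead.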

jk-ozlabs commented 7 years ago

I'm not really clear on what changes (to the schema) you're suggesting here. Are these necessary for POWER support of the Redfish spec, or are these general logging improvements?

thalerj commented 7 years ago

Jeremy, these are suggested improvements to logging. This issue doesn't include everything Venkatesh has added to the files themselves, and I'm working on putting together a complete list that's much cleaner. As for the logging improvements mentioned above, I'm working on a line item for the LC products to leverage these exact values to add troubleshooting value to the SEL logs.