dxc-technology / Hanlon-Microkernel

A small (in-memory) Microkernel used by the Hanlon server for discovery of new nodes

Feature/issue 32 #34

Open hickey opened 8 years ago

hickey commented 8 years ago

This is a rewrite of the hnl_mk_hardware_facter.rb code into separate Facter modules. This makes the code easier to maintain and integrates the external facts directly into Facter: one can inspect the Facter variables directly on the system, which makes troubleshooting much easier.
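
For anyone unfamiliar with the structure, each module is a small Ruby file that registers its facts at load time. A minimal illustration (the fact name and parsing here are simplified placeholders, not the actual module code):

# Minimal sketch of a standalone Facter module (illustrative only; the
# real hnl_mk_* modules parse lshw output in much more detail).
require 'facter'

Facter.add('mk_hw_example_product') do
  setcode do
    output = Facter::Core::Execution.exec('lshw -c system 2>/dev/null')
    line = output.to_s.lines.find { |l| l =~ /^\s*product:/ }
    line && line.split(':', 2).last.strip
  end
end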

The processing of the lshw output has also been greatly simplified. The most complex processing occurs in hnl_mk_mem.rb due to the multi-level hash that gets generated; in this case the processing of the lshw output is delegated to the def_to_hash() function, which is called recursively for each level of the hash. It may be worth moving this function into each Facter module, or into a common file imported by each Facter module, but I have yet to find any multi-level structure beyond the memory class.
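
Roughly, the recursive approach looks like the following (a simplified sketch of the idea, not the actual def_to_hash() code):

# Simplified sketch of the recursive idea (not the actual def_to_hash()
# from hnl_mk_mem.rb): indentation in the lshw output determines nesting,
# and a line without a value opens a nested hash one level deeper.
def lines_to_hash(lines, depth = 0)
  hash = {}
  while (line = lines.first)
    indent = line[/\A */].size
    break if indent < depth            # this line belongs to the caller
    lines.shift
    key, value = line.strip.split(':', 2)
    if value.nil? || value.strip.empty?
      hash[key] = lines_to_hash(lines, indent + 1)   # recurse a level deeper
    else
      hash[key] = value.strip
    end
  end
  hash
end

# e.g. lines_to_hash(%x{lshw -c memory}.lines)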

A few new summary count variables have also been created to make it easier to produce better system tags. They take the form mk_hw_<item>_count, where <item> is cpu, core, disk, volume, or nic.

The Dockerfile now also creates the /etc/facter/facts.d directory to prompt end users to drop in scripts that produce their own facts. In addition, this is an excellent way for external systems to inject facts that the Hanlon controller can see and use to build tags from an external data source.
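
For example, any executable placed in /etc/facter/facts.d that prints key=value pairs to stdout is picked up by Facter as a source of external facts; a hypothetical script (names and values invented):

#!/usr/bin/env ruby
# Hypothetical external-fact script for /etc/facter/facts.d/ (not part
# of this PR). Facter runs any executable in this directory and treats
# each key=value line on stdout as an external fact.
puts 'mk_external_rack_location=dc1-rack42'   # invented example values
puts 'mk_external_asset_owner=platform-team'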

Similar to the above, a future development may be to query the Hanlon controller from a Facter module to inject additional facts from the controller's point of view. This would require a new REST call on the controller that responds with a JSON document containing fact definitions. This functionality could be used to create basic workflow processes: for example, booting microkernels for HW discovery, firmware upgrades, BIOS settings, registration in a PaaS system, OS install, and finally decommissioning of the HW asset at end of life (i.e. wiping the hard drives). This would allow Hanlon to begin managing the full lifecycle of an asset.
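
A speculative sketch of what such a module might look like (the endpoint, URL, and JSON shape below are all hypothetical, since the REST call does not exist on the controller yet):

# Speculative sketch only: the /facts endpoint, URL, and JSON shape are
# hypothetical; no such REST call exists on the controller yet.
require 'facter'
require 'json'
require 'net/http'

uri = URI('http://hanlon.example.com:8026/hanlon/api/v1/facts')  # hypothetical
begin
  response = Net::HTTP.get_response(uri)
  if response.is_a?(Net::HTTPSuccess)
    # assume a flat JSON object of fact-name => value pairs
    JSON.parse(response.body).each do |name, value|
      Facter.add(name) { setcode { value } }
    end
  end
rescue StandardError
  # controller unreachable: contribute no extra facts
end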

hickey commented 8 years ago

I also meant to mention that I have been testing this code base on multiple bare-metal machines and VMs to resolve differences between hardware platforms. At this point I am not finding any variances on any platform and consider it good enough for general use, at least until someone documents differences in the output of lshw for a specific platform.

jcpowermac commented 8 years ago

@hickey Thanks for your contribution. I will work on testing this PR today.

tjmcs commented 8 years ago

Looks like a great start (based on a quick glance through the code). I'll dig into it a bit more later, but my preliminary review left me with a few questions:

I'd also like to see a plan for how we might add support for discovering the network topology around the node (via LLDP) to the Microkernel, since that's what started this discussion originally, but that can be captured as an issue (marked as an enhancement) and added in a later PR.

Other than those concerns, this looks great @hickey, and I'm looking forward to seeing these changes merged.

jcpowermac commented 8 years ago

@hickey Using your most recent commit:

⚡ root@hanlon-mk1  ~/Hanlon-Microkernel   facterfix  git rev-parse --verify HEAD 
f3a1811432843774a99255683e604be7860e9131  

I am using two virtual machines with Rancher OS version 0.4.3:

I get this from the docker log, and also when I run it manually:

bash-4.3# /usr/local/bin/hnl_mk_init.rb 
Looking for network, this is attempt #1
Network is available, proceeding...
/usr/local/lib/site_ruby/facter/hnl_mk_sys.rb:20:in `block (2 levels) in <top (required)>': undefined method `gsub' for nil:NilClass (NoMethodError)
        from /usr/local/lib/site_ruby/facter/hnl_mk_sys.rb:20:in `collect'
        from /usr/local/lib/site_ruby/facter/hnl_mk_sys.rb:20:in `block in <top (required)>'
        from /usr/local/lib/site_ruby/facter/hnl_mk_sys.rb:15:in `each'
        from /usr/local/lib/site_ruby/facter/hnl_mk_sys.rb:15:in `<top (required)>'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:130:in `load'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:130:in `kernel_load'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:115:in `load_file'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:49:in `block (2 levels) in load_all'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:47:in `each'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:47:in `block in load_all'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:45:in `each'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/loader.rb:45:in `load_all'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/collection.rb:76:in `fact'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter/util/collection.rb:129:in `value'
        from /usr/lib/ruby/gems/2.2.0/gems/facter-2.4.6/lib/facter.rb:117:in `value'
        from /usr/local/lib/ruby/hanlon_microkernel/hnl_host_utils.rb:15:in `initialize'
        from /usr/local/bin/hnl_mk_init.rb:38:in `new'
        from /usr/local/bin/hnl_mk_init.rb:38:in `<main>'
hickey commented 8 years ago

As far as I can tell, all the existing fact names are the same. I did not want to disturb existing installations by changing fact names and cause all sorts of heartburn. Yes, there are a couple of new fact names, but they remain consistent with the naming of the existing facts. If you find that is not the case, please let me know so the scripts can be corrected.

The facts from lscpu are still there; they are generated by the hnl_mk_cpu.rb file. I did add a couple of new ones, given the output of lscpu:

mk_hw_lscpu_architecture => x86_64
mk_hw_lscpu_bogomips => 6784.58
mk_hw_lscpu_byte_order => Little Endian
mk_hw_lscpu_cores_per_socket => 4
mk_hw_lscpu_cpu_family => 6
mk_hw_lscpu_cpu_mhz => 3392.294
mk_hw_lscpu_cpu_op-modes => 32-bit, 64-bit
mk_hw_lscpu_hypervisor_vendor => VMware
mk_hw_lscpu_l1d_cache => 32K
mk_hw_lscpu_l1i_cache => 32K
mk_hw_lscpu_l2_cache => 256K
mk_hw_lscpu_l3_cache => 8192K
mk_hw_lscpu_model => 42
mk_hw_lscpu_numa_nodes => 1
mk_hw_lscpu_sockets => 1
mk_hw_lscpu_stepping => 7
mk_hw_lscpu_threads_per_core => 1
mk_hw_lscpu_vendor_id => GenuineIntel
mk_hw_lscpu_virtualization_type => full
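
The generation itself is straightforward; roughly (a simplified sketch, not the exact hnl_mk_cpu.rb code, and the real key normalization may differ):

# Simplified sketch of lscpu fact generation (not the exact
# hnl_mk_cpu.rb code; the real key normalization may differ).
require 'facter'

%x{lscpu}.each_line do |line|
  key, value = line.split(':', 2)
  next if key.nil? || value.nil?
  name = 'mk_hw_lscpu_' + key.strip.downcase.gsub(/[^a-z0-9]+/, '_')
  Facter.add(name) do
    setcode { value.strip }
  end
end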

As for the BMC-generated facts, I have not created a file for those yet, principally because of issue #31. Once #31 is resolved, dropping in a new file to generate the BMC facts is trivial. The more I investigate, the more I find that this is an issue with RancherOS: it seems that without driver support from them it is not possible to make a connection to the BMC from within a Docker container. I expect they will be at DockerCon in Seattle, and I plan on cornering them until they understand that they need to support access to the BMC if they expect to be widely used in datacenters.

ARG!! Can you post the output of lshw -c system? The system facts were a late addition. It did not seem like the original fact-generation script produced facts from the system class (but I might have overlooked the code), and the output of the system class is slightly different from that of the other classes. From the stack trace it looks like a line is triggering the regex but nothing is getting captured before the colon... very interesting.
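
In isolation, the failure would look something like this (a hypothetical reduction; the actual regex in hnl_mk_sys.rb may differ):

# Hypothetical reduction of the crash at hnl_mk_sys.rb:20 (the actual
# regex may differ): a capture that matches nothing returns nil, and
# calling gsub on nil raises exactly the NoMethodError shown above.
line = ': a value with nothing before the colon'
key  = line[/^(\w[\w ]*):/, 1]  # no word characters before ':' => nil
key.gsub(' ', '_')              # NoMethodError: undefined method `gsub' for nil:NilClass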

Actually, in the first Facter module script I had exception handling around the entire fact-generating loop for troubleshooting purposes. Once things got much more stable I pulled it out, as it was not really needed. I have considered putting in much more constrained exception handling, and I suspect it really is going to be needed. One of the effects I saw early on was that the entire Docker container would exit on an exception (which I suspect you saw also), and that is not good for anyone.

Give me a day or so to work exception handling back into the scripts. My current thought is to have two levels. The inner level will handle problems with parsing individual output lines (like the one you hit) and continue processing, trying to generate as many facts as possible in the event of an exception. The outer level will handle cases where things are really messed up (like lshw failing to execute, or output in a significantly different form than expected), output some diagnostic information, and request that it be posted as an issue. With the variety of hardware out there, this should help identify where the scripts can be updated to support specific hardware and make them more bulletproof.
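
In sketch form, the two levels would look something like this (structure only; the final implementation may differ):

# Structural sketch of the two-level plan (the final implementation may
# differ): the inner rescue skips a single unparseable line and keeps
# going; the outer rescue catches wholesale failures and emits
# diagnostics instead of killing the container.
begin
  output = %x{lshw -c system 2>/dev/null}
  raise 'no output from lshw' if output.empty?
  output.each_line do |line|
    begin
      key, value = line.strip.split(':', 2)
      next if key.nil? || value.nil?
      # ... register the fact here ...
    rescue StandardError => e
      STDERR.puts "skipping unparseable line #{line.inspect}: #{e.message}"
    end
  end
rescue StandardError => e
  STDERR.puts "fact generation failed entirely: #{e.message}; please report this as an issue"
end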

hickey commented 8 years ago

BTW, I also have been using RancherOS 0.4.3 for testing on the following platforms:

hickey commented 8 years ago

No need to post the lshw output; I am finding the same issue on one of my test platforms. The inner-loop exception handling takes care of it. I will add the exception handling to all the files and update the branch.

jcpowermac commented 8 years ago

@hickey The last time I tested this it was broken in my KVM test environment. Let me try it again and I will get back to you.

hickey commented 8 years ago

I have resolved the problems with communicating with the BMC from within a Docker container. There is a fundamental change to the way the microkernel container gets started so that the IPMI device drivers are loaded and linked into the container: I have added an entrypoint script to the microkernel so that when it initializes, the drivers are loaded and visible within the container. I suspect a similar process will be needed for LLDP to operate correctly within the container.
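
The shape of that entrypoint is roughly as follows (a simplified sketch, not the exact script; the module names are the standard Linux IPMI stack, and the container has to run privileged for modprobe to work):

#!/usr/bin/env ruby
# Simplified sketch of an entrypoint that loads the IPMI drivers before
# handing off to the microkernel init (not the exact script in this PR;
# requires a privileged container).
%w[ipmi_msghandler ipmi_devintf ipmi_si].each do |mod|
  system('modprobe', mod) || STDERR.puts("could not load #{mod}")
end
exec '/usr/local/bin/hnl_mk_init.rb'   # hand off to the normal init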

Now that the BMC controller can be accessed from within the container, I have added the IPMI Facter code. Here is sample output of the IPMI facts:

mk_ipmi_802.1q_vlan_id => Disabled
mk_ipmi_802.1q_vlan_priority => 0
mk_ipmi_additional_device_support => ["Sensor Device", "SDR Repository Device", "SEL Device", "FRU Inventory Device", "IPMB Event Receiver", "IPMB Event Generator", "Chassis Device"]
mk_ipmi_auth_type_support => MD2 MD5 OEM
mk_ipmi_aux_firmware_rev_info => ["0x01", "0x00", "0x00", "0x00"]
mk_ipmi_backup_gateway_ip => 0.0.0.0
mk_ipmi_backup_gateway_mac => 00
mk_ipmi_bad_password_threshold => Not Available
mk_ipmi_bmc_arp_control => ARP Responses Enabled, Gratuitous ARP Disabled
mk_ipmi_cipher_suite_priv_max => aaaaXXaaaXXaaXX
mk_ipmi_default_gateway_ip => 10.2.8.1
mk_ipmi_default_gateway_mac => 00
mk_ipmi_device_available => yes
mk_ipmi_device_id => 32
mk_ipmi_device_revision => 1
mk_ipmi_firmware_revision => 1.38
mk_ipmi_fru_0_board_mfg => Super Micro
mk_ipmi_fru_0_board_mfg_date => Mon Jan  1 00
mk_ipmi_fru_0_board_part_number => Winbond Hermon
mk_ipmi_fru_0_board_product => IPMI 2.0
mk_ipmi_fru_0_board_serial =>
mk_ipmi_fru_0_product_manufacturer => Super Micro
mk_ipmi_fru_0_product_name => IPMI 2.0
mk_ipmi_fru_0_product_part_number => Winbond Hermon
mk_ipmi_fru_0_product_serial =>
mk_ipmi_gratituous_arp_intrvl => 0.0 seconds
mk_ipmi_ip_address => 10.2.8.125
mk_ipmi_ip_address_source => Static Address
mk_ipmi_ip_header => TTL=0x00 Flags=0x00 Precedence=0x00 TOS=0x00
mk_ipmi_ipmi_version => 2.0
mk_ipmi_mac_address => 00
mk_ipmi_manufacturer_id => 47488
mk_ipmi_manufacturer_name => Unknown (0xB980)
mk_ipmi_product_id => 43707 (0xaabb)
mk_ipmi_product_name => Unknown (0xAABB)
mk_ipmi_provides_device_sdrs => no
mk_ipmi_rmcp+_cipher_suites => 1,2,3,6,7,8,11,12,0
mk_ipmi_snmp_community_string => AMI
mk_ipmi_subnet_mask => 255.255.252.0
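
Roughly, the module lowercases the keys from the ipmitool output into fact names; a simplified sketch (not the exact code):

# Simplified sketch only (the exact IPMI Facter code in this branch may
# differ): turn "Key : Value" lines from ipmitool into mk_ipmi_* facts.
require 'facter'

%x{ipmitool mc info 2>/dev/null; ipmitool lan print 2>/dev/null}.each_line do |line|
  key, value = line.split(':', 2)
  next if key.nil? || value.nil? || key.strip.empty?
  name = 'mk_ipmi_' + key.strip.downcase.gsub(/\s+/, '_')
  Facter.add(name) do
    setcode { value.strip }
  end
end

(Note that splitting on the first colon truncates values that themselves contain colons, such as MAC addresses and timestamps, which is likely why some values above show a bare 00.)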

I have one issue that developed last night that needs to be resolved before any of this code can be approved: after several of the container changes, the booted node stopped sending Facter data back to Hanlon. I plan to have this resolved today for the final change.

hickey commented 8 years ago

OK. Things are now back to normal and everything is as expected.

I will look into updating the ProjectHanlon::Object::update_self to give meaningful information when there are errors storing data to the persistent store.

tjmcs commented 7 years ago

Where are we at on this PR, @hickey? My apologies for the delayed response, but I left CSC for Intel last February, then moved on to DataNexus (my current company) in September. In the meantime, @jcpowermac and @mtnbikenc (the developers I left this project with when I left CSC last February) have also left CSC for RHAT. Somehow in the process of moving from CSC to RHAT, this project (and the Hanlon project) seems to have been left leaderless. I'm going to try to pick up support for the projects now that I'm not at Intel anymore, but it may take a bit of time for me to test these changes out. Knowing where we're at would help me greatly, so LMK what you can. Has the last update in your conversation with @jcpowermac (to give meaningful output when there are errors storing data) been made?

hickey commented 7 years ago

Sorry for the delay getting back to you! I also started a new job at the end of the year and am still trying to get reorganized.

I think the PR is good to go, but I will look things over again within the next week. I usually try to leave things in a pretty good state unless there are a bunch of fixes that I am working on (although I will usually close the PR and open a new one when done).

As for a more meaningful message during data storage failures, I suspect this has not been done yet. I will check my private repo at home to see if there are any commits that have not been pushed.

On another note, I am just coming back from the Cloud Native Conf in Berlin. I am thinking that I will submit a presentation for the next Cloud Native Conf in December. It would be great to have these updates in the code base as the conf approaches; I would not expect any problems, as there is certainly enough time. Just figured I would give you a heads-up.

More later as things settle down again.