flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

t2310-resource-module test fails when PCI link speeds change #3643

Closed: garlick closed this issue 3 years ago

garlick commented 3 years ago

This test failure seems to result from the PCI link speed changing, presumably due to power management, when the topology is sampled twice on the same node.

Are there hwloc flags we could use to exclude this level of detail from the topology?

expecting success: 
    for i in $(seq 0 $(($SIZE-1))); do \
        test_cmp hwloc_direct/$i.xml hwloc/$i.xml || return 1; \
    done

--- hwloc_direct/0.xml  2021-05-08 17:00:33.826895086 +0000
+++ hwloc/0.xml 2021-05-08 17:00:33.522888343 +0000
@@ -131,10 +131,10 @@
           </object>
         </object>
       </object>
-      <object type="Bridge" bridge_type="1-1" depth="1" bridge_pci="0000:[09-09]" pci_busid="0000:00:03.1" pci_type="0604 [1022:1483] [1022:1453] 00" pci_link_speed="4.923077">
+      <object type="Bridge" bridge_type="1-1" depth="1" bridge_pci="0000:[09-09]" pci_busid="0000:00:03.1" pci_type="0604 [1022:1483] [1022:1453] 00" pci_link_speed="2.000000">
         <info name="PCIVendor" value="Advanced Micro Devices, Inc. [AMD]"/>
         <info name="PCIDevice" value="Starship/Matisse GPP Bridge"/>
-        <object type="PCIDev" pci_busid="0000:09:00.0" pci_type="0300 [10de:128b] [1043:85f7] a1" pci_link_speed="4.923077">
+        <object type="PCIDev" pci_busid="0000:09:00.0" pci_type="0300 [10de:128b] [1043:85f7] a1" pci_link_speed="2.000000">
           <info name="PCIVendor" value="NVIDIA Corporation"/>
           <info name="PCIDevice" value="GK208B [GeForce GT 710]"/>
           <object type="OSDev" name=":1.0" osdev_type="1">
not ok 20 - hwloc XML from both sources match

SteVwonder commented 3 years ago

This looks familiar (I think Garrett ran into it a while back). See the previous issue (https://github.com/flux-framework/flux-core/issues/2988) and PR (https://github.com/flux-framework/flux-core/pull/2998). In that PR, @grondo added

sanitize_hwloc_xml() {
    sed 's/pci_link_speed=".*"//g' $1
}

to t2005 to avoid the issue. Seems like it could work here too.
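
For illustration, here is a rough sketch of how the same idea might be applied to the comparison in this test. It is not the code from the PR: the helper names `sanitize_pci_link_speed` and `hwloc_xml_equal` are hypothetical, plain `diff -u` stands in for `test_cmp`, and the pattern uses `[^"]*` rather than `.*` on the assumption that we only want to drop the one attribute even if other attributes follow it on the same line.

```shell
#!/bin/sh
# Hypothetical helper (not the function from the PR): strip only the
# pci_link_speed attribute and leave the rest of each line intact.
sanitize_pci_link_speed() {
    # [^"]* stops at the attribute's own closing quote, so attributes
    # that follow pci_link_speed on the same line are preserved.
    sed 's/pci_link_speed="[^"]*"//g' "$1"
}

# Hypothetical helper: compare two topology dumps while ignoring link speed.
# Writes sanitized copies next to the inputs and returns 0 if they match.
hwloc_xml_equal() {
    sanitize_pci_link_speed "$1" > "$1.sanitized" &&
    sanitize_pci_link_speed "$2" > "$2.sanitized" &&
    diff -u "$1.sanitized" "$2.sanitized"
}
```

In the loop from the failing test, the `test_cmp` call would then become something like `hwloc_xml_equal hwloc_direct/$i.xml hwloc/$i.xml || return 1`.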