bmc-toolbox / bmclib

Library to abstract Baseboard Management Controller interaction
Apache License 2.0
194 stars 37 forks source link

Refactor BMC provider opener to probe concurrently #322

Closed chrisdoherty4 closed 1 year ago

chrisdoherty4 commented 1 year ago

Summary

In the Tinkerbell project where we leverage bmclib in Cluster API contexts we've had reports of BMC interactions taking too long resulting in CAPI components rolling nodes it thinks are faulty. As part of the investigation we dived into the timeout capabilities available in bmclib when connections/interfaces are initial established/probed and found a user can either (1) specify a static timeout that's shared among all providers (IPMI, Redfish etc) or (2) specify a per provider timeout that will linearly grow the total timeout applied with the total providers attempted.

This change alters the algorithm to open connections to operate concurrently. By operating concurrently we stand a better chance of giving each provider the maximal timeout period to establish its connection and avoid the linear growth with per provider timeouts.

We haven't benchmarked this change and acknowledge systems that don't offload network IO or are single CPU won't necessarily benefit but given modern consumer hardware is generally multicore we anticipate it will bring a general improvement.

joelrebel commented 1 year ago

Thanks for the efforts here! Could we also include a check for Goroutine leaks in TestOpenConnectionFromInterfaces so its covered for now and any changes going ahead.

chrisdoherty4 commented 1 year ago

I was almost ready to open a PR.

Nice, I did this quite late so haven't verified everything. Lets compare the approaches and see what we prefer.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 95.00% and project coverage change: +0.28 :tada:

Comparison is base (8712352) 43.23% compared to head (be7fed9) 43.52%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #322 +/- ## ========================================== + Coverage 43.23% 43.52% +0.28% ========================================== Files 35 35 Lines 2736 2748 +12 ========================================== + Hits 1183 1196 +13 + Misses 1430 1429 -1 Partials 123 123 ``` | [Impacted Files](https://app.codecov.io/gh/bmc-toolbox/bmclib/pull/322?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=bmc-toolbox) | Coverage Δ | | |---|---|---| | [bmc/connection.go](https://app.codecov.io/gh/bmc-toolbox/bmclib/pull/322?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=bmc-toolbox#diff-Ym1jL2Nvbm5lY3Rpb24uZ28=) | `94.11% <95.00%> (+3.04%)` | :arrow_up: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

chrisdoherty4 commented 1 year ago

Thanks for the efforts here! Could we also include a check for Goroutine leaks in TestOpenConnectionFromInterfaces so its covered for now and any changes going ahead.

Overlooked this comment, will add it in.

chrisdoherty4 commented 1 year ago

@joelrebel The goleak tests are in place. I ran the tests repeatedly to tease our any concurrency problems but they consistently passed.