Cray-HPE / gru

A utility for reading and modifying BMCs (e.g. iLO, RMMC) using RedFish (gofish).
MIT License

Scaling Problem #73

Open rustydb opened 2 weeks ago

rustydb commented 2 weeks ago
SUMMARY

Report from user:

Asynchronously querying [ 2714] hosts ...

Resulted in

goroutine 6043 [select]:
net.(*Resolver).lookupIPAddr(0xfd1320, {0xbd5340?, 0xc003ffe5a0}, {0xab837f, 0x3}, {0xc00032c920, 0xb})
    /usr/local/go/src/net/lookup.go:334 +0x505
net.(*Resolver).internetAddrList(0xbd5340?, {0xbd5340?, 0xc003ffe5a0?}, {0xab837f, 0x3}, {0xc00032c920?, 0xc00175a6f8?})
    /usr/local/go/src/net/ipsock.go:288 +0x519
net.(*Resolver).resolveAddrList(0x1000a30?, {0xbd5340, 0xc003ffe5a0}, {0xab8609, 0x4}, {0xab837f?, 0xc00175a8a8?}, {0xc00032c920, 0xf}, {0x0, ...})
    /usr/local/go/src/net/dial.go:234 +0x41b
net.(*Dialer).DialContext(0xc0001cc070, {0xbd5308, 0xc0001b2000}, {0xab837f, 0x3}, {0xc00032c920, 0xf})
    /usr/local/go/src/net/dial.go:422 +0x448
net/http.(*Transport).dial(0x9a70a0?, {0xbd5308?, 0xc0001b2000?}, {0xab837f?, 0xc003ffe450?}, {0xc00032c920?, 0x0?})
    /usr/local/go/src/net/http/transport.go:1176 +0xe7
net/http.(*Transport).dialConn(0xc003fea640, {0xbd5308, 0xc0001b2000}, {{}, 0x0, {0xc000c345a0, 0x5}, {0xc00032c920, 0xf}, 0x0})
    /usr/local/go/src/net/http/transport.go:1614 +0x82c
net/http.(*Transport).dialConnFor(0x0?, 0xc002e13600)
    /usr/local/go/src/net/http/transport.go:1456 +0xb0
created by net/http.(*Transport).queueForDial
    /usr/local/go/src/net/http/transport.go:1425 +0x3ea

Trying again with only 200 hosts worked.

UPDATE: It seems to die after 242 hosts. The exact number may vary with hostname length or some other factor; I'm not sure.

gru.log

jacobsalmela commented 2 weeks ago

I was already looking at how the goroutines run for some other work, so I'm happy to look at this.

rustydb commented 5 days ago

Looking at the attached log, it seems as if there's a problem with either:

  1. Our use of WaitGroups; too many async calls at once, not enough channels
  2. Our use of HTTPClient

I'm inclined to think it's the former. Perhaps replacing our bare WaitGroups with something that manages them for us and caps how many goroutines run at once would help: https://github.com/nozzle/throttler
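
For reference, the same idea can be sketched with only the standard library: a buffered channel acts as a counting semaphore next to the WaitGroup, so no more than a fixed number of queries are in flight at once. This is a rough sketch, not gru's actual code. `queryAll`, its `limit` parameter, the `bmc-%04d` hostnames, and the no-op query function are all made up for illustration, and the limit of 200 is just the number that worked in the report above.

```go
package main

import (
	"fmt"
	"sync"
)

// queryAll runs query once per host with at most limit calls in flight.
// Hypothetical helper for illustration; it is not part of gru or gofish.
func queryAll(hosts []string, limit int, query func(host string) error) map[string]error {
	if limit < 1 {
		limit = 1
	}
	var (
		wg      sync.WaitGroup
		mu      sync.Mutex // guards results
		results = make(map[string]error, len(hosts))
		sem     = make(chan struct{}, limit) // counting semaphore
	)
	for _, host := range hosts {
		sem <- struct{}{} // blocks while limit goroutines are already running
		wg.Add(1)
		go func(h string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this query finishes
			err := query(h)
			mu.Lock()
			results[h] = err
			mu.Unlock()
		}(host)
	}
	wg.Wait()
	return results
}

func main() {
	// Assumed hostnames; the real list would come from the user's input.
	hosts := make([]string, 2714)
	for i := range hosts {
		hosts[i] = fmt.Sprintf("bmc-%04d", i)
	}
	// The no-op function stands in for the real Redfish call.
	results := queryAll(hosts, 200, func(string) error { return nil })
	fmt.Println(len(results), "hosts queried")
}
```

Because the semaphore is acquired before each goroutine is started, the loop itself stalls once the limit is reached, so the number of live goroutines (and open connections) stays bounded instead of growing with the host count.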