beam-cloud / beta9

The open-source serverless GPU container runtime.
https://docs.beta9.beam.cloud
GNU Affero General Public License v3.0

Fix high gateway CPU usage #300

Closed luke-lombardi closed 2 weeks ago

luke-lombardi commented 2 weeks ago

It turns out that when an HTTP endpoint request is pending and the request buffer shuts down (for example, because no containers are running or the serve is cancelled), we never return from this select, so it keeps looping, locking and unlocking indefinitely. Relevant pprof dump:

(pprof) top
Showing nodes accounting for 29.61s, 95.89% of 30.88s total
Dropped 301 nodes (cum <= 0.15s)
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
     9.03s 29.24% 29.24%      9.04s 29.27%  runtime.lock2
     7.94s 25.71% 54.95%      7.95s 25.74%  runtime.unlock2
     7.62s 24.68% 79.63%     28.65s 92.78%  runtime.selectgo
     1.48s  4.79% 84.42%     10.51s 34.03%  runtime.sellock
     1.03s  3.34% 87.76%      9.40s 30.44%  runtime.selunlock
     0.79s  2.56% 90.32%      0.79s  2.56%  runtime.fastrand (inline)
     0.73s  2.36% 92.68%      0.94s  3.04%  context.(*cancelCtx).Done
     0.39s  1.26% 93.94%     29.98s 97.09%  github.com/beam-cloud/beta9/pkg/abstractions/endpoint.(*RequestBuffer).ForwardRequest
     0.37s  1.20% 95.14%      0.37s  1.20%  runtime/internal/syscall.Syscall6
     0.23s  0.74% 95.89%      8.38s 27.14%  runtime.unlock (inline)