bt90 opened this issue 1 year ago
I'm not sure PGO will be a good fit for Caddy. A general purpose webserver that's user-configured can be used in infinite ways, so there's no one profile that would be the best fit.
I think it's unlikely that Matt or I will spend time on this, but contributions are welcome.
I looked into that and thought about it. The optimization depends on the profiles of the production load, analyzing the executed paths, and optimizing the machine code based on known historic workload. Wouldn't this be different for every user? For instance, my own deployment doesn't use any of the FastCGI features, so any optimization based on profiles of my production deployment will not optimize FastCGI aspects. Different users utilize different parts of Caddy, so their preferred optimizations will be different.
What do you think?
It's unlikely that we can cover all use cases, but that's also not the point of PGO. Detecting and optimizing shared hot code paths would be good enough.
If I understood it correctly, we can also merge profiles. So we could generate 2-3 profiles based on frequently used workloads:
Is PGO limited to our code or is this also applied to the dependencies we are using? The benefit would be a lot greater if this would also apply to the code of e.g quic-go
Edit: it's pointed out in the FAQ:
> PGO in Go applies to the entire program. All packages are rebuilt to consider potential profile-guided optimizations, including standard library packages [...], including packages in dependencies
The little bit I've read about PGO (as of this morning :sweat_smile:) suggests it shouldn't slow down a program, but can offer nominal performance improvements in hot paths, at the cost of a slightly larger binary and slightly longer compile times.
I agree with @bt90 -- maybe we generate profiles that primarily exercise:
Of course, because we don't have telemetry (:cry:) we have no idea what the popular configurations are, so we can only guess. (Thank you, unnecessary community backlash of 2018, for leaving us in the dark.)
I'd definitely be open to trying this after releasing 2.7.
Perhaps we can have an option to turn on profiling with xcaddy; then the user can run their workloads for a bit and run xcaddy again with the profile as input? At least, this is how I do it with GCC PGO builds.
Profiles can be obtained from any Caddy instance, and have been for years now -- just go to :2019/debug/pprof to see the profile options.
I actually collected a profile this week from our Caddy website and deployed a PGO-optimized instance of Caddy, and noticed barely any speedup... quite insignificant (maybe 2-4% depending on the run of the load test).
Maybe that's significant enough to warrant it, and maybe our profile didn't have enough data (I ran it for an hour but it's not a very busy site compared to big enterprise services).
I had a go at a simple test, benchmarking a Caddy file server with h2load. I ran each test 3 times, restarted Caddy, and ran it 3 times again. The variation was very small, which gives me more confidence in the results.
```shell
h2load -n1000 -c10 -m10 "https://mirrors.tnonline.net/"
```
Result without PGO:

```
finished in 1.12s, 888.94 req/s, 16.92MB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 19.03MB (19953887) total, 8.08KB (8277) headers (space savings 95.57%), 18.99MB (19909000) data

                    min      max       mean      sd        +/- sd
time for request:   1.65ms   576.89ms  98.96ms   113.61ms  84.80%
time for connect:   5.99ms   21.65ms   14.14ms   5.27ms    60.00%
time to 1st byte:   33.93ms  156.91ms  58.03ms   36.55ms   90.00%
req/s           :   88.96    123.50    97.79     10.42     90.00%
```
Result with PGO:

```
finished in 919.88ms, 1087.10 req/s, 20.69MB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 19.03MB (19953864) total, 8.06KB (8254) headers (space savings 95.59%), 18.99MB (19909000) data

                    min      max       mean      sd        +/- sd
time for request:   1.83ms   550.37ms  77.79ms   103.68ms  84.20%
time for connect:   4.64ms   23.81ms   14.22ms   6.52ms    60.00%
time to 1st byte:   32.42ms  182.82ms  91.29ms   60.17ms   70.00%
req/s           :   108.84   137.31    116.76    9.73      80.00%
```
This is a 22% increase in handled requests per second. Not bad IMHO.
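As a quick sanity check of that figure, the throughput delta between the two runs works out as:

```shell
# (1087.10 - 888.94) / 888.94 ≈ 22.3% more requests per second with PGO
awk 'BEGIN { printf "%.1f%%\n", (1087.10 - 888.94) / 888.94 * 100 }'
# → 22.3%
```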
The profile was collected with

```shell
go tool pprof "http://127.0.0.1:2019/debug/pprof/profile?seconds=600"
```

while I was browsing the server from my phone, including other domains running MediaWiki and Nextcloud via php-fpm -- basically the normal usage pattern for this server.
The build script I use is:

```shell
#!/bin/sh
export XCADDY_SETCAP=1
export GOARCH="amd64"
export GOAMD64="v3"
export CGO_ENABLED=1
export GOFLAGS="-pgo=/usr/src/caddy/default.pgo"

/root/go/bin/xcaddy build \
    --with github.com/caddyserver/caddy/v2=/usr/src/caddy/git/caddy \
    --with github.com/ueffel/caddy-brotli \
    --with github.com/caddyserver/transform-encoder \
    --with github.com/caddyserver/cache-handler \
    --with github.com/kirsch33/realip \
    --with github.com/git001/caddyv2-upload

strip -s -v caddy
setcap cap_net_bind_service=+ep ./caddy
```
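To double-check that the profile was actually picked up, Go 1.21+ records the `-pgo` setting in the binary's embedded build info (which normally survives `strip -s`), so something like this should show it:

```shell
# A "build -pgo=..." line confirms the profile was used in the build.
go version -m ./caddy | grep -- -pgo
```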
Graph (PGO)
Graph (no PGO)
> I'm not sure PGO will be a good fit for Caddy. A general purpose webserver that's user-configured can be used in infinite ways, so there's no one profile that would be the best fit.
>
> I think it's unlikely that Matt or I will spend time on this, but contributions are welcome.
I agree. PGO can be highly dependent on the use case and the host hardware configuration.
It may be better to include PGO support as an option with xcaddy, e.g. `xcaddy --profile=/path/to.pprof`. In addition, we can document how to gather several short samples over time from a running Caddy instance, how to merge them, and how to feed the result to xcaddy.
Quoting from the Go PGO page below:
> A more robust strategy is collecting multiple profiles at different times from different instances to limit the impact of differences between individual instance profiles. Multiple profiles may then be merged into a single profile for use with PGO.
>
> Many organizations run "continuous profiling" services that perform this kind of fleet-wide sampling profiling automatically, which could then be used as a source of profiles for PGO.
```shell
go tool pprof -proto sample1.pprof sample2.pprof > merged.pprof
```
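To automate the "several short samples over time" idea, a small wrapper along these lines could work (a sketch only; the `merge_profiles` name and usage are mine, not from xcaddy or the Go toolchain):

```shell
#!/bin/sh
# Sketch of a merge helper: combine pprof CPU samples collected at
# different times into a single profile for use with -pgo.
merge_profiles() {
    # usage: merge_profiles out.pgo sample1.pprof [sample2.pprof ...]
    if [ "$#" -lt 2 ]; then
        echo "usage: merge_profiles out.pgo sample.pprof [more.pprof ...]"
        return 1
    fi
    out="$1"
    shift
    # pprof sums matching samples across all inputs when exporting to proto.
    go tool pprof -proto "$@" > "$out"
}
```

The merged output can then be passed to the build via `GOFLAGS="-pgo=..."`, as in the build script above.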
I know @WeidiDeng has merged some Caddy profiles successfully for PGO.
Maybe I should ask the community to submit their profiles and we'll try merging them and see if that helps. Seeing your results above is encouraging so maybe we just need a variety.
I'm thinking that the xcaddy option to build Caddy with profile input is a good first step. What do you think of opening an issue at https://github.com/caddyserver/xcaddy?
@Forza-tng That sounds like a plan. See https://github.com/caddyserver/xcaddy/issues/163
Go 1.21 will ship with PGO support enabled by default. Maybe we can squeeze a little bit of performance out of this.
https://go.dev/doc/pgo
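For reference, with Go 1.21+ the `-pgo` flag defaults to `auto`: a file named `default.pgo` in the main package's directory is picked up automatically, so a merged profile committed there would need no extra flags (assuming `cmd/caddy` remains Caddy's main package):

```shell
# With -pgo=auto (the default since Go 1.21), no GOFLAGS are needed:
cp merged.pprof cmd/caddy/default.pgo
go build ./cmd/caddy
```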