caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Build with PGO #5588

Open bt90 opened 1 year ago

bt90 commented 1 year ago

Go 1.21 will ship with PGO support enabled by default. Maybe we can squeeze a little bit of performance out of this.

https://go.dev/doc/pgo
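For reference, enabling PGO in Go 1.21+ is just a build flag; a minimal sketch (the profile paths are illustrative):

```shell
# With Go 1.21+, a profile named default.pgo in the main package
# directory is picked up automatically (-pgo=auto is the default),
# or an explicit profile path can be passed:
go build -pgo=auto ./cmd/caddy
go build -pgo=/path/to/cpu.pprof ./cmd/caddy
```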

francislavoie commented 1 year ago

I'm not sure PGO will be a good fit for Caddy. A general purpose webserver that's user-configured can be used in infinite ways, so there's no one profile that would be the best fit.

I think it's unlikely that Matt or I will spend time on this, but contributions are welcome.

mohammed90 commented 1 year ago

I looked into that and thought about it. The optimization depends on the profiles of the production load, analyzing the executed paths, and optimizing the machine code based on known historic workload. Wouldn't this be different for every user? For instance, my own deployment doesn't use any of the FastCGI features, so any optimization based on profiles of my production deployment will not optimize FastCGI aspects. Different users utilize different parts of Caddy, so their preferred optimizations will be different.

What do you think?

bt90 commented 1 year ago

It's unlikely that we can cover all use cases, but that's also not the point of PGO. The detection and optimization of shared hot code paths would be good enough.

If I understood it correctly, we can also merge profiles. So we could generate two or three based on frequently used workloads.

bt90 commented 1 year ago

Is PGO limited to our code, or is it also applied to the dependencies we are using? The benefit would be a lot greater if it also applied to the code of e.g. quic-go.

Edit: it's pointed out in the FAQ:

> PGO in Go applies to the entire program. All packages are rebuilt to consider potential profile-guided optimizations, including standard library packages [...], including packages in dependencies
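Since Go records the `-pgo` setting in the binary's build metadata, whether a given build actually used a profile can be checked after the fact; a sketch (binary path is illustrative):

```shell
# Inspect the build settings embedded in a compiled binary;
# a PGO-enabled build should show a "build -pgo=..." line.
go version -m ./caddy | grep pgo
```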

mholt commented 1 year ago

The little bit I've read about PGO (as of this morning :sweat_smile:) is that it shouldn't slow down a program, but can offer nominal performance improvements in hot paths with a slightly larger binary size and slightly longer compile times.

I agree with @bt90, maybe we generate profiles that cover a few of the primary workloads.

Of course, because we don't have telemetry (:cry:) we have no idea what the popular configurations are, so we can only guess. (Thank you, unnecessary community backlash of 2018, for leaving us in the dark.)

I'd definitely be open to trying this after releasing 2.7.

Forza-tng commented 1 year ago

Perhaps we can have an option to turn on profiling with xcaddy; then the user can run their workloads for a bit and run xcaddy again with the profile as input? At least, this is how I do it with GCC PGO builds.

mholt commented 1 year ago

Profiles can be obtained from any Caddy instance, for years now -- just go to :2019/debug/pprof to see the profile options.

I actually collected a profile this week from our Caddy website, deployed a PGO-optimized instance of Caddy, and noticed barely any speedup... quite insignificant (maybe 2-4% depending on the run of the load test).

Maybe that's significant enough to warrant it, and maybe our profile didn't have enough data (I ran it for an hour but it's not a very busy site compared to big enterprise services).
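For anyone following along, grabbing a CPU profile from a running instance is a one-liner against the admin endpoint (address and duration below are illustrative):

```shell
# 5-minute CPU profile from Caddy's admin API (default :2019)
curl -sS -o cpu.pprof \
  "http://localhost:2019/debug/pprof/profile?seconds=300"
```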

Forza-tng commented 1 year ago

I had a go at a simple test by benchmarking with h2load against a Caddy file server. I ran each test 3 times, restarted Caddy, and ran it 3 times again. The variation was very small, which gives more confidence in the results.

h2load -n1000 -c10 -m10 "https://mirrors.tnonline.net/"

Result without PGO:

finished in 1.12s, 888.94 req/s, 16.92MB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 19.03MB (19953887) total, 8.08KB (8277) headers (space savings 95.57%), 18.99MB (19909000) data
                     min         max         mean         sd        +/- sd
time for request:     1.65ms    576.89ms     98.96ms    113.61ms    84.80%
time for connect:     5.99ms     21.65ms     14.14ms      5.27ms    60.00%
time to 1st byte:    33.93ms    156.91ms     58.03ms     36.55ms    90.00%
req/s           :      88.96      123.50       97.79       10.42    90.00%

Result with PGO:

finished in 919.88ms, 1087.10 req/s, 20.69MB/s
requests: 1000 total, 1000 started, 1000 done, 1000 succeeded, 0 failed, 0 errored, 0 timeout
status codes: 1000 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 19.03MB (19953864) total, 8.06KB (8254) headers (space savings 95.59%), 18.99MB (19909000) data
                     min         max         mean         sd        +/- sd
time for request:     1.83ms    550.37ms     77.79ms    103.68ms    84.20%
time for connect:     4.64ms     23.81ms     14.22ms      6.52ms    60.00%
time to 1st byte:    32.42ms    182.82ms     91.29ms     60.17ms    70.00%
req/s           :     108.84      137.31      116.76        9.73    80.00%

This is a 22% increase in handled requests per second. Not bad IMHO.

Profile was collected with go tool pprof "http://127.0.0.1:2019/debug/pprof/profile?seconds=600" while I was browsing the server from my phone, including other domains with MediaWiki and Nextcloud using php-fpm. Basically a normal usage pattern for this server.
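As a quick sanity check of the speedup figure, the two throughput numbers from the h2load runs above work out to:

```shell
# Speedup from 888.94 req/s (no PGO) to 1087.10 req/s (PGO):
awk 'BEGIN { printf "%.1f%%\n", (1087.10 - 888.94) / 888.94 * 100 }'
# → 22.3%
```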

The build script I use is:

#!/bin/sh
# Build Caddy with a PGO profile via GOFLAGS, then strip the binary
# and grant it permission to bind privileged ports.
export XCADDY_SETCAP=1
export GOARCH="amd64"
export GOAMD64="v3"
export CGO_ENABLED=1
export GOFLAGS="-pgo=/usr/src/caddy/default.pgo"
/root/go/bin/xcaddy build \
  --with github.com/caddyserver/caddy/v2=/usr/src/caddy/git/caddy \
  --with github.com/ueffel/caddy-brotli \
  --with github.com/caddyserver/transform-encoder \
  --with github.com/caddyserver/cache-handler \
  --with github.com/kirsch33/realip \
  --with github.com/git001/caddyv2-upload
strip -s -v caddy
setcap cap_net_bind_service=+ep ./caddy

Graph (PGO): [pprof call graph image]

Graph (no PGO): [pprof call graph image]

Forza-tng commented 1 year ago

> I'm not sure PGO will be a good fit for Caddy. A general purpose webserver that's user-configured can be used in infinite ways, so there's no one profile that would be the best fit.
>
> I think it's unlikely that Matt or I will spend time on this, but contributions are welcome.

I agree. PGO can be highly dependent on the use case and the host hardware configuration.

It may be better to include PGO support as an option in xcaddy? e.g. xcaddy --profile=/path/to.pprof. In addition, we can document how to gather several short samples over time from the running Caddy instance, how to merge them, and how to feed the result to xcaddy.

Quoting from the Go PGO page below:

> A more robust strategy is collecting multiple profiles at different times from different instances to limit the impact of differences between individual instance profiles. Multiple profiles may then be merged into a single profile for use with PGO.
>
> Many organizations run “continuous profiling” services that perform this kind of fleet-wide sampling profiling automatically, which could then be used as a source of profiles for PGO.

go tool pprof -proto sample1.pprof sample2.pprof > merged.pprof

https://go.dev/doc/pgo
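Putting the quoted strategy together with tooling that exists today (the xcaddy --profile flag is only a proposal; addresses, filenames, and timings below are illustrative):

```shell
#!/bin/sh
# 1. Collect several short CPU samples over time from the running instance.
for i in 1 2 3; do
  curl -sS -o "sample$i.pprof" \
    "http://localhost:2019/debug/pprof/profile?seconds=120"
  sleep 3600
done

# 2. Merge the samples into one profile (command from the Go PGO docs).
go tool pprof -proto sample1.pprof sample2.pprof sample3.pprof > merged.pprof

# 3. Point the build at the merged profile via GOFLAGS.
GOFLAGS="-pgo=$PWD/merged.pprof" xcaddy build
```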

mholt commented 1 year ago

I know @WeidiDeng has merged some Caddy profiles successfully for PGO.

Maybe I should ask the community to submit their profiles and we'll try merging them and see if that helps. Seeing your results above is encouraging so maybe we just need a variety.

Forza-tng commented 1 year ago

I'm thinking that the xcaddy option to build Caddy with a profile input is a good first step. What do you think of opening an issue at https://github.com/caddyserver/xcaddy ?

mholt commented 10 months ago

@Forza-tng That sounds like a plan. See https://github.com/caddyserver/xcaddy/issues/163