envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Allow worker CPU affinity to be set #14619

Open hazelnutsgz opened 3 years ago

hazelnutsgz commented 3 years ago

I was wondering: is there any feature/API that could bind a worker to a dedicated CPU? Something like the code in tools/perf/bench/epoll-wait.c:

if (!noaffinity) {
    CPU_ZERO(&cpuset);
    CPU_SET(cpu->map[i % cpu->nr], &cpuset);

    ret = pthread_attr_setaffinity_np(&thread_attr, sizeof(cpu_set_t), &cpuset);
    if (ret)
        err(EXIT_FAILURE, "pthread_attr_setaffinity_np");

    attrp = &thread_attr;
}

ret = pthread_create(&w->thread, attrp, workerfn,
                   (void *)(struct worker *) w);

Thanks~

mattklein123 commented 3 years ago

I don't believe we have this today but could be useful. cc @rojkov @jmarantz

jmarantz commented 3 years ago

To answer the question, I don't think there's a current feature in Envoy that would enable you to manually bind workers to cores.

I am assuming you are suggesting this because you see some potential for performance improvement relative to the automatic assignments done by the CPU/OS for some workload.

Here are some possible complexities with this path:

The number of cores you reserve for these auxiliary threads may depend on your hardware, your workload, how often you get xDS updates, and which extensions you have enabled.

My suggestion is to try to visualize the CPU usage per thread and see if you can find some behavior from the automatic thread/core bindings that you feel is worth overriding. Then do a ton of benchmarks (see https://github.com/envoyproxy/nighthawk) to prove you've made things better.
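For the visualization part, `top -H -p <envoy pid>` or `pidstat -t` already give a rough per-thread picture. If you want something scriptable, a small sketch along these lines could work (just an illustration, not Envoy tooling; it reads the utime/stime and last-run-CPU fields described in proc(5)):

/* Sketch: print each thread of a process with the CPU it last ran on and its
 * accumulated user/system CPU time, by reading /proc/<pid>/task/<tid>/stat.
 * Usage: ./threadcpu <pid>, where <pid> would be the Envoy process id. */
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    char task_dir[64];
    snprintf(task_dir, sizeof(task_dir), "/proc/%s/task", argv[1]);

    DIR *dir = opendir(task_dir);
    if (!dir) {
        perror("opendir");
        return 1;
    }

    struct dirent *ent;
    while ((ent = readdir(dir)) != NULL) {
        if (ent->d_name[0] == '.')
            continue;

        char stat_path[256], line[4096];
        snprintf(stat_path, sizeof(stat_path), "%s/%s/stat", task_dir, ent->d_name);
        FILE *fp = fopen(stat_path, "r");
        if (!fp || !fgets(line, sizeof(line), fp)) {
            if (fp)
                fclose(fp);
            continue;
        }
        fclose(fp);

        /* The comm field may contain spaces, so parse from the last ')'.
         * After it, token 12 is utime, token 13 is stime and token 37 is the
         * CPU the thread last ran on (fields 14, 15 and 39 in proc(5)). */
        char *p = strrchr(line, ')');
        if (!p)
            continue;
        long utime = 0, stime = 0, cpu = -1;
        int idx = 0;
        for (char *tok = strtok(p + 1, " "); tok; tok = strtok(NULL, " ")) {
            ++idx;
            if (idx == 12)
                utime = atol(tok);
            else if (idx == 13)
                stime = atol(tok);
            else if (idx == 37) {
                cpu = atol(tok);
                break;
            }
        }
        printf("tid %-8s last_cpu %-3ld utime %-8ld stime %ld\n",
               ent->d_name, cpu, utime, stime);
    }
    closedir(dir);
    return 0;
}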

It would be great to get some perf benefit from this!

hazelnutsgz commented 3 years ago

Really appreciate the replies from you guys. I haven't conducted systematic benchmarking yet.

Actually, what I did (I know it's not convincing lol) was write a multi-threaded epoll proxy FROM SCRATCH (with SO_REUSEPORT enabled, each thread serving all potential ports), and I did observe a performance gap depending on CPU binding.
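Roughly, the skeleton of that experiment looked like the sketch below (simplified from memory, not the real code; the port, worker count, and the accept-and-close loop are placeholders, and the actual proxying logic is elided):

/* N worker threads, each with its own SO_REUSEPORT listener on the same port
 * and its own epoll loop, optionally pinned to one core each. */
#define _GNU_SOURCE
#include <netinet/in.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define PORT     9000
#define WORKERS  4
#define PIN_CPUS 1   /* set to 0 to compare against the unpinned case */

static void *worker(void *arg)
{
    long id = (long)arg;

    if (PIN_CPUS) {
        /* Bind this worker to one core (modulo the online CPU count). */
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)(id % sysconf(_SC_NPROCESSORS_ONLN)), &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    /* Each worker gets its own listening socket on the same port; the kernel
     * load-balances incoming connections across them via SO_REUSEPORT. */
    int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 || listen(lfd, 128) < 0) {
        perror("bind/listen");
        return NULL;
    }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            if (events[i].data.fd == lfd) {
                int cfd = accept(lfd, NULL, NULL);
                if (cfd >= 0)
                    close(cfd);   /* real upstream/downstream proxying elided */
            }
        }
    }
    return NULL;
}

int main(void)
{
    pthread_t tids[WORKERS];
    for (long i = 0; i < WORKERS; ++i)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (int i = 0; i < WORKERS; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}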

> My suggestion is to try to visualize the CPU usage per thread and see if you can find some behavior from the automatic thread/core bindings that you feel is worth overriding. Then do a ton of benchmarks (see https://github.com/envoyproxy/nighthawk) to prove you've made things better.

Makes a lot of sense; I'll take that advice.

rojkov commented 3 years ago

It would be interesting to see how much perf gain could be achieved with thread pinning vs. process pinning with taskset within the same NUMA node. If your workload is orchestrated with Kubernetes, the latter can be configured with a topology-aware kubelet.
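To make the two granularities concrete, something like the sketch below could be benchmarked both ways (assumptions: Linux, and that NUMA node 0 owns CPUs 0-7; in practice you'd read that from /sys/devices/system/node/node0/cpulist or use libnuma):

/* Sketch of process pinning vs. thread pinning; not Envoy code. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void *worker(void *arg)
{
#ifdef THREAD_PINNING
    /* (b) Thread pinning: each worker claims exactly one core. */
    long id = (long)arg;
    cpu_set_t one;
    CPU_ZERO(&one);
    CPU_SET((int)id, &one);
    pthread_setaffinity_np(pthread_self(), sizeof(one), &one);
#else
    (void)arg;
#endif
    /* ... event loop doing the actual work ... */
    return NULL;
}

int main(void)
{
    /* (a) Process pinning: restrict the main thread to node 0's cores before
     * any workers are spawned; the workers inherit the mask. */
    cpu_set_t node0;
    CPU_ZERO(&node0);
    for (int c = 0; c < 8; ++c)
        CPU_SET(c, &node0);
    sched_setaffinity(0, sizeof(node0), &node0);

    pthread_t tids[4];
    for (long i = 0; i < 4; ++i)
        pthread_create(&tids[i], NULL, worker, (void *)i);
    for (int i = 0; i < 4; ++i)
        pthread_join(tids[i], NULL);
    return 0;
}

Built with -DTHREAD_PINNING you get per-thread pinning; without it, the sched_setaffinity() call alone is roughly what launching the process under `taskset -c 0-7` would do from the outside.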

soulxu commented 3 years ago

@hazelnutsgz are you going to work on this? If not, I'm interested in taking a look at this issue to see what we can find.

samene commented 2 years ago

@soulxu I'm interested in this feature. Do you have anything forked?

soulxu commented 2 years ago

> @soulxu I'm interested in this feature. Do you have anything forked?

Please go ahead; I'm working on other stuff and really don't have the bandwidth to work on this for now.