Open tcompa opened 4 days ago
Takeaways from a deeper review:
A single trivial endpoint (/api/alive/) has been tested. It returns the pid of the process on which it is executed, which corresponds to a gunicorn worker.
As illustrated in the issues linked in the previous message, gunicorn does NOT perform any load balancing itself but leaves that responsibility to the operating system scheduler.
Testing 5000 calls with 12 workers on a local PC (Ubuntu 22), we observed the following:
PID Statistics:
PID: 44740, Count: 1058
PID: 44736, Count: 616
PID: 44739, Count: 870
PID: 44737, Count: 521
PID: 44732, Count: 337
PID: 44726, Count: 154
PID: 44741, Count: 412
PID: 44725, Count: 12
PID: 44734, Count: 9
PID: 44733, Count: 6
PID: 44724, Count: 3
PID: 44735, Count: 2
PID Statistics:
PID: 168883, Count: 401
PID: 168881, Count: 447
PID: 168888, Count: 421
PID: 168889, Count: 426
PID: 168887, Count: 426
PID: 168882, Count: 421
PID: 168886, Count: 399
PID: 168884, Count: 431
PID: 168880, Count: 448
PID: 168885, Count: 394
PID: 168890, Count: 398
PID: 168927, Count: 388
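A measurement of this kind can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used for the numbers above: `alive_app` is a hypothetical stand-in WSGI app (the real endpoint lives in the server application), `fetch_pid`, `pid_statistics` and `print_statistics` are made-up helper names, and the calls are issued sequentially, which the thread does not specify.

```python
import os
from collections import Counter
from urllib.request import urlopen


def alive_app(environ, start_response):
    """Hypothetical stand-in endpoint: reply with the PID of the serving worker."""
    body = str(os.getpid()).encode()
    headers = [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))]
    start_response("200 OK", headers)
    return [body]


def fetch_pid(url):
    """Call the endpoint once and return the worker PID it reports."""
    with urlopen(url) as response:
        return int(response.read().decode())


def pid_statistics(fetch, n_calls):
    """Tally how many of n_calls sequential requests each PID answered."""
    return Counter(fetch() for _ in range(n_calls))


def print_statistics(stats):
    """Print the tally in the same layout as the listings in this thread."""
    print("PID Statistics:")
    for pid, count in stats.items():
        print(f"PID: {pid}, Count: {count}")
```

Assuming the app is served by gunicorn on localhost:8000, something like `print_statistics(pid_statistics(lambda: fetch_pid("http://localhost:8000/api/alive/"), 5000))` would produce a listing in the format shown above.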
Further considerations must be made:
No Fix
PID Statistics:
PID: 124309, Count: 3
PID: 124305, Count: 9
PID: 124306, Count: 14
PID: 124304, Count: 9
PID: 124311, Count: 9
PID: 124314, Count: 16
PID: 124312, Count: 8
PID: 124308, Count: 8
PID: 124310, Count: 7
PID: 124313, Count: 10
PID: 124307, Count: 5
PID: 124303, Count: 2
With Fix
PID Statistics:
PID: 168880, Count: 14
PID: 168887, Count: 12
PID: 168882, Count: 5
PID: 168885, Count: 8
PID: 168888, Count: 10
PID: 168927, Count: 8
PID: 168890, Count: 6
PID: 168884, Count: 11
PID: 168889, Count: 6
PID: 168881, Count: 8
PID: 168883, Count: 6
PID: 168886, Count: 6
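To compare the two 100-request runs above with a single number, one option is the coefficient of variation of the per-worker counts (standard deviation divided by the mean, where 0 means a perfectly even split). The choice of metric is an assumption made here for illustration and was not part of the original analysis; the counts are copied from the "No Fix" and "With Fix" listings.

```python
from statistics import mean, pstdev

# Per-worker request counts copied from the "No Fix" and "With Fix" runs above.
NO_FIX = [3, 9, 14, 9, 9, 16, 8, 8, 7, 10, 5, 2]
WITH_FIX = [14, 12, 5, 8, 10, 8, 6, 11, 6, 8, 6, 6]


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean; lower means more even."""
    return pstdev(counts) / mean(counts)


for label, counts in (("No Fix", NO_FIX), ("With Fix", WITH_FIX)):
    print(
        f"{label}: total={sum(counts)}, min={min(counts)}, max={max(counts)}, "
        f"cv={coefficient_of_variation(counts):.2f}"
    )
```

With only 100 requests spread over 12 workers, each count is small, so part of the spread is expected to be noise regardless of the fix.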
More on this (@tcompa):
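In the current state (no patch), all workers listen on the same socket, inherited from the gunicorn master (note the identical DEVICE value 466962 in the lsof output below). In this situation, the OS hands connections to the different workers with non-homogeneous frequencies. By adding the gunicorn patch (see previous comment) and the SO_REUSEPORT option, each of the N workers gets its own socket, and all N sockets are bound to the same port (note the distinct DEVICE values on port 8000 below). In this situation, the OS contacts the different sockets in a seemingly random - and therefore homogeneous - manner.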
Example current state:
$ lsof -i | grep 8000
gunicorn 76496 tommaso 5u IPv4 466962 0t0 TCP localhost:8000 (LISTEN) # master gunicorn
gunicorn 76500 tommaso 5u IPv4 466962 0t0 TCP localhost:8000 (LISTEN)
gunicorn 76501 tommaso 5u IPv4 466962 0t0 TCP localhost:8000 (LISTEN)
gunicorn 76502 tommaso 5u IPv4 466962 0t0 TCP localhost:8000 (LISTEN)
gunicorn 76503 tommaso 5u IPv4 466962 0t0 TCP localhost:8000 (LISTEN)
Example with patch and --reuse-port:
$ lsof -i | grep 8000
gunicorn 75823 tommaso 6u IPv4 463247 0t0 TCP localhost:8000 (LISTEN)
gunicorn 75824 tommaso 5u IPv4 459620 0t0 TCP localhost:8000 (LISTEN)
gunicorn 75825 tommaso 5u IPv4 464099 0t0 TCP localhost:8000 (LISTEN)
gunicorn 75827 tommaso 5u IPv4 457392 0t0 TCP localhost:8000 (LISTEN)
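The socket-level effect of SO_REUSEPORT can be illustrated independently of gunicorn. The sketch below is not gunicorn code and not the patch; `reuseport_listener` and `open_listeners` are hypothetical helper names. It assumes a platform where `socket.SO_REUSEPORT` is available (for example a recent Linux) and opens several distinct listening sockets on one port, similar to what the patched lsof output shows.

```python
import socket


def reuseport_listener(host, port):
    """Open a listening TCP socket with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen()
    return sock


def open_listeners(n, host="127.0.0.1"):
    """Open n independent sockets that all listen on one OS-chosen port."""
    first = reuseport_listener(host, 0)  # port 0: let the OS pick a free port
    port = first.getsockname()[1]
    others = [reuseport_listener(host, port) for _ in range(n - 1)]
    return [first] + others
```

Without the SO_REUSEPORT option, the second `bind()` on the same address and port would normally fail with "Address already in use".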
Current TLDR:
--reuse-port together with the gunicorn patch from https://github.com/benoitc/gunicorn/pull/2938 leads to a more even distribution of requests across workers, even for a small number of requests.
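For reference, the same option can be set in a gunicorn configuration file instead of on the command line. This is a minimal sketch: the bind address and worker count simply mirror values mentioned in this thread, and the file does not by itself apply the patch from the pull request above.

```python
# gunicorn.conf.py (illustrative values, not the project's actual configuration)

# Address seen in the lsof output of this thread.
bind = "127.0.0.1:8000"

# Number of workers used in the tests above.
workers = 12

# Configuration-file equivalent of the --reuse-port command-line flag.
reuse_port = True
```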
For the moment this is just a placeholder with relevant links: