codypiersall / pynng

Python bindings for Nanomsg Next Generation.
https://pynng.readthedocs.io
MIT License
265 stars 58 forks source link

Sub/Pub socket high latency under low load #73

Open Zhou-Ao opened 3 years ago

Zhou-Ao commented 3 years ago

I am testing Nanomsg NNG PUB/SUB socket performance with pynng .

In the test program below, SUB socket is used as a server to receive messages and PUB socket is used as a client to receive messages. The content of a message is the simply time when it is sent by the PUB client. The SUB server measures the latency by comparing between the time when the message is received time and the time (which is the message sent time) inside the message.

For 1,000 messages sent over ~1 second, I find two issues from the performance test: 1) The latency is very high between PUB and SUB sockets in couple of milliseconds. However, under such low load, I would expect the latency to be in scale of microseconds.

2) The latency performance improves if I make the PUB client sleep in a very short period of time. I understand that sleeping would reduce the load. But I would expect it makes not much difference given the low load. I am surprised that the latency can shoot up to 100ms if there is no sleep.

I am not sure if this issue relates to python GIL. Will my main PUB client thread compete against NNG internal I/O thread for GIL without a sleep?

Any ideas will be welcome and appreciated.

The testing programs are as below:

def run_nng_sub():
    import numpy as np
    import time, pickle, datetime, threading
    from pynng import Sub0, Timeout        

    address = 'tcp://127.0.0.1:31313'
    latencies = []
    with Sub0(listen=address, recv_timeout=100) as sub:
        sub.subscribe(b'')
        time.sleep(0.05)

        while  True:
            try:
                b_datetime = sub.recv()
                received_time = datetime.datetime.utcnow()

                sent_time = pickle.loads(b_datetime)
                latency = received_time - sent_time
                latencies.append(latency)
                print(f'latency: {latency}')
            except Timeout:
                if len(latencies) > 1000:
                    break

    print(f'latency:\nmax {np.max(latencies)}\naverage {np.mean(latencies)}\nmedian {np.median(latencies)}')

def run_nng_pub():
    import time, pickle, datetime
    from pynng import Pub0

    address = 'tcp://127.0.0.1:31313'
    with Pub0(dial=address) as pub:
        time.sleep(0.05)

        for i in range(1000):
            pub.send(pickle.dumps(datetime.datetime.utcnow()))
            # time.sleep(1e-5)

And the performance results on my PC are below:

No sleep
latency:
max 0:00:00.165002
average 0:00:00.025910
median 0:00:00.007000

time.sleep(1e-3)
latency:
max 0:00:00.045002
average 0:00:00.002442
median 0:00:00

time.sleep(1e-4)
latency:
max 0:00:00.051001
average 0:00:00.006631
median 0:00:00.001998

time.sleep(1e-5)
latency:
max 0:00:00.046003
average 0:00:00.008652
median 0:00:00.004000
codypiersall commented 3 years ago

(I just edited your comment to get syntax highlighting, didn't change content at all)

codypiersall commented 3 years ago

Thanks for opening the issue! It's very nice to have the sample program, too.

I'm not sure exactly what's going on, but I think it's some unfortunate interaction between pub/sub semantics and the GIL. With no sleep, the sending thread never gives up the GIL, and that means the receiving thread never has a chance to wake up and receive it. The receiving socket will then have a delayed message by the time the sending thread yields the GIL, after dropping a bunch of messages. On my computer, without a sleep, about 93% of the messages are dropped.

Whenever there is no sleep at all, the latency on my computer is even worse than on yours: a mean of 0.5 seconds, and max of 1.17 seconds.