eclipse / paho.golang

Go libraries
Other
327 stars 92 forks source link

New pinger implementation #222

Closed vishnureddy17 closed 7 months ago

vishnureddy17 commented 7 months ago

close #137

vishnureddy17 commented 7 months ago

Looks like this change uncovered a potential deadlock in client_test.go. This may be an issue with paho itself or it might just be the way the test is written. I'll make a write-up explaining the issue soon.

vishnureddy17 commented 7 months ago

Ok, so the issue is really the way the test server is written. I'll work on including a fix to that in this PR as well.

The test server does packet reads and packet writes in the same thread/goroutine. Furthermore, the conn created by net.Pipe() is unbuffered. This can result in a deadlock when the test server is trying to write a packet and the incoming() goroutine in the client is also trying to write a packet. Both the server and the client are trying to write and neither side is reading, resulting in a deadlock, as the conn is unbuffered.

As an example, the logs below show that the test server is stuck on writing a PINGRESP and the incoming() goroutine is stuck on writing a PUBREL:

=== RUN   TestClientPublishQoS2
    test.go:38: 2024-01-09T08:18:02.521885195-05:00 ClientPublishQoS2: no session present - cleaning session state
    test.go:38: 2024-01-09T08:18:02.521955196-05:00 ClientPublishQoS2: State.clean() called
    test.go:48: 2024-01-09T08:18:02.521978196-05:00 ClientPublishQoS2:0 inflight transactions upon connection
    test.go:48: 2024-01-09T08:18:02.522003096-05:00 ClientPublishQoS2:retransmitting 0 messages
    test.go:48: 2024-01-09T08:18:02.522033496-05:00 ClientPublishQoS2:sending message to test/2
    test.go:38: 2024-01-09T08:18:02.522061596-05:00 ClientPublishQoS2: sending QoS12 message
    test.go:38: 2024-01-09T08:18:02.522107896-05:00 TestServer: test server received a control packet: PUBLISH
    test.go:38: 2024-01-09T08:18:02.522130696-05:00 TestServer: received PUBLISH: PacketID:1 QOS:2 Topic:test/2 Duplicate:false Retain:false Payload:
        test payload
        Properties

    test.go:38: 2024-01-09T08:18:02.522193896-05:00 TestServer: sending pubrec
    test.go:38: 2024-01-09T08:18:02.522250896-05:00 TestServer: sent pubrec
    test.go:38: 2024-01-09T08:18:02.522141096-05:00 ClientPublishQoS2: sendPingreq() sending PINGREQ packet
    test.go:38: 2024-01-09T08:18:02.522325796-05:00 TestServer: test server received a control packet: PINGREQ
    test.go:38: 2024-01-09T08:18:02.522334796-05:00 TestServer: test server sending pingresp
    test.go:38: 2024-01-09T08:18:02.522340496-05:00 ClientPublishQoS2: PacketSent() resetting timer
    test.go:38: 2024-01-09T08:18:02.522410996-05:00 ClientPublishQoS2: received PUBREC packet with id  1
    test.go:38: 2024-01-09T08:18:02.522443396-05:00 ClientPublishQoS2: sending PUBREL for 1
    test.go:38: 2024-01-09T08:18:02.522470896-05:00 ClientPublishQoS2: sendPingreq() sent PINGREQ packet, waiting for PINGRESP
    test.go:38: 2024-01-09T08:18:02.522512396-05:00 ClientPublishQoS2: PacketSent() resetting timer
panic: test timed out after 30s
running tests:
        TestClientPublishQoS2 (30s)

goroutine 50 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/go/src/testing/testing.go:2259 +0x3b9
created by time.goFunc
        /usr/local/go/src/time/sleep.go:176 +0x2d

goroutine 1 [chan receive]:
testing.(*T).Run(0xc00008aea0, {0x75b57f?, 0x51797c?}, 0x788ea0)
        /usr/local/go/src/testing/testing.go:1649 +0x3c8
testing.runTests.func1(0x9df720?)
        /usr/local/go/src/testing/testing.go:2054 +0x3e
testing.tRunner(0xc00008aea0, 0xc00014dc48)
        /usr/local/go/src/testing/testing.go:1595 +0xff
testing.runTests(0xc0000aae60?, {0x9d9a40, 0x1a, 0x1a}, {0x41829f?, 0xc00014dd08?, 0x9dee40?})
        /usr/local/go/src/testing/testing.go:2052 +0x445
testing.(*M).Run(0xc0000aae60)
        /usr/local/go/src/testing/testing.go:1925 +0x636
main.main()
        _testmain.go:97 +0x19c

goroutine 34 [select]:
github.com/eclipse/paho.golang/paho.(*Client).publishQoS12(0xc000160ea0, {0x7de7b8, 0xa0d560}, 0xc0000b6e80, {0x1?})
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client.go:871 +0x345
github.com/eclipse/paho.golang/paho.(*Client).PublishWithOptions(0xc000160ea0, {0x7de7b8, 0xa0d560}, 0xc0000b6e40, {0xc0000a0d60?})
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client.go:840 +0x22e
github.com/eclipse/paho.golang/paho.(*Client).Publish(...)
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client.go:784
github.com/eclipse/paho.golang/paho.TestClientPublishQoS2(0xc000160d00)
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client_test.go:308 +0x609
testing.tRunner(0xc000160d00, 0x788ea0)
        /usr/local/go/src/testing/testing.go:1595 +0xff
created by testing.(*T).Run in goroutine 1
        /usr/local/go/src/testing/testing.go:1648 +0x3ad

goroutine 35 [select]:
net.(*pipe).write(0xc0000e0480, {0xc0000da940, 0x2, 0x40})
        /usr/local/go/src/net/pipe.go:194 +0x305
net.(*pipe).Write(0x7f8dc0399d90?, {0xc0000da940?, 0xc0000e0480?, 0x9df720?})
        /usr/local/go/src/net/pipe.go:174 +0x1c
net.(*Buffers).WriteTo(0xc0000a46c0, {0x7f8dc03b4358?, 0xc0000e0480?})
        /usr/local/go/src/net/net.go:726 +0xb9
github.com/eclipse/paho.golang/packets.(*ControlPacket).WriteTo(0xc0000d2b00, {0x7f8dc03b4358, 0xc0000e0480})
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/packets/packets.go:345 +0x346
github.com/eclipse/paho.golang/internal/basictestserver.(*TestServer).Run(0xc0000e0400)
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/internal/basictestserver/server.go:166 +0x145d
created by github.com/eclipse/paho.golang/paho.TestClientPublishQoS2 in goroutine 34
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client_test.go:278 +0x2a5

goroutine 36 [select]:
net.(*pipe).write(0xc0000e0500, {0xc0000daa80, 0x2, 0x40})
        /usr/local/go/src/net/pipe.go:194 +0x305
net.(*pipe).Write(0x6eb420?, {0xc0000daa80?, 0xc0000e0500?, 0x9df720?})
        /usr/local/go/src/net/pipe.go:174 +0x1c
net.(*Buffers).WriteTo(0xc0000a4738, {0x7f8dc03b4358?, 0xc0000e0500?})
        /usr/local/go/src/net/net.go:726 +0xb9
github.com/eclipse/paho.golang/packets.(*ControlPacket).WriteTo(0xc000194cd0, {0x7f8dc03b4358, 0xc0000e0500})
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/packets/packets.go:345 +0x346
github.com/eclipse/paho.golang/packets.(*Pubrel).WriteTo(0xc0000a4510?, {0x7f8dc03b4358?, 0xc0000e0500?})
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/packets/pubrel.go:76 +0x56
github.com/eclipse/paho.golang/paho/session/state.(*State).PacketReceived(0xc0000c62c0, 0xc0000d2900, 0x0?)
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/session/state/state.go:537 +0x62c
github.com/eclipse/paho.golang/paho.(*Client).incoming(0xc000160ea0)
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client.go:507 +0x2e8
github.com/eclipse/paho.golang/paho.TestClientPublishQoS2.func1()
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client_test.go:294 +0x52
created by github.com/eclipse/paho.golang/paho.TestClientPublishQoS2 in goroutine 34
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client_test.go:292 +0x48d

goroutine 37 [chan receive]:
github.com/eclipse/paho.golang/paho.(*DefaultPinger).Run(0xc000144690, {0x7dffd8?, 0xc0000e0500?}, 0x0?)
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/pinger.go:78 +0x196
github.com/eclipse/paho.golang/paho.TestClientPublishQoS2.func2()
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client_test.go:298 +0x6c
created by github.com/eclipse/paho.golang/paho.TestClientPublishQoS2 in goroutine 34
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/client_test.go:296 +0x4cf

goroutine 5 [select]:
github.com/eclipse/paho.golang/paho.(*DefaultPinger).sendPingreq(0xc000144690)
        /home/vishnureddy/projects/paho-pinger-fix/paho.golang/paho/pinger.go:136 +0x2c5
created by time.goFunc
        /usr/local/go/src/time/sleep.go:176 +0x2d
exit status 2
FAIL    github.com/eclipse/paho.golang/paho     30.006s

Open question: should the paho client avoid doing writes in the incoming() goroutine? This would potentially be more performant (in the case where the connection is full-duplex) and safer, but the current way the client is implemented is unlikely to cause major issues.

cc @MattBrittan

MattBrittan commented 7 months ago

Note that, while basictestserver uses net.Pipe, testserver uses a different technique to avoid similar issues to the one you found (spent ages tracing an issue...). Might be worth moving basictestserver over to the alternate approach.

Will try to have a look at the rest of your comments today (but realistically it may be later in the week).

MattBrittan commented 7 months ago

Open question: should the paho client avoid doing writes in the incoming() goroutine? This would potentially be more performant (in the case where the connection is full-duplex) and safer, but the current way the client is implemented is unlikely to cause major issues.

I've not really seen any issues with this (other than those caused by net.Pipe :-) ). My feeling is that the simplicity of the current approach is an important consideration; separating writes would complicate things a bit so I think we would need to establish that it does improve performance/safety (for standard net.Conn I believe the current approach is safe but may well have missed something). I'll do some more on this project tomorrow (have been doing some tidying on the python project) as I'd like to have a release candidate by the end of the week (so I can run it myself for a few weeks before releasing it). I think we are just about there...

One other thing - I'll commit PR #221 tomorrow; this will lead to the need for some changes in your PR (alternatively I can commit your PR first if it's ready). #221 comes out of a few of the discussions with you; having the config public makes it far too easy to introduce race conditions...

vishnureddy17 commented 7 months ago

Converted to draft because I need to deal with the merge conflict.

vishnureddy17 commented 7 months ago

Sorry for the delay, this is now ready for review @MattBrittan

vishnureddy17 commented 7 months ago

An important thing to note is that this PR doesn't modify the SessionManager to be able to call PacketSent() when it does a write.