awslabs / aws-c-mqtt

C99 implementation of the MQTT 3.1.1 specification.
Apache License 2.0
93 stars 29 forks source link

mqtt + mqtt5 tests together sometimes "hang" forever #250

Closed thomas-roos closed 1 year ago

thomas-roos commented 1 year ago

Found some strange issue causing the aws-c-mqtt tests to hang. It occurs on randomly on all architectures we (meta-aws) test. Seems this happens a lot with a CodeBuild instance, not that often on my EC2. I only can say that this hang is infinitely (stopped by 8h from CodeBuild). This script does run the tests: https://github.com/aws4embeddedlinux/meta-aws/blob/999c6b8e077d6e0bc104714e92a437e82187641f/recipes-sdk/aws-c-mqtt/aws-c-mqtt/run-ptest#L288 This test seems to be hanging in this case: mqtt_connection_unsub_timeout I have just this attached log showing some traces, making me think in the direction of a hanging wait something on a socket?!

version 0.8.4.

aws-c-mqtt_ptest.log

TingDaoK commented 1 year ago

You mean your tests does run the MQTT_TESTS that commented out?

But the log you attached are ending with all the mqtt5 client service tasks,

[mqtt5-client] - id=0x56fef0: scheduled service task for time 171967973840                                                                                                                                                                                 |
[event-loop] - id=0x56c118: activity on fd 17, invoking handler.                                                                                                                                                                                           |
[event-loop] - id=0x56d7f0: detected more scheduled tasks with the next occurring at 0, using timeout of 0.                                                                                                                                                |
[event-loop] - id=0x56d7f0: waiting for a maximum of 0 ms                                                                                                                                                                                                  |
[event-loop] - id=0x56d7f0: wake up with 0 events to process.                                                                                                                                                                                              |
[socket] - id=0xb5500a98 fd=17: is readable                                                                                                                                                                                                                |
[event-loop] - id=0x56d7f0: running scheduled tasks.                                                                                                                                                                                                       |
[socket-handler] - id=0xb5511c40: socket is now readable                                                                                                                                                                                                   |
[task-scheduler] - id=0xb4b00ea8: Running socket_written_task task with <Running> status                                                                                                                                                                   |
[socket-handler] - id=0xb5511c40: invoking read. Downstream window 4294967295, max_to_read 16384                                                                                                                                                           |
[socket-handler] - static: write of size 4096, completed on channel 0xb4b01ac8                                                                                                                                                                             |
[channel] - id=0xb5501140: acquired message 0xb5509620 of capacity 16384 from pool 0xb5501348. Requested size was 16384                                                                                                                                    |
[event-loop] - id=0x56d7f0: cancelling task 0x56ff08                                                                                                                                                                                                       |
[task-scheduler] - id=0x56ff08: Running Mqtt5Service task with <Canceled> status                                                                                                                                                                           |
[mqtt5-client] - id=0x56fef0: cancelling previously scheduled service task                                                                                                                                                                                 |
[event-loop] - id=0x56d7f0: scheduling task 0x56ff08 in-thread for timestamp 144325383731                                                                                                                                                                  |
[socket] - id=0xb5500a98 fd=17: read of 4096                                                                                                                                                                                                               |
[task-scheduler] - id=0x56ff08: Scheduling Mqtt5Service task for future execution at time 144325383731                                                                                                                                                     |
[socket-handler] - id=0xb5511c40: read 4096 from socket                                                                                                                                                                                                    |
[mqtt5-client] - id=0x56fef0: scheduled service task for time 144325383731                                                                                                                                                                                 |
[channel] - id=0xb5501140: sending read message of size 4096, from slot 0xb5505598 to slot 0xb5505570 with handler 0xb5511a78.                                                                                                                             |
[event-loop] - id=0x56d7f0: detected more scheduled tasks with the next occurring at 7060819, using timeout of 7.                                                                                                                                          |
[channel] - id=0xb5501140: acquired message 0xb5509620 of capacity 12288 from pool 0xb5501348. Requested size was 12288                                                                                                                                    |
[event-loop] - id=0x56d7f0: waiting for a maximum of 7 ms                                                                                                                                                                                                  |
[socket] - id=0xb5500a98 fd=17: read of -1                                                                                                                                                                                                                 |
[socket] - id=0xb5500a98 fd=17: read would block                                                                                                                                                                                                           |
[event-loop] - id=0x56d7f0: wake up with 1 events to process.                                                                                                                                                                                              |
[socket-handler] - id=0xb55055a8: total read on this tick 4096                                                                                                                                                                                             |
[event-loop] - id=0x56d7f0: activity on fd 16, invoking handler.                                                                                                                                                                                           |
[socket-handler] - id=0xb5511c40: out of data to read on socket. Waiting on event-loop notification.      

This test seems to be hanging in this case: mqtt_connection_unsub_timeout

Why do you think it hangs from mqtt_connection_unsub_timeout?

Can you provide more details?

thomas-roos commented 1 year ago

Thank you to look into this - retried and found when disabling those tests it does not hang anymore - so I changed the title a bit: https://github.com/aws4embeddedlinux/meta-aws/blob/812452dcaa4b5016ad66e117bc3775a3ab692659/recipes-sdk/aws-c-mqtt/aws-c-mqtt/run-ptest#L261-L287

thomas-roos commented 1 year ago

mqtt_connection_unsub_timeout was my guess because it was the last I saw in the logs, sadly I do not have them anymore - the 10000 lines limit exporting logs... But what I'm basically do in this test is running your unit tests and if they are supposed to run all after one another, it seems that there is either a bug in the test mock code or a bug in aws-c-mqtt itself.

thomas-roos commented 1 year ago

did spend some time to reproduce the bug - and I have something. What I did is to run the tests in a while true loop and check if it hangs - after an hour it hangs on my EC2. Should be easy to reproduce with the script (all tests enabled), and this command:

while true; do ./run-ptest > log.log  2>&1; done &
tail -f -s 10 log.log 

Here is the logfile. mqtt-hang.log

And top shows that it hangs on this test: mqtt_connection_not_resend_packets_on_healthy_connection

bretambrose commented 1 year ago

The underlying insta-reconnect is still possible and fixing it with a refactoring of the 311 client is still on our roadmap. But in the meantime, we discovered an error in the 311 client's reconnect timing that was causing instant reconnects instead of minimum delay reconnects. In fixing that error, the CI issues look to have disappeared. We will make the long-term fix/refactor as soon as we can, but for now, I believe the direct issue is addressed.