ARMmbed / mbed-os

Arm Mbed OS is a platform operating system designed for the internet of things
https://mbed.com
Other
4.67k stars 2.98k forks source link

Ublox odin - Wi-Fi interface fails endurance testing/jams #12497

Closed JanneKiiskila closed 4 years ago

JanneKiiskila commented 4 years ago

Description of defect

Try running mbed-os-example-pelion with UBLOX_EVK_ODIN_W2 target, leave it running for a longer time. Once tick count on /3200/0/5501 reaches around 1900-2200 - the connection will fail and all DNS queries against that interface will fail. Seems the Wi-Fi stack gets somehow stuck and it never recovers.

With K66F/Ethernet - you can plug out the whole cable for a long time, then plug it back in and everything recovers nicely.

Target(s) affected by this defect ?

UBLOX_EVK_ODIN_W2

Toolchain(s) (name and version) displaying this defect ?

gcc-arm-none-eabi-9-2019-q4-major

What version of Mbed-os are you using (tag or sha) ?

mbed-os.5.15.1

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

mbed-cli 1.10.1

How is this defect reproduced ?

  1. Use the repository mbed-os-example-pelion or clone it, if you do not have it yet.
  2. Get your developer certificate from Pelion portal.
  3. Configure Wi-Fi credentials to mbed_app.json, optinally turn traces on.
  4. Compile the target - `mbed compile -t GCC_ARM - m UBLOX_EVK_ODIN_W2'
  5. Flash in the binary
  6. Reset board and start following the progress.

-> At some point of time you will have network error/lose Wi-Fi connectivity.

JanneKiiskila commented 4 years ago

Attn: @ARMmbed/team-ublox @MarceloSalazar

JanneKiiskila commented 4 years ago

Log example below;

Counter 283
[INFO][COAP]: ======== Incoming CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_EMPTY
[INFO][COAP]: Type:     COAP_MSG_TYPE_ACKNOWLEDGEMENT
[INFO][COAP]: Id:       25
[INFO][COAP]: Payload len:  0
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::send_resource_observation - uri 3200/0/5501
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
Counter 284
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 285
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 286
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 287
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 288
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 289
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 290
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 291
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 292
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 293
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 294
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 295
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 296
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 297
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 298
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 299
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 300
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 301
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 302
[INFO][mClt]: M2MNsdlInterface::received_from_server_callback - message sending failed, id 26
[ERR ][mClt]: M2MInterfaceImpl::registration_error code [6]
[ERR ][mClt]: M2MInterfaceImpl::socket_error: (-8), retry (1), reconnecting (0), reconnection_state (1)
[INFO][mClt]: M2MInterfaceImpl::create_random_initial_reconnection_time() initial random time 34

[INFO][mClt]: M2MInterfaceImpl::socket_error - reconnecting in 34(s)
[ERR ][mClt]: ConnectorClient::error() - error: 6
[INFO][fcc ]: storage_pal_sst.c:602:storage_item_get_data:===> item name = mbed.UseBootstrap len = 17, data max size = 4
[INFO][fcc ]: storage_pal_sst.c:640:storage_item_get_data:<=== 
[ERR ][mClt]: ServiceClient::connector_error() error 6
[ERR ][mClt]: MbedCloudClient::error code (6)
client_error(6) -> Client in reconnection mode NetworkError
[ERR ][mClt]: ServiceClient::state_failure()
[INFO][mClt]: MbedCloudClient::complete status (-1)

I.e. the notification sending fails - client tries to reconnect, but it just does not help and connection never recovers. Counter was only 283 in this case, so it happened quite quicly that time (I had a total of 4 test runs over the weekend, every one of them failed).

For reference: NXP K66F worked flawlessly the whole weekend, even with the Ethernet cable intentionally plugged out for a while.

JanneKiiskila commented 4 years ago

Same story with mbed-cloud-client-example, log from that, too.

Counter 283
[INFO][COAP]: ======== Incoming CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_EMPTY
[INFO][COAP]: Type:     COAP_MSG_TYPE_ACKNOWLEDGEMENT
[INFO][COAP]: Id:       25
[INFO][COAP]: Payload len:  0
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::send_resource_observation - uri 3200/0/5501
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
Counter 284
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 285
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 286
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 287
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 288
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 289
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 290
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 291
[INFO][COAP]: ======== Outgoing CoAP package ========
[INFO][COAP]: Status:       COAP_STATUS_OK
[INFO][COAP]: Code:     COAP_MSG_CODE_RESPONSE_CONTENT
[INFO][COAP]: Type:     COAP_MSG_TYPE_CONFIRMABLE
[INFO][COAP]: Id:       26
[INFO][COAP]: Token:        2d:a1:0f:0b
[INFO][COAP]: Content-type: 0
[INFO][COAP]: Payload len:  3
[INFO][COAP]: Max-age:      0
[INFO][COAP]: Observe:      247
[INFO][COAP]: ======== End of CoAP package ========
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 292
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 293
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 294
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 295
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 296
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 297
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 298
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 299
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 300
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 301
[INFO][mClt]: M2MNsdlInterface::observation_to_be_sent() - send already in progress
Counter 302
[INFO][mClt]: M2MNsdlInterface::received_from_server_callback - message sending failed, id 26
[ERR ][mClt]: M2MInterfaceImpl::registration_error code [6]
[ERR ][mClt]: M2MInterfaceImpl::socket_error: (-8), retry (1), reconnecting (0), reconnection_state (1)
[INFO][mClt]: M2MInterfaceImpl::create_random_initial_reconnection_time() initial random time 34

[INFO][mClt]: M2MInterfaceImpl::socket_error - reconnecting in 34(s)
[ERR ][mClt]: ConnectorClient::error() - error: 6
[INFO][fcc ]: storage_pal_sst.c:602:storage_item_get_data:===> item name = mbed.UseBootstrap len = 17, data max size = 4
[INFO][fcc ]: storage_pal_sst.c:640:storage_item_get_data:<=== 
[ERR ][mClt]: ServiceClient::connector_error() error 6
[ERR ][mClt]: MbedCloudClient::error code (6)
client_error(6) -> Client in reconnection mode NetworkError

Counter is at that same 283, so that seems like too much of a co-incidence?

ciarmcom commented 4 years ago

Internal Jira reference: https://jira.arm.com/browse/MBOTRIAGE-2559

se7ensong commented 4 years ago

I have seen this issue too. Thank you @JanneKiiskila goes into the detail.

I actually have seen this issue on mebd-os-5.14.2 and it seems that ARP packet cannot be received after running for a while.

By connecting the ublox to the wifi hotspot on my laptop and use Wireshark for packet capture, I traced things back to the Ublox driver. However, since it's closed source, I cannot proceed anymore. The LWIP trace shows that Ublox requires ARP at 270 seconds (30 seconds before ARP table expires) and it usually gets a response to renew the table. However, after a while, the packet capture shows the gateway(my laptop) provided an ARP response but it wasn't received by the LWIP stack although SSDP packet was received without any issue. Hence, no packet can be sent.

I temporarily goes around the issue by enabling static ARP but would like to see a much more reliable ublox driver.

JanneKiiskila commented 4 years ago

@ARMmbed/team-ublox - please comment / analyse this issue.

aqib-ublox commented 4 years ago

This issue is just seems duplicate of https://github.com/ARMmbed/mbed-os/issues/10815

we would resume analysis on it and would see how we may fix in driver.

for wifi we have analyzed and turning off DHCP and switching to static IP's avoid a crash.

MarceloSalazar commented 4 years ago

Closing as it's a duplicate of https://github.com/ARMmbed/mbed-os/issues/10815

JanneKiiskila commented 4 years ago

This is not a crash, this is a jam. So, I don't see the problem symptom being the same necessarily - there is no Mbed OS crash.

JanneKiiskila commented 4 years ago

Reproducing this issue not require any noisy Wi-Fi environment either, just enough time (2-3 hours).

MarceloSalazar commented 4 years ago

@JanneKiiskila thanks for clarifying this. I'm reopening for @ARMmbed/team-ublox to investigate if they can.

aqib-ublox commented 4 years ago

@JanneKiiskila Are you getting device running into JAM condition without DHCP if practice for 2-3 hours?

JanneKiiskila commented 4 years ago

I have not tried with static IP addresses, it's not a realistic scenario for many setups.