NordicSemiconductor / Android-nRF-Mesh-Library

The Bluetooth Mesh Provisioner and Configurator library.
https://www.nordicsemi.com/
BSD 3-Clause "New" or "Revised" License
406 stars, 174 forks

Segmentation feedback of acknowledged msg is mostly failing when sent to non-proxy device #426

Closed ghost closed 1 year ago

ghost commented 3 years ago

Describe the bug: As the title states, when sending segmented acknowledged messages to non-proxy devices, the library too often fails to reconstruct the Status message.

To Reproduce: Steps to reproduce the behavior:

  1. Provision at least 2 nodes
  2. Connect to proxy
  3. Send CompositionDataGet, ConfigModelPublicationSet, or any message that results in a segmented answer, to a device that is not the proxy
  4. Observe the library failing to construct the Status message: it either fails completely or succeeds only once the device has retransmitted

Expected behavior: The library is able to construct the Status as soon as all feedback segments have been received (e.g. SEG 0 == SEG N), and the non-proxy device should not have to retransmit at this point (?).
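
To illustrate the expected behavior described above, here is a minimal Python sketch of tracking received segments with a block-acknowledgement bitmask. It is not the library's actual code: the class `SegmentReassembler` and its methods are hypothetical names. The only protocol facts assumed are that each segment carries a segment offset (SegO) and the number of the last segment (SegN), and that the block acknowledgement has one bit per segment.

```python
class SegmentReassembler:
    """Hypothetical helper: collects the segments of one message."""

    def __init__(self, seg_n: int):
        # SegN is the index of the last segment, so there are SegN + 1 segments.
        if not 0 <= seg_n <= 31:
            raise ValueError("SegN must be in 0..31")
        self.seg_n = seg_n
        self.segments = {}  # SegO -> payload bytes

    def add(self, seg_o: int, payload: bytes) -> bool:
        """Store one segment and return True once all segments are present.

        A duplicate segment (for example a retransmission) keeps the first
        payload that was stored for that offset.
        """
        if not 0 <= seg_o <= self.seg_n:
            raise ValueError("SegO out of range")
        self.segments.setdefault(seg_o, payload)
        return self.is_complete()

    def block_ack(self) -> int:
        """Bitmask with bit i set when segment i has been received."""
        mask = 0
        for seg_o in self.segments:
            mask |= 1 << seg_o
        return mask

    def is_complete(self) -> bool:
        # Complete when bits 0..SegN are all set.
        return self.block_ack() == (1 << (self.seg_n + 1)) - 1

    def message(self) -> bytes:
        """Concatenate the payloads in segment order."""
        if not self.is_complete():
            raise RuntimeError("reassembly not complete")
        return b"".join(self.segments[i] for i in range(self.seg_n + 1))
```

In this sketch the Status payload becomes available on the call to `add` that fills the last missing bit, regardless of the order in which the segments arrive.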

Platform details:

Logs / Screenshots: I spent some time building enough material for you. I hope it will be enough to spot any bug, or to give us a hint as to why this behavior occurs.

Log files suffixed with OK --> expected behavior.
Log files suffixed with KO --> unexpected behavior: all segments are received from the device but no Status is constructed. We do get it once the device retransmits some segments.
Log files suffixed with full_KO --> unexpected behavior: all segments are received from the device but no Status is constructed. The device seems to have received every acknowledgement, as it is not retransmitting.

Unfortunately, this behavior is even more problematic in our commercial application, because we have implemented a FIFO with an auto-retry mechanism for such messages. When we send a command from the FIFO, we wait for feedback and retry every 2.5 sec (up to 2 retries). The success rate is below 15% because of this problem, which makes the app unusable for a lot of our features. If we increase the retry timer to 5-6 sec, the success rate rises to ~60%, but that is still not enough and is not acceptable for our customers. We could force a proxy connection to raise the success rate to 95%, but that is not acceptable for most features that need these segmented feedbacks, and moreover it negates the benefits of the mesh network.
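
To make the retry mechanism described above concrete, here is a minimal Python sketch of a FIFO with auto-retry. It is not the application's real code: `RetryQueue`, `send` and `wait_for_status` are hypothetical names, and the transport is injected so the logic can be exercised without a mesh network. It assumes one initial send plus two retries, each waiting 2.5 seconds for the Status.

```python
from collections import deque
from typing import Any, Callable, Optional


class RetryQueue:
    """Hypothetical FIFO that resends a message until a Status arrives."""

    def __init__(
        self,
        send: Callable[[Any], None],
        wait_for_status: Callable[[Any, float], Optional[Any]],
        retry_interval_s: float = 2.5,
        max_retries: int = 2,
    ):
        self._fifo = deque()
        self._send = send
        # wait_for_status(message, timeout) returns the Status, or None on timeout.
        self._wait_for_status = wait_for_status
        self.retry_interval_s = retry_interval_s
        self.max_retries = max_retries

    def enqueue(self, message: Any) -> None:
        self._fifo.append(message)

    def process_next(self) -> Optional[Any]:
        """Send the oldest message; return its Status, or None if every attempt timed out."""
        if not self._fifo:
            return None
        message = self._fifo.popleft()
        # One initial attempt plus max_retries retries.
        for _ in range(1 + self.max_retries):
            self._send(message)
            status = self._wait_for_status(message, self.retry_interval_s)
            if status is not None:
                return status
        return None
```

With these assumed values a message is abandoned after 3 x 2.5 s = 7.5 s, so a Status that is only reassembled after a later retransmission from the device can arrive too late to be counted as a success.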

roshanrajaratnam commented 3 years ago

Hi Rom4in, thanks for the detailed report. I will try to take a look at this to see where the issue might be. In the meantime, could you check if you are having the same problem on version 2.4.1?

ghost commented 3 years ago

Not sure about 2.4.1, but we are using 2.3.0 in our mainline app. And yes, no problems on this version.

ghost commented 3 years ago

I tried to compare the code between 3.1.5 and 2.3.0, but there are too many changes 😅 And I don't really know where to look in the codebase.

ghost commented 3 years ago

Hello @roshanrajaratnam :) any updates ?

roshanrajaratnam commented 3 years ago

Hey, sorry, I've been busy with some other tasks. Unfortunately I won't be back at work until Tuesday due to national holidays in Norway. I'll try to take a look at this next week!

roshanrajaratnam commented 3 years ago

@R0m4in-dooz I have had some training programs yesterday and today, so I have not yet had time to look into your issue. I will try my best to have a look at it this week.

ghost commented 3 years ago

Thanks @roshanrajaratnam, if I may help, with other logs or giving you our custom msg, feel free to ask :)

roshanrajaratnam commented 3 years ago

Could you also confirm if this is happening with the nRF SDK examples as well?

ghost commented 3 years ago

If you mean with the nrf example app, yes it has the same behavior: sometimes it's ok, sometimes it's ok after a retransmission, sometimes it never constructs the feedback

roshanrajaratnam commented 3 years ago

Yes, the example app with the nRF SDK Mesh light/switch examples?

ghost commented 3 years ago

🤔 I only use our firmware, based on nRF Mesh SDK 4.0

ghost commented 3 years ago

And I reproduced with publication set msg, with provisioned nodes, no configuration

roshanrajaratnam commented 3 years ago

Hi @R0m4in-dooz, I finally found some time to look at your issue. However, looking at your logs and also after some debugging, I notice that "Message reassembly may not be completed yet!" is logged only when the message is not complete or when the reassembly might still be in progress. If you look closely, whenever this is logged, not all expected segments have been received.

ghost commented 3 years ago

Yeh, sometimes I expect this log but:

I notice that there are a lot of

E/DefaultNoOperationMessageState(23796): Decryption failed in NetworkLayer : mac check in CCM failed
D/DoozMeshStatusCallbacks(23796): onMessageDecryptionFailed

maybe it's the cause of the issue ?

roshanrajaratnam commented 3 years ago

In a mesh network you may receive messages that are not directed to you because they are being relayed; in the network layer we can find out whether a message was directed to us. There can be instances where you receive such messages, so we drop them. So the segments you are referring to are the ones received by the device.
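
The filtering described here can be sketched as a destination-address check performed once a network PDU has been decrypted. This is a simplified Python illustration, not the library's implementation: the function name and its parameters are hypothetical, and it only covers the node's own unicast element addresses, subscribed group addresses, and the all-nodes address 0xFFFF (virtual addresses are left out).

```python
ALL_NODES_ADDRESS = 0xFFFF  # fixed group address that targets every node


def is_addressed_to_us(dst, own_unicast, element_count, subscriptions):
    """Return True if a message with destination `dst` should be processed locally.

    dst           -- destination address taken from the decrypted network PDU
    own_unicast   -- unicast address of our first element
    element_count -- number of elements (they use consecutive unicast addresses)
    subscriptions -- set of group addresses we are subscribed to
    """
    # One of our own element addresses.
    if own_unicast <= dst < own_unicast + element_count:
        return True
    # The all-nodes address or a group we subscribed to.
    return dst == ALL_NODES_ADDRESS or dst in subscriptions


# Example: a node at 0x0001 with one element, subscribed to group 0xC000.
print(is_addressed_to_us(0x0002, 0x0001, 1, {0xC000}))  # relayed traffic for another node
```

A message for which this check is false was meant for some other node, which is the case where a receiver would simply drop it.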

The decryption failure is not something I am able to reproduce using our sample app at all. Have you made any changes? If the decryption fails, then yes, the LowerTransportLayer will eventually time out.

ghost commented 3 years ago

> In a mesh network you may receive messages that are not directed to you because they are being relayed; in the network layer we can find out whether a message was directed to us. There can be instances where you receive such messages, so we drop them. So the segments you are referring to are the ones received by the device.

Ok makes sense :)

> The decryption failure is not something I am able to reproduce using our sample app at all. Have you made any changes? If the decryption fails, then yes, the LowerTransportLayer will eventually time out.

I will try again with the sample app and give you those logs if I see it again 👍🏼 but we only added our custom ApplicationMessage on top of 3.1.5, plus a new build flavor which falls back to your definition (a specific Flutter bug fix). So you don't have any issue getting ConfigModelPublicationSetStatus when it comes from a non-proxy device?

What about composition data log KO_2 and publication set log full_KO ? You never have this behavior ?

roshanrajaratnam commented 3 years ago

OK, sounds good. I do not see any issues with CompositionDataGet or ConfigModelPublicationSet messages sent via the proxy or to the proxy. Let me know how this goes for you.

ghost commented 3 years ago

Hello there. Sorry I didn't have time to look into this issue as we shipped force proxy as a temporary workaround on prod and I was busy with some other priorities! We may find some time to look for it. But first we will try out the new release 🕺

ghost commented 3 years ago

Hello! After upgrading to the latest release, we still have issues with segmented answers. It is behaving better though! I no longer have issues with CompositionDataGet, nor with our custom 2-segment ApplicationMessage. However, for ConfigModelPublicationSet, I still have trouble constructing the Status from segments. I think the behavior has changed, and now I always have these logs when it fails:

D/DoozMeshStatusCallbacks(13381): onBlockAcknowledgementProcessed
E/DefaultNoOperationMessageState(13381): Decryption failed in NetworkLayer : mac check in CCM failed
D/DoozMeshStatusCallbacks(13381): onMessageDecryptionFailed

Even if later it has these two logs:

V/BlockAcknowledgementMessage(13381): Segment 0 of 1 received by peer
V/BlockAcknowledgementMessage(13381): Segment 1 of 1 received by peer

The success rate has increased, so it is better, but for publications it is still not enough and I can't revert our force-proxy workaround. I feel like our FIFO with auto-retry is interfering with your "segment queue", because it often constructs the Status some seconds after our retries are done (3x publication set), when the target device resends one of the segments.

ghost commented 3 years ago

It is still really better, so thanks for your work on the lower layers :)

roshanrajaratnam commented 3 years ago

Hey again, no problem. Does this happen on a custom firmware? Could you share it? I am not able to reproduce this on our SDK.

ghost commented 3 years ago

I will ask our embedded team

ghost commented 3 years ago

@roshanrajaratnam is roshan.rajaratnam@nordicsemi.no a valid email to receive a Google Drive share link ?

roshanrajaratnam commented 3 years ago

android-devs@nordicsemi.no

ghost commented 3 years ago

I just sent the .hex file via Google Drive share

roshanrajaratnam commented 3 years ago

Thanks, I just got it. I will take a look at it later today. What DK is this built for?

ghost commented 3 years ago

This one :) compatible_DK

ghost commented 3 years ago

Hello ! Any update on the subject?

philips77 commented 3 years ago

Hi, we're all on vacation right now. Please expect some delays in responses in July.

ghost commented 2 years ago

Hi there! I see you are closing issues due to inactivity. This issue is still there, and I'm still awaiting support from you guys. I provided a .hex file at the beginning of summer and have had no feedback since, yet this issue is labelled "awaiting user input"?? Please notify me if you find any time to try to reproduce the issue with our firmware 😁

roshanrajaratnam commented 2 years ago

Hi, it was labelled "awaiting user input" a long time ago, before we got a response from you. I have been busy with some other projects. I will look into yours ;)

roshanrajaratnam commented 2 years ago

@R0m4in-dooz could you share your fw again please? Can't seem to find it anymore!

ghost commented 2 years ago

sure, I may have deleted it 😅

Still this mail ?

android-devs@nordicsemi.no

roshanrajaratnam commented 2 years ago

indeed!

ghost commented 2 years ago

sent 👍

roshanrajaratnam commented 2 years ago

Hi, a couple of things here. I find the firmware a bit flaky. Sometimes it provisions and sometimes it doesn't if I disconnect after completing provisioning, before going into configuration. I see this happening only with the debug firmware that you sent.

Also, I noticed the decryption failure that you pointed out earlier in our conversation, which only seems to happen with your firmware. I have done multiple tests with both our example SDK and the debug firmware you sent, and I don't see this happening against the Nordic SDK examples.

Although I notice there is an issue with the proxy node not sending the response/status to acknowledged messages. According to what I see in the logs on my end, I get block acks indicating that all segments are received, and it stops there; I don't seem to receive the response over GATT via the proxy node. So this could be related to the issue you are facing. I would recommend creating a ticket on DevZone so that someone from the technical support could provide support on this case.

roshanrajaratnam commented 2 years ago

@R0m4in-dooz a fix was made a couple of days ago in #458, but it applies to all segmented messages and relates to the sequence number getting reset. The decryption failure in your case is still the same as pointed out earlier in our conversation. Please test it on your end as well, just to see if we are on the same page!

ghost commented 2 years ago

Hello ! Thank you for your time @roshanrajaratnam !

> I find the firmware a bit flaky

It may be possible that the FW I sent is buggy, we made numerous updates since then 😅 (I sent you firmware from July)

> Also, I noticed the decryption failure that you pointed out earlier in our conversation, which only seems to happen with your firmware. I have done multiple tests with both our example SDK and the debug firmware you sent, and I don't see this happening against the Nordic SDK examples.

🤔 ok then, if it is only with our firmware I will tell our embedded team to investigate

> Although I notice there is an issue with the proxy node not sending the response/status to acknowledged messages.

Does this mean that you do indeed see a bug? But you think it happens only with our custom firmware?

> I would recommend creating a ticket on DevZone so that someone from the technical support could provide support on this case.

We recently upgraded the Mesh SDK and are in the process of validating it now. I will test against the new firmware and, if the issue is still there, I can indeed consider posting a ticket on DevZone...

So for you, nothing is wrong with segmented message handling in the current library version? We are on the 3.1.6 official release; maybe the dev branch fixes some issues? I should check the commit history...

> Please test it on your end as well, just to see if we are on the same page!

Sure, I'll try to find some time to make new tests against our updated firmwares and against dev branch ASAP. Thanks again, I'll let you know how it goes 👍

bslisowski commented 1 year ago

@R0m4in-dooz was your embedded team able to resolve the issue of the proxy node not sending response messages? I am having the same issue using your Flutter plugin for my mobile app and ESP32s for nodes.

ghost commented 1 year ago

Hello @bslisowski ,

Please use our repository to post issues! Well, unfortunately I am slowly forgetting our adventures with mesh networks... If I remember correctly, no, we never fixed that bug and could not figure out whether it was a firmware problem or a mobile app problem.

Hi @roshanrajaratnam, I guess you can close this one, it was probably a problem on our side 👍

bslisowski commented 1 year ago

I was going to post an issue on your repository, but I don't think your plugin is the problem since it still persists when using the latest version of the nRF mesh app.

roshanrajaratnam commented 1 year ago

@bslisowski I would recommend creating an issue on d so that our tech support should be able to help you on this issue as this seems to be more or less related to the firmware!

bslisowski commented 1 year ago

I believe I found the issue: the placement of the Bluetooth antenna on our PCB is interfering with the transmission. I was able to increase the transmission power and the problem is almost completely gone.

roshanrajaratnam commented 1 year ago

Awesome, I'll close this issue.