ConnectivityFoundry / AwaLWM2M

Awa LWM2M is an implementation of the OMA Lightweight M2M protocol in C.
BSD 3-Clause "New" or "Revised" License
102 stars 66 forks source link

Handle sendto() function failure conditions #333

Closed pranit-sirsat-imgtec closed 7 years ago

pranit-sirsat-imgtec commented 7 years ago

The sendto() function can fail because of multiple conditions. Currently we are handling few conditions. If there is an error due to another condition, the process gets stuck in repeatedly trying to call sendto() function and failing with the same error condition. For example, when awa session is established and after that network interface goes down(ex. ethernet cable is removed), the sendto() fuction fails with error 'ENETUNREACH'. This error is not handled and the process gets stuck in repeatedly calling sendto() fuction. There are also other possible error conditions. In this change handling all error conditions, so that process does not get stuck in endless loop.

Ref: AWA-311 Signed-off-by: Pranit Tanaji Sirsat Pranit.Sirsat@imgtec.com

boyvinall commented 7 years ago

Ref: AWA-311

@pranit-sirsat-imgtec do you mean #311 ?

pranit-sirsat-imgtec commented 7 years ago

@pranit-sirsat-imgtec do you mean #311 ?

yes, i added that in commit message. i am using the github for first time so don't know if something has gone wrong

boyvinall commented 7 years ago

@pranit-sirsat-imgtec No problem. To refer to github issues or pull requests, take a look at https://help.github.com/articles/autolinked-references-and-urls/. Always worth checking the "preview" tab when you're writing comments too, at least until you get used to it.

datachi7d commented 7 years ago

@boyvinall - this looks like it removes error checking. Can you review this again please?

datachi7d commented 7 years ago

@pranit-sirsat-imgtec is it possible to write some tests that show the issue without your fix? This will help us understand how your changes fix the issue you saw.

pranit-sirsat-imgtec commented 7 years ago

Hi @datachi7d, thank you for your comment.

can you please explain how it removes the error checking ?

the if else condition was previously checking for few possible errors on which sendto() could not have recovered and returning NetworkSocketError_SendError. for other error conditions, considering those conditions will recover, it was continuing in while loop and calling sendto(). Now if the error condition is unrecoverable then the process is stuck in while loop and useless after that. one example as wrote in commit message was of ENETUNREACH.

now in the new code, as per my understanding EWOULDBLOCK and EINTR are immediately recoverable so remaining in while loop, otherwise for other conditions it will return NetworkSocketError_SendError. this way it will never get stuck in while loop. in case the error was recoverable, the NetworkSocket_Send() function will get called again after some time with same coap data (at least i had checked about coap_transactions)

list of possible errors http://pubs.opengroup.org/onlinepubs/009695399/functions/sendto.html

pranit-sirsat-imgtec commented 7 years ago

as we use the awa_clientd daemon to reproduce the issue,

1] get awa_clientd daemon connected with device server 2] call awa-client-get on command line, should work 3] down the interface, ifconfig down ...... 4] wait for few second i think the poll duration in main loop (some 3-4 sec.) 5] call awa-client-get on command line, should fail, with timeout error

for test case, if i get time i will take look at other tests, but for this we will need to up/down interface. do you have fixed name for interfaces on test device ? will that affect the CI system ?

boyvinall commented 7 years ago

@datachi7d I don't think this does remove error checks? Previously, the code was testing a bunch of lastError values but handling them all the same way. There were potentially some values of lastError that were not tested for. The patch has reduced identical handling into a single else which also catches any other values of lastError .. so actually it's adding error handling?

boyvinall commented 7 years ago

@datachi7d However, the point about a test case to catch this in future is a good one. @pranit-sirsat-imgtec we would welcome that if you get the chance, thanks.

datachi7d commented 7 years ago

@boyvinall my mistake - I was confused by the title combined with the removal of code (I was expecting something to be added). So is there a possibility that some of the returned values caught by the else are not errors? For example EAGAIN.

datachi7d commented 7 years ago

@pranit-sirsat-imgtec yes this looks tricky to test. Having a mechanism to test this scenario would help with creating transparency of what's happening with the network stack (#319 or #313) as it would allow testing of conditions that lead to network related errors. Using a tun/tap device to create a pseudo interface that can be managed from the CI tests (ifconfig tun0 down) could be a place to start.

@DavidAntliff do we run our CI jobs in docker containers? If the tests aren't in a container getting permission to create the tun/tap device means sudo-ing the tests (very bad).

datachi7d commented 7 years ago

@DavidAntliff looks like the Jenkins CI jobs are in a container but what about the Travis job?

DavidAntliff commented 7 years ago

@datachi7d yes, the tests run in docker containers. I don't recall there being any issue with using tun/tap inside them, but I can't be sure. Travis uses Docker too, but implicitly.