How to handle a DTLS connection lost at the server side in non-queue mode?

sbernard31 commented 6 years ago

These questions concern LWM2M 1.0 and 1.1 over UDP/DTLS.

In LWM2M there are two modes: "standard" and "queue mode". Queue mode is generally used in environments where the client IP is dynamic. Standard mode needs a static IP.

Imagine that the LWM2M registration is persisted but the DTLS session and connection are not. I would like to understand what happens when the server loses the DTLS connection of a registered client. (This could happen for several reasons: server crash, server update, limited connection/session lifetime...)

In queue mode, this is not an issue because the server does not initiate communication: it only starts sending requests after the client initiates a DTLS session/connection. (If the server loses the connection, the client will detect it more or less easily and re-handshake.)

In standard mode, the client IP should not change and the server could send a request at any moment. In that case, if the server loses the DTLS connection but still has the right IP of the device (static IP) in the persisted registration, it could send a request to the device/client if it acts as a DTLS client.

The LWM2M 1.0 specification says : "The client-server roles of DTLS, which indicate who initiates the DTLS handshake, are independent from the client-server relationship of LwM2M." But this seems to be mainly about server initiated bootstrap. (see §7.1.6 LwM2M and DTLS Roles)

If we look at the LWM2M 1.1 specification: "In LwM2M version 1.1 the LwM2M Client is always the TLS/DTLS client." (see 5.2.7. LwM2M and TLS/DTLS Roles)

By the way, the LWM2M 1.1 specification says: "This document augments LwM2M version 1.0. Version 1.1 is backwards compatible to v1.0 with respect to mandatory features." (see 1.2. LwM2M version 1.0)

So, my questions are:

What is the right way to handle a session/connection lost at the server side? I thought the idea was to make the LWM2M server act as a DTLS client, but this is not allowed by the LWM2M 1.1 specification, so did I miss something?
How can the LWM2M 1.1 spec be backwards compatible if it is more restrictive than the 1.0 specification?

Of course, any feedback is welcome, but @hannestschofenig, @boaks, @dnav, I would really appreciate knowing your opinion on this. :pray:

boaks commented 6 years ago

My view is just a pragmatic one: In my experience, clients which are "long term reverse reachable" are so rare that I don't care what must be done to support them best. Even if LWM2M 1.1 breaks backwards compatibility with 1.0 at that point, I'm not sure that affects a noteworthy number of systems. So FMPOV, it's not worth including such a DTLS role exchange as mandatory. But everyone may feel free to include such a feature as optional.

sbernard31 commented 6 years ago

@boaks Thx a lot.

In my experience, clients which are "long term reverse reachable" are so rare that I don't care what must be done to support them best.

In the past, we had more or less the same position at Sierra. But that is changing... we plan to provide more and more devices with static IPs (and so benefit from the "always here"/"server initiated" feature).

hannestschofenig commented 5 years ago

Queue mode has little to do with dynamic IP assignment.

The main use case for queue mode is in environments where IoT devices are intermittently connected, such as sleepy devices.

What is the right way to handle a session/connection lost at the server side

You need to re-establish the state when the state is lost. There is no other way.

FWIW, LwM2M v1.1 is unfortunately not backwards compatible with LwM2M v1.0, even though the version numbers suggest that it is. For the raised issue this is not a problem since nobody implemented the role-reversal of DTLS.

boaks commented 5 years ago

For the raised issue this is not a problem since nobody implemented the role-reversal of DTLS.

sbernard31 implemented it in californium/scandium and so leshan uses this role-reversal. There is "a little more" added to support the reverse direction (more flexible request/response matching, when DTLS is used).

sbernard31 commented 5 years ago

Queue mode has little to do with dynamic IP assignment. The main use case for queue mode is in environments where IoT devices are intermittently connected, such as sleepy devices.

But in our experience, devices behind NAT (dynamic IP assignment) are a very common use case (even more frequent than sleepy devices). So as an implementer confronted with real-life use cases, I need to deal with this. (By the way, the specification says: "Any LwM2M Clients behind a NAT can use Queued Mode.")

Anyway, queue mode is not the topic of this issue, as queue mode is client-initiated communication. So there is no issue with fail-over, as it is up to the client to re-establish the state.

So my question is about server-initiated requests after a fail-over; this concerns only standard mode (non-queue mode).

And for now, I cannot see any solution other than DTLS role exchange, and I see that this is not allowed in LwM2M 1.1. :confused: So what is the recommended way or alternative?

For the raised issue this is not a problem since nobody implemented the role-reversal of DTLS

DTLS role exchange seems to work with californium/scandium (and so is used by Leshan) and with wakaama/tinydtls. Here is a resource about DTLS role exchange.

You need to re-establish the state when the state is lost. There is no other way.

OK, but if DTLS role exchange is not allowed, the server has lost its state (because of a failure or redeployment), and I need to make a server-initiated communication, I cannot see how I can do that without DTLS role exchange.

hannestschofenig commented 5 years ago

I disagree with the statement of the initially filed issue, namely

Queue mode is generally used in environments where the client IP is dynamic.
Standard mode needs a static IP.

The "standard mode" does not need a static IP address and the queue mode is generally used for sleepy devices.

We changed the way we use server-initiated bootstrapping in v1.1, which in our view works better in real world deployments. It is interesting to know that you implemented server-initiated bootstrapping with the DTLS server role reversal in Leshan though.

Regarding your questions:

What is the right way to handle a session/connection lost at the server side? I thought the idea was to make the LWM2M server act as a DTLS client, but this is not allowed by the LWM2M 1.1 specification, so did I miss something?

The role reversal change in v1.1 really only concerned server-initiated bootstrapping, not the communication between the LwM2M Server and the LwM2M Client. What you should actually do depends, of course, on why you lost the connection on the server side. If the server crashed then you have to wait till the clients connect again since you have lost all your state. If the device changed its IP address then you have to wait till it re-registers again since otherwise you do not know where to send any message. If the device entered a sleep mode then you have to wait till it wakes up again and sends a registration update message. If the NAT binding expired without the device sending registration updates or keepalive messages then you will not be able to send a message through the NAT to the IoT device anymore.
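
As a rough illustration of the four cases listed above, here is a minimal Python sketch that maps each cause of loss to what the server can do about it. The enum and function names are hypothetical and are not taken from Leshan, Californium or the LwM2M specification; the wording of each action paraphrases the paragraph above.

```python
from enum import Enum, auto


class LossCause(Enum):
    """Hypothetical labels for the causes discussed in the thread."""
    SERVER_CRASHED = auto()
    DEVICE_IP_CHANGED = auto()
    DEVICE_ASLEEP = auto()
    NAT_BINDING_EXPIRED = auto()


# What the server can do in each case, paraphrasing the paragraph above.
RECOVERY = {
    LossCause.SERVER_CRASHED: "wait until the client connects again",
    LossCause.DEVICE_IP_CHANGED: "wait until the client re-registers",
    LossCause.DEVICE_ASLEEP: "wait for a registration update on wake-up",
    LossCause.NAT_BINDING_EXPIRED: "cannot reach the device through the NAT",
}


def recovery_action(cause: LossCause) -> str:
    """Return the server-side recovery action for a given cause of loss."""
    return RECOVERY[cause]


if __name__ == "__main__":
    for cause in LossCause:
        print(f"{cause.name}: {recovery_action(cause)}")
```

In all four cases the server ends up waiting for the client, which is the behaviour the rest of the thread questions for fixed-IP devices.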

What cases have you been thinking of that require the server to initiate communication from scratch using the DTLS role reversal?

How can the LWM2M 1.1 spec be backwards compatible if it is more restrictive than the 1.0 specification?

A LwM2M Bootstrap-Server and a LwM2M Server are only backwards compatible if they implement v1.0 in addition to the newer v1.1.

sbernard31 commented 5 years ago

I disagree with the statement of the initially filed issue

I was just talking about our experience, based on use cases we faced in our day-to-day work at Sierra Wireless and on feedback from Leshan users/contributors. Maybe your experience is different.

It is interesting to know that you implemented server-initiated bootstrapping with the DTLS server role reversal in Leshan though.

There is a misunderstanding: I didn't talk about server-initiated bootstrapping, I talked about the LWM2M server. Just to be clear, we didn't implement server-initiated bootstrapping at all.

The role reversal change in v1.1 really only concerned server-initiated bootstrapping

Good to know. Does that mean role exchange could possibly be used for a LWM2M server?

If the server crashed then you have to wait till the clients connect again since you have lost all your state.

In my case, I just lost my DTLS state.
Registration is persisted, so I know device IP addresses.
So if the address is fixed, we could imagine that the server initiates the DTLS connection.

If the device changed its IP address then you have to wait till it re-registers again since otherwise you do not know where to send any message.

In our experience, for standard mode, users expect that if a device is registered, it must be "reachable". That's why, for dynamic IP environments, they generally use queue mode. But maybe this is because most of the dynamic IP environments we faced involve NAT.

If the device entered a sleep mode then you have to wait till it wakes up again and sends a registration update message.

That's OK.

If the NAT binding expired without the device sending registration updates or keepalive messages then you will not be able to send a message through the NAT to the IoT device anymore.

In our experience, queue mode is OK for NAT (no fail-over problem, as the device initiates communication).

What cases have you been thinking of that require the server to initiate communication from scratch using the DTLS role reversal?

I will try to explain the use case in more detail. So the use case is about using DTLS, "standard mode" with a fixed IP, and a LWM2M server (not about bootstrap).

The device registers to the LWM2M server and initiates the DTLS connection. The server persists the registration (Leshan is able to do that) but not the DTLS state (Californium is not able to do that, and I don't currently know of any implementation that is). The device will not send many updates, as it is always connected and has a fixed IP. The server is able to send downlink requests to the device at any moment. (In our experience this is expected by users.) Now the server is redeployed. So we lose the DTLS state but still have the device IP address, which is stored in the persisted registration:

Without DTLS role exchange, the server will not be able to connect to the device anymore. It must wait until the device contacts it again. But the device has no way of knowing that the server lost the connection, so it could be a very long time before the connection is established again. In our experience, this service interruption is not really appreciated or expected.
With DTLS role exchange, the server just initiates the DTLS connection and is then able to send a new downlink request.
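
The two outcomes above can be sketched as a small Python model. This is only an illustration under stated assumptions: the `Server` class, its fields and the returned strings are hypothetical, no real DTLS handshake is performed, and nothing here reflects the actual Leshan or Californium API.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Persisted across redeployments: endpoint name -> (ip, port).
    registrations: dict
    # In-memory only: (ip, port) -> DTLS session marker, lost on redeploy.
    sessions: dict = field(default_factory=dict)
    # Whether the server may act as the DTLS client (role exchange).
    role_exchange: bool = False

    def redeploy(self) -> None:
        """Simulate a redeployment: DTLS state is lost, registrations are kept."""
        self.sessions.clear()

    def send_downlink(self, endpoint: str) -> str:
        addr = self.registrations[endpoint]
        if addr in self.sessions:
            return "sent"
        if self.role_exchange:
            # The server initiates the handshake towards the known fixed address.
            self.sessions[addr] = "server-initiated"
            return "handshake then sent"
        return "blocked until the device contacts the server"


if __name__ == "__main__":
    for allowed in (False, True):
        server = Server({"device-1": ("192.0.2.10", 5684)}, role_exchange=allowed)
        server.sessions[("192.0.2.10", 5684)] = "client-initiated"
        server.redeploy()
        print(allowed, "->", server.send_downlink("device-1"))
```

The address `192.0.2.10` is a documentation-range placeholder; the point of the model is only that the persisted registration still holds a usable address after the DTLS state is gone.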

We are currently exploring the DTLS role exchange approach because we don't see any other solution (except, of course, persisting the DTLS state). And currently our first ~investigation~ experimentation seems to show that some implementations already work.

Hoping this is clearer.

hannestschofenig commented 5 years ago

Thanks for the detailed description. I believe I now understand the issues.

hannestschofenig commented 5 years ago

After the lifetime of the registration expires the LwM2M Client will send a Registration Update message (or some other message) and will find out that the state vanished. It will then re-establish the DTLS security state.

Would this work for you?

sbernard31 commented 5 years ago

Not really, as we would like to be able to contact the device after a redeployment. Between the redeployment and the device's next registration update, the service will be unavailable: users will not be able to send downlink requests from the server to any of the devices. This interruption of service could be very long.

Initiating the DTLS connection from the server side could make redeployment transparent for end users.
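
To give a feel for how long that interruption can be, here is a small Python calculation. It assumes, for illustration only, that the device sends a registration update at a fixed interval and that the server can reach it again only at the next update; the function name and the numbers are hypothetical.

```python
def downlink_outage_seconds(update_interval_s: int, since_last_update_s: int) -> int:
    """Time from redeployment until the device's next registration update.

    update_interval_s: how often the device sends a registration update.
    since_last_update_s: time elapsed since the last update when the
    server was redeployed.
    """
    if not 0 <= since_last_update_s <= update_interval_s:
        raise ValueError("since_last_update_s must be within the update interval")
    return update_interval_s - since_last_update_s


if __name__ == "__main__":
    # Illustrative values: one update per day, redeployment one hour after an update.
    outage = downlink_outage_seconds(86_400, 3_600)
    print(f"downlink unavailable for about {outage / 3600:.0f} hours")
```

With a long update interval and a redeployment shortly after an update, the outage approaches the full interval, which is why a server-initiated handshake is attractive in this scenario.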

boaks commented 5 years ago

Just to mention it (again).

https://github.com/eclipse/wakaama/issues/398#issuecomment-435471120

If the session id and premaster secrets are stored, and a resumption handshake is supported, FMPOV a "role exchange" in a resumption handshake is easier, though it doesn't use "credential roles". The downside would be a stored "premaster secret".
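
A minimal Python sketch of that idea: persist the per-peer session id and secret so that, after a restart, the server can pick a resumption handshake instead of a full one. All names are hypothetical, the secret is stored as a plain hex string purely for illustration (a real deployment would need to protect it, which is the downside mentioned above), and no real DTLS library is involved.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class StoredSession:
    peer: str            # "ip:port" of the device
    session_id_hex: str  # DTLS session id, hex encoded
    secret_hex: str      # secret needed for resumption, hex encoded


def save_sessions(sessions: list) -> str:
    """Serialize sessions so they survive a server restart."""
    return json.dumps([asdict(s) for s in sessions])


def load_sessions(blob: str) -> dict:
    """Rebuild the peer -> session map after a restart."""
    return {item["peer"]: StoredSession(**item) for item in json.loads(blob)}


def choose_handshake(store: dict, peer: str) -> str:
    """Use a resumption handshake when a stored session exists for the peer."""
    return "resumption" if peer in store else "full"


if __name__ == "__main__":
    blob = save_sessions([StoredSession("192.0.2.10:5684", "a1b2", "00ff")])
    store = load_sessions(blob)
    print(choose_handshake(store, "192.0.2.10:5684"))
    print(choose_handshake(store, "192.0.2.11:5684"))
```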

sbernard31 commented 5 years ago

a "role exchange" in a resumption handshake

This is something we could experiment with too, but currently I don't see how it would be easier, because I'm not sure there are "credential roles" issues.

boaks commented 5 years ago

because I'm not sure there are "credential roles" issues.

That depends.

FMPOV, some discussions in the past indicated that a "role exchange" is at least not obvious (issue #206; OK, the discussion was between two people who both have their doubts about the usage).

Maybe it's also not "really difficult". (The link you provided in "And currently our first investigation seems" points to your own work, so the term investigation is somewhat misleading.)

With resumption, there is not even a discussion about the credentials :-). But I also understand that if you have a solution that works, there is no need to adapt it.

sbernard31 commented 5 years ago

The link you provided in "And currently our first investigation seems" points to your own work, so the term investigation is somewhat misleading.

Maybe I should use "experimentation" instead of "investigation"? Sorry if this was misleading.

If by "my own work" you mean the "result of the investigation/experimentation", I'm OK with that.

But if you mean scandium and tinyDTLS themselves, I feel this is not appropriate.
As far as I remember, I did nothing about that in tinyDTLS. And for scandium, I just made small modifications related to this a very long time ago... Almost everything was already there and, to be honest, when I did that I just thought I was fixing some bugs.

boaks commented 5 years ago

But if you mean scandium and tinyDTLS themselves, I feel this is not appropriate.

My intention was focused on "role exchange".

sbernard31 commented 5 years ago

And I wanted to say => If you mean "role exchange" in scandium and tinyDTLS themselves, I feel this is not appropriate.

(I did some tests but I didn't implement it)

sbernard31 commented 2 years ago

@hannestschofenig, I feel a bit frustrated here because the issue is closed but I still don't know the right way to handle the situation I described above.

OpenMobileAlliance / OMA_LwM2M_for_Developers

How to handle a DTLS connection lost at the server side in non-queue mode? #410