First of all, observe/notification is defined in RFC 7641. There, in 4.5 Server-Side Requirements - Transmission:
The client's acknowledgement of a confirmable notification signals that the client is interested in receiving further notifications. If a client rejects a confirmable or non-confirmable notification with a Reset message, or if the last attempt to retransmit a confirmable notification times out, then the client is considered no longer interested and the server MUST remove the associated entry from the list of observers.
To me this seems like the implementation is within the "spec", while your expectation isn't. Sometimes not all cases are covered in a specific way, because for other cases a different way is preferable.
What makes me wonder about your description:
A "in transit notification" should be replaced by the "new notification" keeping the state of retransmission (number and current timeout, see ObserveLayer.NotificationController.onRetransmission()). But your description is different.
Could you please provide a capture to check that?
Client is stopped
What does that mean? Stop the Java engine?
Is there a way to know when a notification-related exchange has completed, either successfully or by timing out?
Do you mean the transmission of the notification? Then try "MessageObserver.onTransferComplete".
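For illustration, a minimal, untested sketch of that idea (not from this thread): attach a MessageObserverAdapter, Californium's no-op implementation of MessageObserver, to the outgoing notification before it is sent. Only onTransferComplete is confirmed above; the helper class name and the use of the other callbacks are assumptions.

import org.eclipse.californium.core.coap.MessageObserverAdapter;
import org.eclipse.californium.core.coap.Response;

// Hypothetical helper, not part of Californium.
public class NotificationTracker {

    public static void track(Response notification) {
        notification.addMessageObserver(new MessageObserverAdapter() {

            @Override
            public void onTransferComplete() {
                // the notification exchange completed successfully
            }

            @Override
            public void onTimeout() {
                // the last retransmission attempt timed out
            }

            @Override
            public void onCancel() {
                // the notification was canceled, e.g. replaced or the endpoint was stopped
            }
        });
    }
}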
Anyway, once I understand what is really done, I guess the answer will be simpler.
To me this seems like the implementation is within the "spec", while your expectation isn't. Sometimes not all cases are covered in a specific way, because for other cases a different way is preferable.
It indeed looks like canceling the observation in case of timed-out notifications is the specified behavior; thanks for the clarification.
A "in transit notification" should be replaced by the "new notification" keeping the state of retransmission (number and current timeout, see ObserveLayer.NotificationController.onRetransmission()). But your description is different.
Is this onRetransmission called every time the notification send is retried? If so, our case is probably a bit different:
By client stop I mean that the (Leshan) client's CoapServer and its endpoints are stopped, using their stop method.
So the stop will stop the Endpoint, which in turn stops the Connector. In this situation, do you still think that the in-progress notification should be cleaned up by the next notification? In this case, the next notification will come after CoapServer#start is called and a resource change is triggered. I suspect that the stop/start might affect things here.
I will see if I can provide a capture somehow, but it might be difficult due to company policies :/
I guess no one has cared about observe/notify and start/stop in the past 10 years ;-).
In this situation, do you still think that the in-progress notification should be cleaned up by the next notification?
While stopped, I don't think anyone should be sending a new notification, or should they?
I currently have no idea what will be best for observe/notify relations. For a common exchange it's easier, because that is more or less "short-lived". I guess the answer depends more on the use case for the stop/start, and that then gives an idea about the intention regarding the observe/notify relation.
but it might be difficult due to company policies
That's OK. I guess even a unit test with stop/start will show the issue. I'm no longer in a job where I can spend the required time to fix such things in the short term. Let's see how long it takes.
By the way, I'm curious: in the end you want to emulate a "lot of clients", and the obvious solution would be to use NIO (non-blocking). You preferred a workaround, which now shows some issues. Maybe for the short term: if I want to simulate quite a lot of clients (1,000,000 or so), I usually just use RPK clients and do frequent new handshakes. It depends on some settings (identity), but with that the connection and exchange stores can be filled up a lot. And once they are filled, I don't see a large difference between having 2,000 or 200,000 clients really active.
I guess no one has cared about observe/notify and start/stop in the past 10 years ;-).
Yeah, I guess this whole approach in our use case has not been the most common one :D
By the way, I'm curious: in the end you want to emulate a "lot of clients", and the obvious solution would be to use NIO (non-blocking). You preferred a workaround, which now shows some issues.
OK, you got my attention, could you elaborate on the NIO approach you had in mind?
But yes, basically this start/stop workaround has been a pain, and if there were some other way to look at this while keeping the resource usage sensible, I would be very interested in hearing more. I guess the main reason for this approach was that Leshan/Californium didn't really allow keeping the clients on all the time, CPU/thread-resource wise. At least that was my impression, as the communication began to break the more Leshan clients we had on simultaneously. But again, this implementation of mine is probably quite naive in some ways.
The requirements are basically that we need to always use the same DTLS session, X.509 certificates, and handle these long-lasting observation relations.
could you elaborate on the NIO approach you had in mind?
You already got Simon's answer on that in your Leshan issue. He pointed to Californium issue #1203.
Californium already supports using a shared thread pool for many parts. But the synchronous DatagramSocket unfortunately requires a dedicated thread for the blocking receive function. When using Californium as a central server that doesn't matter, but when using it for massive numbers of client endpoints it does.
NIO may overcome that, but it will come with some implementation and, mainly, test effort.
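To make the NIO idea concrete, here is a minimal sketch in plain java.nio (not Californium code): many non-blocking DatagramChannels registered with one Selector, so a single thread can handle the receive path of many client endpoints instead of one blocked thread per DatagramSocket.

import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.DatagramChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class NioReceiveLoop {

    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        // register many "client" sockets with the same selector
        for (int i = 0; i < 1000; i++) {
            DatagramChannel channel = DatagramChannel.open();
            channel.bind(new InetSocketAddress(0));
            channel.configureBlocking(false);
            channel.register(selector, SelectionKey.OP_READ);
        }
        ByteBuffer buffer = ByteBuffer.allocate(2048);
        while (true) {
            // one blocking call serves all channels, instead of one thread per socket
            selector.select();
            for (SelectionKey key : selector.selectedKeys()) {
                DatagramChannel channel = (DatagramChannel) key.channel();
                buffer.clear();
                InetSocketAddress source = (InetSocketAddress) channel.receive(buffer);
                if (source != null) {
                    buffer.flip();
                    // hand the datagram over to the matching endpoint for processing
                }
            }
            selector.selectedKeys().clear();
        }
    }
}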
Anyway, if you can spend some time providing something like a unit test for your issue, I will check what could be done about it.
Just as a quick shot, not tested, and not really sure if it addresses the issue at all: add this snippet to ObserveLayer.NotificationController
@Override
public void onCancel() {
    exchange.execute(new Runnable() {

        @Override
        public void run() {
            // also cancel a pending follow-up notification, if there is one
            ObserveRelation relation = exchange.getRelation();
            Response next = relation.getNextNotification(response, true);
            if (next != null) {
                next.cancel();
            }
        }
    });
}
and then test it.
I tried to come up with some kind of unit test for this, but to be honest I could not really wrap my head around the Californium code. If you have some pointers on how to proceed with the testing, I can try again at some point, but for now I couldn't really do anything :(
Californium already supports using a shared thread pool for many parts. But the synchronous DatagramSocket unfortunately requires a dedicated thread for the blocking receive function. When using Californium as a central server that doesn't matter, but when using it for massive numbers of client endpoints it does.
I guess this is the part that the Worker class handles inside the DTLSConnector? At least that seems to be outside the executors one can pass in via DTLSConnector#setExecutor.
Some examples of unit tests around observe:
I guess this is the part that the Worker class handles inside the DTLSConnector?
Yes, DTLSConnector, Worker receiver and DTLSConnector, Receiver
If possible, please retest with PR #2246 and check if it fixes your issue.
Any news on this? Can we close it?
Hi, unfortunately I've been busy with other things and haven't been able to test this yet. Will try to do it this week and share the results.
As long as I know that you plan to test it, I will keep it open.
I'm currently considering scheduling the minor release 3.12 for June 6th. So, if it's possible to get some feedback until then, it may also be possible to incorporate this in that next minor release.
Hiya. Replicating the situation is pretty difficult in the real environment, but based on the local testing I've done, this seems to do the trick. No more "in transit" logging on the next notification. I will keep an eye out in case the behavior shows up at some point.
I think this can be closed for now at least.
Thanks for all the help Achim, appreciate it 🙏
Background
We are using Leshan 2.0.0-M14 (Californium 3.9.1) in our LwM2M client implementation, communicating with a server using Leshan 2.0.0-M9 (Californium 3.7.0). We are actually running multiple clients on the same machine, connecting to the same server instance.
Posting this here as it concerns the way observations are handled in Californium.
The issue
In our use case, the server observes some resources on the client. We have noticed that the notifications from the client are not sent to the server in a situation where the client (i.e. the client's CoapEndpoint) has been stopped while the previous notification was not yet ACKed.
For example (a rough sketch in code follows the list):
1. Client is started, and the server observes a resource of the client.
2. Client sends a notification on resource change, but the server, for some reason, does not answer immediately.
3. Client is stopped, whilst the server has still not ACKed the notification.
4. Later, the client is started again and tries to send a new notification to the server.
5. The notification send fails, as the previous notification is still "in transit".
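To make the sequence above concrete, here is a rough, untested sketch using plain Californium classes; the resource name, port, and timing are made up, and it assumes the server side has already registered an observation on the resource (in our setup Leshan drives these calls).

import org.eclipse.californium.core.CoapResource;
import org.eclipse.californium.core.CoapServer;

public class StopStartScenario {

    public static void main(String[] args) throws Exception {
        CoapResource resource = new CoapResource("example");
        resource.setObservable(true);

        // the LwM2M "client" side runs a CoapServer for its resources
        CoapServer client = new CoapServer(5683);
        client.add(resource);
        client.start();

        // step 2: a resource change triggers a notification that is not ACKed yet
        resource.changed();

        // step 3: stop while that notification is still in transit
        client.stop();
        Thread.sleep(5000);

        // steps 4 and 5: start again; the next resource change produces a notification
        // that gets postponed behind the old "in transit" one
        client.start();
        resource.changed();
    }
}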
From our clients' logs:
When we stop the LwM2M client, it calls the stop method on the CoapServer. Relevant Leshan snippet in here. Looks to me like the ObserveRelation is keeping the previous notification, despite it being in "canceled" state, and then this leads to postponing the future notifications in here.
Questions
Is there something we should do regarding stop or start, so that it won't cause this on later notifications?
Is there a way to know when a notification-related exchange has completed, either successfully or by timing out?