Closed rahulagarwal02 closed 2 years ago
That indicates that an outgoing message with MID 3303 is already "pending".
That either indicates that the exchange cleanup leaks (I'm not aware of such bugs), or that "too many messages" are sent, which is also not indicated by the numbers you provided.
The MID is a 16-bit value, so usually 65000 messages could be sent. With a default EXCHANGE_LIFETIME of 247s that means "65000 in 247s".
If I use your values, 1.5 h at 10 msgs/s, that results in 54000 messages in total. Did you adapt the value of EXCHANGE_LIFETIME? Could you please provide the used "Californium.properties"?
And consider migrating to 3.4.0. :-)
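The arithmetic above can be checked with a quick back-of-the-envelope sketch (the 247s lifetime and the 10 msg/s rate are the values from this thread; the rest is plain calculation):

```java
// Back-of-the-envelope check of the MID budget discussed above:
// 65536 MIDs, default EXCHANGE_LIFETIME of 247s, reported rate of 10 msg/s.
public class MidBudget {
    public static void main(String[] args) {
        int midSpace = 1 << 16;          // MID is a 16-bit value
        int exchangeLifetimeSec = 247;   // default EXCHANGE_LIFETIME in seconds
        double maxRate = (double) midSpace / exchangeLifetimeSec;
        System.out.printf("sustainable rate before MID reuse: %.0f msg/s%n", maxRate);

        int rate = 10;                   // reported notify rate, msg/s
        int durationSec = 90 * 60;       // 1.5 h
        System.out.println("messages sent in 1.5 h: " + (rate * durationSec)); // 54000
    }
}
```

So at the reported rate the client stays well below the point where MIDs would have to be reused, which is why the error is suspicious.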
Thanks for the response. How does the exchange cleanup happen? I am using the Qt CoAP client. Is there anything that needs to be done on the client side for cleanup? Attaching the Californium.properties file. Californium.properties.txt
How does the exchange cleanup happen?
The implementation expects that InMemoryMessageExchangeStore.remove(final KeyMID messageId, final Exchange exchange) is called. On the success path, that's done when receiving the ACK for the CON notify. If that ACK is missing, then the server is intended to cancel the observation and, with that, remove the exchange.
Do you always use CON notifies? Does your client re-register observes?
I'm not sure what happens. Is it possible for you to provide a capture? Only for the observe registration and a couple of notifies and ACKs, e.g. about 50 messages or so.
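The bookkeeping described above can be illustrated with a much-simplified toy model (this is not Californium's actual code, just a sketch of how a single missing remove() eventually produces the reported error once the 16-bit MID counter wraps around):

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of a per-peer MID registry: register on send, remove on ACK.
// If one ACK never arrives and the exchange is never cleaned up, the entry
// leaks; once the 16-bit MID counter wraps around, registration fails.
public class MidRegistry {
    private final Map<Integer, String> pending = new HashMap<>();
    private int counter = 0;

    int send(String exchange) {
        int mid = counter++ & 0xFFFF;                 // MIDs are 16 bit
        if (pending.putIfAbsent(mid, exchange) != null) {
            throw new IllegalArgumentException(
                    "generated mid [" + mid + "] already in use");
        }
        return mid;
    }

    void onAck(int mid) {                             // success-path cleanup
        pending.remove(mid);
    }

    public static void main(String[] args) {
        MidRegistry r = new MidRegistry();
        r.send("R14");                                // ACK lost: onAck() never called
        for (int i = 1; i < 65536; i++) {             // all other exchanges complete
            r.onAck(r.send("R" + i));
        }
        try {
            r.send("R65536");                         // counter wrapped back to the leaked MID
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());       // prints: generated mid [0] already in use
        }
    }
}
```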
Yes. Only using CON notify. When the issue happens, the client detects it and re-registers the observer. Then messages start coming to the client but stop again after around the same time (1.5 h). I will try to get the capture. Is there any log I can check to confirm if the ACK is missing?
RFC 7641 - 4.5.2 Advanced Transmission
For this purpose, the server MAY optimize the transmission process by aborting the transmission of the old notification (but not before the current transmission attempt is completed) and starting a new transmission for the new notification (but with the retransmission timer and counter of the aborted transmission retained).
and
ACK_TIMEOUT=10000
With that, a missing ACK would also cause a quiet phase of 10s, and you will probably lose 100 notifies.
Without the capture, it will be very hard to reproduce this. At least you will have to be very patient without a capture, until I have time to search for it.
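The quiet-phase estimate above follows directly from the two numbers in this thread (ACK_TIMEOUT=10000 from the properties, ~10 notifies/s reported):

```java
// How many notifies fall into one quiet phase while a CON notify
// waits for its (missing) ACK.
public class QuietPhase {
    public static void main(String[] args) {
        int ackTimeoutMs = 10000;    // ACK_TIMEOUT from Californium.properties
        int notifyRatePerSec = 10;   // reported notify rate
        int lost = ackTimeoutMs / 1000 * notifyRatePerSec;
        System.out.println("notifies lost during one quiet phase: " + lost); // 100
    }
}
```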
Just to mention:
Besides the MID/Exchange housekeeping, the second source may be a call of Message.setMID(int mid) for some messages sent to that peer. You may check that.
I tried to reproduce it with a setup similar to yours. Therefore I reduced the timer in PlugtestServer Observe from 5000 to 100. Using cf-browser-3.4.0 for the observing client didn't show the behavior you described. If possible, could you also test your client against the modified plugtest-server?
Besides the MID/Exchange housekeeping, the second source may be a call of Message.setMID(int mid) for some messages sent to that peer. You may check that.
Checked that. We are not calling setMID
Once the exception which I mentioned in the query occurs, notifies permanently stop coming until we re-register the observer.
Really strange ... I still have no guess what caused that ... If possible, could you try to retest with your client and the modified plugtest-server as mentioned above? Maybe with 3.4.0 first. If it doesn't appear there, then maybe with 2.6.3 (but I would recommend updating to 3.4.0).
Sure.. trying both things. Will let you know the result. Thanks
Checked that. We are not calling setMID
And you also don't reuse some "Response" instances?
Anyway, I will add some diagnostics to 3.5.0 so that, in the future, it gets easier to see if the MID is also provided outside of the MID provider.
A first trace: I now used a smaller time (25ms), below my RTT (30ms). That shows some unintended delays.
Edit: sorry, I just mixed up the destination; still no clue what causes that error.
Additional info: We have 4 clients which register observers to the server for the same endpoint. Of these, 3 clients use the Californium CoAP library; one client uses the Qt CoAP library. We are not facing any issues with the Californium clients. This issue occurs only with the client using the Qt CoAP client.
We are not facing any issues with the Californium clients. This issue occurs only with the client using the Qt CoAP client.
That makes it hard for me to test. Is the Qt CoAP client your own development? Or is it publicly available?
I assume that even if the client does something wrong, it triggers a bug on the server side. If that bug could be located, then it would be possible to fix it, at least for 3.5.0.
We are using a commercial license, but it's also available under the GNU General Public License. https://doc-snapshots.qt.io/qtcoap/index.html
Thanks. Do you know if that test could be executed with the example client?
Yes, it should be possible. We are using the example client as a reference for development. We need to create an observer by providing the URL.
Are there binaries ready for download? I didn't find any. (I don't want to set up the toolchain and build the examples on my own, sorry ...)
Just as an additional idea: the properties contain DEDUPLICATOR_AUTO_REPLACE=true. Retest with this set to false.
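For reference, a minimal Californium.properties fragment with the two keys mentioned in this thread (ACK_TIMEOUT as reported above, DEDUPLICATOR_AUTO_REPLACE flipped for the retest):

```
ACK_TIMEOUT=10000
DEDUPLICATOR_AUTO_REPLACE=false
```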
QT => 11 GB ... nightmare ... ... and the installation finally "fails on my machine".
yes.. it's tedious work :)
ok.. will try it
(I wasn't able to build the client. So it will take some time until I try that again.)
Any update from your side? Does the issue occur using the modified plugtest-server?
We upgraded to version 3.4.0 and also tried with DEDUPLICATOR_AUTO_REPLACE set to false. The issue still occurred. Yet to try with the modified plugtest-server. Also trying to get the capture.
Any updates? Does "generated mid [3303] already in use, cannot register Exchange[R14]" occur with the plugtest-server?
Got some more clues about when this issue occurs: The client registers 4 observers on different endpoints. Of these 4, the server is not sending any data on 2, since those are based on some events. Of the remaining 2, on one endpoint data is sent every ~100 ms; on the other it's sent every ~250 ms.
This issue occurs when there are more than 3 endpoints registered. When we remove one endpoint registration (where no data transfer is happening), the issue doesn't occur.
Just to make sure: it's still the "generated mid [????] already in use, cannot register Exchange[???]", yes?
And does the issue occur with the plugtest-server?
Are you using plain coap, or coap over dtls?
I added the information about the other peer's identity to the diagnose message. Please update to the current master.
And please also report whether this occurs only with your server, or with the plugtest-server as well. If it doesn't occur with the plugtest-server, and your server doesn't set the MID, then maybe you use a specific CoapEndpoint setup. Therefore, please also report whether you use plain coap or coap over dtls.
We are using plain coap, for communication between services running on the same device. All the communication is on localhost. Will let you know the result with the plugtest-server soon.
Will let you know the result with plugtest-server soon.
Any results update?
Sorry for the late response. Can you guide me on how to use the plugtest-server, or is there any document I can refer to? To match the actual scenario, the server needs to create a few endpoints. There will be multiple clients registering observers on these endpoints. From the server we need to send data periodically, every ~100 ms.
I have got the capture but am unable to attach the file. Even a .zip file is not getting attached. Is there any other way to share it?
"drag and drop" just into the comment field doesn't work?
No.. tried multiple times
I tried it right now; it works on my side. How large is your zip? (And does it have the ending ".zip"?)
And it is the capture from the plugtest-server?
And it is the capture from the plugtest-server?
No.. it's from the actual server and client where we are seeing the issue
I tried it right now; it works on my side. How large is your zip? (And does it have the ending ".zip"?)
It's 25 MB.. yes, it has the .zip ending
With the plugtest server the issue can't be reproduced? Or do you need some instructions?
Yes. Need some inputs on how to use it.
25 MB, I'm not sure if that helps. At least I would not try to read through it. I would propose that you try to select the coap messages from the client where the issue occurs, and then just the messages with the reported MID. See Wireshark.
But before that, you should check the plugtest-server.
Yes. Need some inputs on how to use it.
Server should start.
Maybe there is one more source of "generated mid [3303] already in use, cannot register Exchange": reusing a Message, I guess here the Response. The first time the Response instance is sent, a MID is assigned to that Response instance. If, later, the instance is "reused" (which is not considered to be valid), then it may register the old MID. At that point in time, it's OK. If now the auto-generated MID gets the same value, then the error is reported.
So, please check that you always create a fresh Response.
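The "reused Response" failure mode described above can be sketched with another simplified toy model (again, not Californium code): a reused instance re-registers its stale MID, and the collision only surfaces later, when the auto-generated counter wraps around to the same value.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: a message caches its MID after the first send. Re-sending the
// same instance re-registers the stale MID; the error appears only later,
// when the 16-bit MID generator produces that same value again.
public class ReusedResponse {
    static final Map<Integer, String> registry = new HashMap<>();
    static int nextMid = 0;

    static class Msg {
        Integer mid;                          // cached once assigned
    }

    static void send(Msg m, String name) {
        if (m.mid == null) {
            m.mid = nextMid++ & 0xFFFF;       // fresh message: auto-generate
        }
        if (registry.putIfAbsent(m.mid, name) != null) {
            throw new IllegalArgumentException(
                    "generated mid [" + m.mid + "] already in use");
        }
    }

    public static void main(String[] args) {
        Msg reused = new Msg();
        send(reused, "first");                // auto-assigned MID 0
        registry.remove(0);                   // that exchange completes and is cleaned up
        send(reused, "again");                // reuse re-registers the stale MID 0
        nextMid = 1 << 16;                    // fast-forward: 16-bit counter wraps to 0
        try {
            send(new Msg(), "fresh");         // auto-generated MID 0 collides
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints: generated mid [0] already in use
        }
    }
}
```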
capture.zip
I had kept the system running overnight, so the capture file size increased. I filtered the capture up to the time the issue occurred. The size has reduced and I am able to attach it. The snapshot image shows the last messages the client received, after which messages stopped coming from the server for the two endpoints. In the capture, port 5401 belongs to the server using the Californium coap library. Port 59549 is the Qt coap client which is facing the issue.
Please check if this capture gives any clue.
Any reason to start with a NON observe request? That is also responded to with a NON response, which has the same MID as the last CON notify. At least a first idea of what may cause the trouble in Californium. Maybe, until that gets analyzed (and fixed, if wrong), you can try to use a CON request to start the observation?
(Additionally, the empty ACKs include tokens, which doesn't make sense. Maybe worth reporting that there.)
I added a unit test for the GET/NON and Notify/CON. I still can't reproduce the issue.
Are you able to reproduce it with the plugtest-server?
I am trying with CON request to start the observation. Will let you know the result.
I am using the Californium CoAP library as a server. The client runs a Qt CoAP client. The data rate is around 10 messages/sec. The client receives messages for almost 1.5 h, after which messages stop coming. Looking at the logs, I'm getting the exception below:
2022-03-21T11:02:28,516 WARN [CoapServer(main)#1] {BaseCoapStack.java:115} - error send response CON-2.05 MID= 3303, Token=10F6A0C91DAF, OptionSet={"Observe":18992, "Content-Format":"text/plain"}, "{"fhr1":-32768,"fhr2":-3".. 116 bytes
java.lang.IllegalArgumentException: generated mid [3303] already in use, cannot register Exchange[R14]
    at org.eclipse.californium.core.network.InMemoryMessageExchangeStore.registerWithMessageId(InMemoryMessageExchangeStore.java:245) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.InMemoryMessageExchangeStore.registerOutboundResponse(InMemoryMessageExchangeStore.java:409) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.UdpMatcher.sendResponse(UdpMatcher.java:157) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.CoapEndpoint$OutboxImpl.sendResponse(CoapEndpoint.java:1088) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BaseCoapStack$StackBottomAdapter.sendResponse(BaseCoapStack.java:236) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.ReliabilityLayer.sendResponse(ReliabilityLayer.java:153) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BlockwiseLayer.sendResponse(BlockwiseLayer.java:624) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.ObserveLayer.sendResponse(ObserveLayer.java:123) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.AbstractLayer.sendResponse(AbstractLayer.java:74) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.ExchangeCleanupLayer.sendResponse(ExchangeCleanupLayer.java:83) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BaseCoapStack$StackTopAdapter.sendResponse(BaseCoapStack.java:194) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BaseCoapStack.sendResponse(BaseCoapStack.java:110) [californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.CoapEndpoint$9.run(CoapEndpoint.java:912) [californium-core-2.6.3.jar:?]
    at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:289) [element-connector-2.6.3.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_242-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_242-internal]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_242-internal]
What could be the issue?