eclipse-californium / californium

CoAP/DTLS Java Implementation
https://www.eclipse.org/californium/

generated mid [3303] already in use, cannot register Exchange[R14] #1964

Closed rahulagarwal02 closed 2 years ago

rahulagarwal02 commented 2 years ago

I am using the Californium CoAP library as a server. The client runs a Qt CoAP client. The data rate is around 10 messages/sec. The client receives messages for almost 1.5 hours, after which messages stop coming. Looking at the logs, I see the exception below:

2022-03-21T11:02:28,516 WARN [CoapServer(main)#1] {BaseCoapStack.java:115} - error send response CON-2.05 MID= 3303, Token=10F6A0C91DAF, OptionSet={"Observe":18992, "Content-Format":"text/plain"}, "{"fhr1":-32768,"fhr2":-3".. 116 bytes
java.lang.IllegalArgumentException: generated mid [3303] already in use, cannot register Exchange[R14]
    at org.eclipse.californium.core.network.InMemoryMessageExchangeStore.registerWithMessageId(InMemoryMessageExchangeStore.java:245) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.InMemoryMessageExchangeStore.registerOutboundResponse(InMemoryMessageExchangeStore.java:409) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.UdpMatcher.sendResponse(UdpMatcher.java:157) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.CoapEndpoint$OutboxImpl.sendResponse(CoapEndpoint.java:1088) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BaseCoapStack$StackBottomAdapter.sendResponse(BaseCoapStack.java:236) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.ReliabilityLayer.sendResponse(ReliabilityLayer.java:153) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BlockwiseLayer.sendResponse(BlockwiseLayer.java:624) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.ObserveLayer.sendResponse(ObserveLayer.java:123) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.AbstractLayer.sendResponse(AbstractLayer.java:74) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.ExchangeCleanupLayer.sendResponse(ExchangeCleanupLayer.java:83) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BaseCoapStack$StackTopAdapter.sendResponse(BaseCoapStack.java:194) ~[californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.stack.BaseCoapStack.sendResponse(BaseCoapStack.java:110) [californium-core-2.6.3.jar:?]
    at org.eclipse.californium.core.network.CoapEndpoint$9.run(CoapEndpoint.java:912) [californium-core-2.6.3.jar:?]
    at org.eclipse.californium.elements.util.SerialExecutor$1.run(SerialExecutor.java:289) [element-connector-2.6.3.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:1.8.0_242-internal]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:1.8.0_242-internal]
    at java.lang.Thread.run(Unknown Source) [?:1.8.0_242-internal]

What could be the issue?

boaks commented 2 years ago

That indicates that an outgoing message with MID 3303 is already "pending".

That either indicates that the exchange cleanup leaks (I'm not aware of such bugs), or that "too many messages" are sent, which is, however, not indicated by the numbers you provided.

The MID is a 16-bit value, so usually about 65000 messages can be pending. With the default EXCHANGE_LIFETIME of 247s, that means "65000 in 247s".

If I use your values, 1.5 h at 10 msgs/s, that results in 54000 messages in total. Did you adapt the value of EXCHANGE_LIFETIME? Could you please provide the "Californium.properties" you use?
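
For a rough check of those numbers (a sketch, not from the thread): at 10 msgs/s, only about 2470 MIDs are in flight during one EXCHANGE_LIFETIME of 247 s, far below the 65536-value MID space. A hedged sketch of how the value could be set programmatically with the 2.x NetworkConfig API (the numbers are this thread's; the setup itself is hypothetical):

```java
import org.eclipse.californium.core.network.CoapEndpoint;
import org.eclipse.californium.core.network.config.NetworkConfig;

public class MidPressureCheck {
    public static void main(String[] args) {
        // MID space is 16 bit => 65536 values; Californium keeps a MID
        // "pending" for EXCHANGE_LIFETIME after sending.
        int rate = 10;              // msgs/s, as reported in this issue
        int exchangeLifetime = 247; // s, Californium default
        System.out.println("MIDs in flight: " + rate * exchangeLifetime); // 2470 << 65536

        // Setting EXCHANGE_LIFETIME explicitly (milliseconds in 2.x):
        NetworkConfig config = NetworkConfig.createStandardWithoutFile();
        config.setLong(NetworkConfig.Keys.EXCHANGE_LIFETIME, 247_000L);
        CoapEndpoint endpoint = new CoapEndpoint.Builder()
                .setNetworkConfig(config)
                .build();
    }
}
```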

boaks commented 2 years ago

And consider migrating to 3.4.0. :-)

rahulagarwal02 commented 2 years ago

That indicates that an outgoing message with MID 3303 is already "pending".

That either indicates that the exchange cleanup leaks (I'm not aware of such bugs), or that "too many messages" are sent, which is, however, not indicated by the numbers you provided.

The MID is a 16-bit value, so usually about 65000 messages can be pending. With the default EXCHANGE_LIFETIME of 247s, that means "65000 in 247s".

If I use your values, 1.5 h at 10 msgs/s, that results in 54000 messages in total. Did you adapt the value of EXCHANGE_LIFETIME? Could you please provide the "Californium.properties" you use?

Thanks for the response. How does the exchange cleanup happen? I am using the Qt CoAP client. Does anything need to be done on the client side for cleanup? Attaching the Californium.properties file: Californium.properties.txt

boaks commented 2 years ago

How does the exchange cleanup happen?

The implementation expects that InMemoryMessageExchangeStore.remove(final KeyMID messageId, final Exchange exchange) is called. On the success path, that's done when receiving the ACK for the CON notify. If that ACK is missing, the server is intended to cancel the observation and, with that, remove the exchange.

Do you always use CON notifies? Does your client re-register observes?

I'm not sure what happens. Is it possible for you to provide a capture? Only for the observe registration and a couple of notifies and ACKs, e.g. about 50 messages or so.
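
For context, a hedged sketch of the server-side setup under discussion: in Californium, a resource sends its notifies as CON when configured like this (resource name and payload are made up, loosely echoing the log above):

```java
import org.eclipse.californium.core.CoapResource;
import org.eclipse.californium.core.coap.CoAP;
import org.eclipse.californium.core.server.resources.CoapExchange;

public class ConNotifyResource extends CoapResource {
    public ConNotifyResource() {
        super("fhr");                  // hypothetical resource name
        setObservable(true);           // accept observe registrations
        setObserveType(CoAP.Type.CON); // notifies are sent as CON, each needing an ACK
    }

    @Override
    public void handleGET(CoapExchange exchange) {
        // a fresh response is created for every request and notify
        exchange.respond("{\"fhr1\":-32768}");
    }
    // calling changed() on this resource triggers a notify to all observers;
    // the exchange is cleaned up when the notify's ACK arrives
}
```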

rahulagarwal02 commented 2 years ago

How does the exchange cleanup happen?

The implementation expects that InMemoryMessageExchangeStore.remove(final KeyMID messageId, final Exchange exchange) is called. On the success path, that's done when receiving the ACK for the CON notify. If that ACK is missing, the server is intended to cancel the observation and, with that, remove the exchange.

Do you always use CON notifies? Does your client re-register observes?

I'm not sure what happens. Is it possible for you to provide a capture? Only for the observe registration and a couple of notifies and ACKs, e.g. about 50 messages or so.

Yes, only using CON notifies. When the issue happens, the client detects it and re-registers the observer. Then messages start coming to the client again, but stop after around the same time (1.5 hr). I will try to get the capture. Is there any log I can check to confirm whether the ACK is missing?

boaks commented 2 years ago

RFC 7641 - 4.5.2 Advanced Transmission

For this purpose, the server MAY optimize the transmission process by aborting the transmission of the old notification (but not before the current transmission attempt is completed) and starting a new transmission for the new notification (but with the retransmission timer and counter of the aborted transmission retained).

and

ACK_TIMEOUT=10000

With that, a missing ACK would also cause a quiet phase of 10 s, and you will probably lose about 100 notifies.
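
The estimate follows directly from the configured timeout and the reported data rate (a worked sketch using this thread's numbers):

```
ACK_TIMEOUT      = 10000 ms   (from the attached Californium.properties)
quiet phase      ~ 10 s per missing ACK (plus retransmission back-off)
notifies at 10/s => 10 s x 10/s ~ 100 notifies affected
```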

boaks commented 2 years ago

Without the capture, it will be very hard to reproduce this. At least, without a capture, you will have to be very patient until I have time to search for it.

boaks commented 2 years ago

Just to mention:

Besides the MID/exchange housekeeping, a second source may be a call of Message.setMID(int mid) for some messages sent to that peer. You may check that.
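
A hedged illustration of the pattern being asked about (nothing in this thread shows the reporter's code; this is only what such a call would look like):

```java
import org.eclipse.californium.core.coap.CoAP.ResponseCode;
import org.eclipse.californium.core.coap.Response;

public class MidAntiPattern {
    public static void main(String[] args) {
        Response response = new Response(ResponseCode.CONTENT);
        response.setPayload("data");
        // Anti-pattern: assigning the MID manually competes with the stack's
        // MID provider and can later collide with an auto-generated MID.
        response.setMID(3303);
        // Correct: leave the MID unset; the matcher assigns one on send.
    }
}
```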

boaks commented 2 years ago

I tried to reproduce it with a setup similar to yours. Therefore I reduced the timer interval in PlugtestServer Observe from 5000 to 100. Using cf-browser-3.4.0 as the observing client didn't show the behavior you described. If possible, could you also test your client against the modified plugtest-server?

rahulagarwal02 commented 2 years ago

Just to mention:

Besides the MID/exchange housekeeping, a second source may be a call of Message.setMID(int mid) for some messages sent to that peer. You may check that.

Checked that. We are not calling setMID.

rahulagarwal02 commented 2 years ago

RFC 7641 - 4.5.2 Advanced Transmission

For this purpose, the server MAY optimize the transmission process by aborting the transmission of the old notification (but not before the current transmission attempt is completed) and starting a new transmission for the new notification (but with the retransmission timer and counter of the aborted transmission retained).

and

ACK_TIMEOUT=10000

With that, a missing ACK would also cause a quiet phase of 10 s, and you will probably lose about 100 notifies.

Once the exception I mentioned in the query occurs, notifies permanently stop coming until we re-register the observer.

boaks commented 2 years ago

Once the exception I mentioned in the query occurs, notifies permanently stop coming until we re-register the observer.

Really strange ... I still have no guess what could have caused that ... If possible, could you try to retest with your client and the modified plugtest-server as mentioned above? Maybe with 3.4.0 first. If it doesn't appear there, then maybe with 2.6.3 (but I would recommend updating to 3.4.0).

rahulagarwal02 commented 2 years ago

Once the exception I mentioned in the query occurs, notifies permanently stop coming until we re-register the observer.

Really strange ... I still have no guess what could have caused that ... If possible, could you try to retest with your client and the modified plugtest-server as mentioned above? Maybe with 3.4.0 first. If it doesn't appear there, then maybe with 2.6.3 (but I would recommend updating to 3.4.0).

Sure, trying both things. Will let you know the result. Thanks.

boaks commented 2 years ago

Checked that. We are not calling setMID.

And you also don't reuse some "Response" instances?

Anyway, I will add some diagnostics to 3.5.0 so that, in the future, it gets easier to see whether the MID is also provided outside of the MID provider.

boaks commented 2 years ago

A first trace: I now used a smaller interval (25ms), below my RTT (30ms). That shows some unintended delays.

Edit: sorry, I just mixed up the destination; still no clue what causes that error.

rahulagarwal02 commented 2 years ago

Additional info: We have 4 clients which register an observer with the server for the same endpoint. Of these, 3 clients use the Californium CoAP library; one client uses the Qt CoAP library. We are not facing any issues with the Californium clients. This issue occurs only with the client using the Qt CoAP client.

boaks commented 2 years ago

We are not facing any issues with the Californium clients. This issue occurs only with the client using the Qt CoAP client.

That makes it hard for me to test. Is the Qt CoAP client your own development? Or is it publicly available?

I assume that even if the client does something wrong, it triggers a bug on the server side. If that bug could be located, it would be possible to fix it, at least for 3.5.0.

rahulagarwal02 commented 2 years ago

We are not facing any issues with the Californium clients. This issue occurs only with the client using the Qt CoAP client.

That makes it hard for me to test. Is the Qt CoAP client your own development? Or is it publicly available?

I assume that even if the client does something wrong, it triggers a bug on the server side. If that bug could be located, it would be possible to fix it, at least for 3.5.0.

We are using a commercial license, but it's also available under the GNU General Public License. https://doc-snapshots.qt.io/qtcoap/index.html

boaks commented 2 years ago

Thanks. Do you know if that test could be executed with the example client?

rahulagarwal02 commented 2 years ago

Thanks. Do you know if that test could be executed with the example client?

Yes, it should be possible. We are using the example client as a reference for development. We need to create an observer by providing the URL.

boaks commented 2 years ago

Are there binaries ready for download? I didn't find any. (I don't want to set up the toolchain and build the examples on my own, sorry ...)

boaks commented 2 years ago

Just as an additional idea:

DEDUPLICATOR_AUTO_REPLACE=true

Retest with this set to false.
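
The retest boils down to a single line in Californium.properties. A hedged note on the semantics (my reading, not stated in this thread): when set to true, the deduplicator may replace an existing entry for a reused MID instead of treating the message as a duplicate.

```
# Californium.properties excerpt for the suggested retest
DEDUPLICATOR_AUTO_REPLACE=false
```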

boaks commented 2 years ago

Qt => 11 GB ... nightmare ... and the installation finally "fails on my machine".

rahulagarwal02 commented 2 years ago

Qt => 11 GB ... nightmare ... and the installation finally "fails on my machine".

Yes, it's tedious work :)

rahulagarwal02 commented 2 years ago

Just as an additional idea:

DEDUPLICATOR_AUTO_REPLACE=true

Retest with this set to false.

OK, will try it.

boaks commented 2 years ago

(I wasn't able to build the client, so it will take some time until I try that again.)

Any update from your side? Does the issue occur using the modified plugtest-server?

rahulagarwal02 commented 2 years ago

(I wasn't able to build the client, so it will take some time until I try that again.)

Any update from your side? Does the issue occur using the modified plugtest-server?

We upgraded to version 3.4.0 and also tried with DEDUPLICATOR_AUTO_REPLACE set to false. The issue still occurred. Yet to try with the modified plugtest-server. Also trying to get the capture.

boaks commented 2 years ago

Any updates? Does "generated mid [3303] already in use, cannot register Exchange[R14]" occur with the plugtest-server?

rahulagarwal02 commented 2 years ago

Any updates? Does "generated mid [3303] already in use, cannot register Exchange[R14]" occur with the plugtest-server?

Got some more clues on when this issue occurs: the client registers 4 observers on different endpoints. Of these 4, the server is not sending any data on 2, since those are based on some events. Of the remaining 2, on one endpoint data is sent every ~100 ms; on the other it's sent every ~250 ms.

This issue occurs when there are more than 3 endpoints registered. When we remove one endpoint registration (where no data transfer is happening), the issue doesn't occur.
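
A minimal Californium sketch approximating that scenario, in case it helps with reproduction (resource names, intervals, and the plumbing are assumptions based on this comment, not the reporter's code):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.eclipse.californium.core.CoapResource;
import org.eclipse.californium.core.CoapServer;
import org.eclipse.californium.core.server.resources.CoapExchange;

public class MultiObserveServer {

    static class PeriodicResource extends CoapResource {
        PeriodicResource(String name) {
            super(name);
            setObservable(true);
        }

        @Override
        public void handleGET(CoapExchange exchange) {
            exchange.respond("payload of " + getName());
        }
    }

    public static void main(String[] args) {
        PeriodicResource fast = new PeriodicResource("fast");   // ~100 ms data
        PeriodicResource slow = new PeriodicResource("slow");   // ~250 ms data
        PeriodicResource idle1 = new PeriodicResource("idle1"); // event-based, silent
        PeriodicResource idle2 = new PeriodicResource("idle2"); // event-based, silent

        CoapServer server = new CoapServer();
        server.add(fast, slow, idle1, idle2);
        server.start();

        // changed() triggers a notify to all observers of the resource
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(fast::changed, 100, 100, TimeUnit.MILLISECONDS);
        timer.scheduleAtFixedRate(slow::changed, 250, 250, TimeUnit.MILLISECONDS);
    }
}
```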

boaks commented 2 years ago

Just to make sure: it's still the "generated mid [????] already in use, cannot register Exchange[???]"? Yes?

And the issue occurs with the plugtest-server?

Are you using plain coap? Or coap over dtls?

boaks commented 2 years ago

I added the information about the other peer's identity to the diagnostic message.

Please update to the current master.

And also please report whether this occurs only with your server, or with the plugtest-server as well. If it doesn't occur with the plugtest-server, and your server doesn't set the MID, then maybe you use a specific CoapEndpoint setup. Therefore, please also report whether you use plain coap or coap over dtls.

rahulagarwal02 commented 2 years ago

I added the information about the other peer's identity to the diagnostic message.

Please update to the current master.

And also please report whether this occurs only with your server, or with the plugtest-server as well. If it doesn't occur with the plugtest-server, and your server doesn't set the MID, then maybe you use a specific CoapEndpoint setup. Therefore, please also report whether you use plain coap or coap over dtls.

We are using plain coap, for communication between services running on the same device. All the communication is on localhost. Will let you know the result with the plugtest-server soon.

boaks commented 2 years ago

Will let you know the result with the plugtest-server soon.

Any update on the results?

rahulagarwal02 commented 2 years ago

Will let you know the result with the plugtest-server soon.

Any update on the results?

Sorry for the late response. Can you guide me on how to use the plugtest-server, or is there any document I can refer to? To match the actual scenario, the server needs to create a few endpoints. There will be multiple clients registering observers on these endpoints. From the server, we need to send data periodically, every ~100 ms.

rahulagarwal02 commented 2 years ago

I have got the capture but am unable to attach the file. Even a .zip file is not getting attached. Is there any other way to share it?

boaks commented 2 years ago

"drag and drop" just into the comment field doesn't work?

rahulagarwal02 commented 2 years ago

"drag and drop" just into the comment field doesn't work?

No.. tried multiple times

boaks commented 2 years ago

I tried it just now; it works on my side. How large is your zip? (And does it have the ending ".zip"?)

boaks commented 2 years ago

And is it the capture from the plugtest-server?

rahulagarwal02 commented 2 years ago

And is it the capture from the plugtest-server?

No, it's from the actual server and client where we are seeing the issue.

rahulagarwal02 commented 2 years ago

I tried it just now; it works on my side. How large is your zip? (And does it have the ending ".zip"?)

It's 25 MB. Yes, it has the .zip ending.

boaks commented 2 years ago

With the plugtest server the issue can't be reproduced? Or do you need some instructions?

rahulagarwal02 commented 2 years ago

With the plugtest server the issue can't be reproduced? Or do you need some instructions?

Yes, I need some input on how to use it.

boaks commented 2 years ago

25 MB, I'm not sure if that helps. At least I would not try to read it through. I would propose that you select the coap messages from the client where the issue occurs, and then just the messages with the reported MID. See Wireshark.

But before that, you should check the plugtest-server.
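
For the filtering, hypothetical Wireshark display filters along these lines should help (the MID and client port are the ones reported in this thread):

```
coap.mid == 3303
udp.port == 59549 && coap
```

The first shows only the messages with the reported MID; the second shows all CoAP traffic of the affected client.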

boaks commented 2 years ago

Yes, I need some input on how to use it.

  1. Please adapt the timer interval of the observe in the plugtest-server from 5000 ms to 100 ms, changing it locally (demo-apps/cf-plugtest-server). PlugtestServer Observe.
  2. Run mvn clean install in the parent folder.
  3. cd to demo-apps/cf-plugtest-server.
  4. Run java -jar target/cf-plugtest-server-3.5.0-SNAPSHOT.jar.

The server should start (the shell sketch after this list consolidates the steps).

  1. Use your client and start the observer for "coap://<host>/obs".
  2. Check that you receive notifies about every 100 ms.
  3. Wait until the issue occurs (or not).
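
Consolidated (a sketch; the paths and SNAPSHOT version are the ones from this thread and must match your checkout):

```sh
# in the root of the californium checkout, after changing the observe timer to 100 ms
mvn clean install
cd demo-apps/cf-plugtest-server
java -jar target/cf-plugtest-server-3.5.0-SNAPSHOT.jar
```
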
boaks commented 2 years ago

Maybe there is one more source of "generated mid [3303] already in use, cannot register Exchange": reusing a Message, I guess here the Response. The first time the Response instance is sent, a MID is assigned to that instance. If the instance is later "reused" (which is not considered valid), it may register the old MID. At that point in time, it's OK. But if the auto-generated MID later gets the same value, then the error is reported. So please check that you always create a fresh Response.
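
A hedged sketch of the difference (the broken variant would cache one Response across notifies; the intended use creates one per exchange):

```java
import org.eclipse.californium.core.CoapResource;
import org.eclipse.californium.core.coap.CoAP.ResponseCode;
import org.eclipse.californium.core.coap.Response;
import org.eclipse.californium.core.server.resources.CoapExchange;

public class FreshResponseResource extends CoapResource {

    // BROKEN (sketch): a cached instance keeps the MID of its first transmission
    // private final Response cached = new Response(ResponseCode.CONTENT);

    public FreshResponseResource() {
        super("obs"); // hypothetical resource name
        setObservable(true);
    }

    @Override
    public void handleGET(CoapExchange exchange) {
        // OK: a fresh Response per request/notify, MID assigned by the stack
        Response response = new Response(ResponseCode.CONTENT);
        response.setPayload("payload");
        exchange.respond(response);
    }
}
```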

rahulagarwal02 commented 2 years ago

25 MB, I'm not sure if that helps. At least I would not try to read it through. I would propose that you select the coap messages from the client where the issue occurs, and then just the messages with the reported MID. See Wireshark.

But before that, you should check the plugtest-server.

capture.zip
Last messages (snapshot)

I had kept the system running overnight, so the capture file size increased. I filtered the capture up to the time the issue occurred; the size is reduced and I am able to attach it. The snapshot image shows the last messages the client received, after which messages stopped coming from the server for the two endpoints. In the capture, port 5401 belongs to the server using the Californium CoAP library; port 59549 is the Qt CoAP client which is facing the issue.

Please check if this capture gives any clue.

boaks commented 2 years ago

Any reason to start with a NON observe request? That is also responded to with a NON response, which has the same MID as the last CON notify. At least a first idea of what may cause the trouble in Californium. Maybe, until that gets analyzed (and fixed, if wrong), you can try to use a CON request to start the observation?

(Additionally, the empty ACKs include tokens, which doesn't make sense. Maybe worth reporting that there.)
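
For comparison only (the affected client is Qt, so this is illustrative, with a made-up URI): with a Californium client, starting the observe over CON would look like this sketch:

```java
import org.eclipse.californium.core.CoapClient;
import org.eclipse.californium.core.CoapHandler;
import org.eclipse.californium.core.CoapObserveRelation;
import org.eclipse.californium.core.CoapResponse;

public class ConObserveClient {
    public static void main(String[] args) {
        CoapClient client = new CoapClient("coap://localhost:5683/obs"); // hypothetical URI
        client.useCONs(); // the observe GET is sent as CON instead of NON
        CoapObserveRelation relation = client.observe(new CoapHandler() {
            @Override
            public void onLoad(CoapResponse response) {
                System.out.println("notify: " + response.getResponseText());
            }

            @Override
            public void onError() {
                System.err.println("observe failed");
            }
        });
        // relation.proactiveCancel() would end the observation cleanly
    }
}
```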

boaks commented 2 years ago

I added a unit test for the GET/NON and Notify/CON. I still can't reproduce the issue.

Are you able to reproduce it with the plugtest-server?

rahulagarwal02 commented 2 years ago

Any reason to start with a NON observe request? That is also responded to with a NON response, which has the same MID as the last CON notify. At least a first idea of what may cause the trouble in Californium. Maybe, until that gets analyzed (and fixed, if wrong), you can try to use a CON request to start the observation?

(Additionally, the empty ACKs include tokens, which doesn't make sense. Maybe worth reporting that there.)

I am trying with a CON request to start the observation. Will let you know the result.