eclipse-californium / californium

CoAP/DTLS Java Implementation
https://www.eclipse.org/californium/

IllegalStateException: automatic message IDs exhausted #2167

Closed · haegThe closed this 10 months ago

haegThe commented 1 year ago

I also encountered this problem while doing performance testing. When the number of clients is above 10, after sending about 60,000 messages, a client throws an exception: java.lang.IllegalStateException: automatic message IDs exhausted.

In Californium 3.8.0, the default value of MAX_ACTIVE_PEERS is 150,000.

I have tried many of the configurations mentioned in issue #1051 and elsewhere. Please keep an eye on this issue, thanks!

- californium version: 3.8.0

- server: Californium3.properties has no custom configuration; the server just receives messages and sends responses.

// Server: resource tree /sys/properties/report; answers every POST.
CoapServer server = new CoapServer(5683);
server.start();

CoapResource path_sys = new CoapResource("sys");
server.add(path_sys.add(new CoapProcess("properties")
        .add(new CoapProcess("report"))));

public class CoapProcess extends CoapResource {

    public CoapProcess(String name) {
        super(name);
    }

    @Override
    public void handlePOST(CoapExchange exchange) {
        // Respond 2.01 CREATED with content format 50 (application/json).
        exchange.respond(CoAP.ResponseCode.CREATED, "read property message error with CoAP gateway!", 50);
    }
}

- client: there are 1000 clients in total, and each client sends 1 message per second.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.eclipse.californium.core.CoapClient;
import org.eclipse.californium.core.CoapResponse;

public class LocalCoapTest extends Thread {

    private static final ScheduledExecutorService messageStatisticsScheduler =
            Executors.newSingleThreadScheduledExecutor();
    private static final AtomicLong count = new AtomicLong();
    // Placeholder values: the original report omits these declarations.
    private static final String gatewayPropertyUri = "coap://localhost:5683/sys/properties/report";
    private static final String data_gateway_property = "placeholder-payload";

    public static void main(String[] args) {
        // Print the number of sent messages every 10 seconds.
        messageStatisticsScheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                System.out.println("send messages count: " + count.get());
            }
        }, 10, 10, TimeUnit.SECONDS);

        for (int i = 0; i < 1000; i++) {
            CoapClient client = new CoapClient();
            LocalCoapTest localCoapTest = new LocalCoapTest(client);
            localCoapTest.start();
        }
    }

    private final CoapClient client;

    public LocalCoapTest(CoapClient client) {
        this.client = client;
        this.client.setURI(gatewayPropertyUri);
    }

    @Override
    public void run() {
        // Each client sends one message per second.
        while (true) {
            sendGatewayProperty();
            count.incrementAndGet();
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
        }
    }

    private void sendGatewayProperty() {
        try {
            // POST with confirmable messages (CONs), content format 50.
            CoapResponse response = client.useCONs().post(data_gateway_property, 50);
            if (response == null) {
                System.out.println("response is null!");
            }
        } catch (Exception e) {
            System.out.println(e.getMessage());
            System.out.println(count.get());
            // Thread.stop() is deprecated and unsafe; end this thread by rethrowing.
            throw new RuntimeException(e);
        }
    }
}
boaks commented 1 year ago

CoapClient client = new CoapClient();

Unfortunately, these clients are sharing their EffectiveEndpoint. On the CoAP layer, that is then only one client.

To fix this, you may add separate Endpoints:

CoapClient client = new CoapClient();
client.setEndpoint(new CoapEndpoint.Builder().build());
LocalCoapTest localCoapTest = new LocalCoapTest(client);
localCoapTest.start();

In case you add a

client.shutdown();

please also destroy these separate Endpoints:

client.getEndpoint().destroy();
client.shutdown();
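
Putting both together, a minimal sketch of the corrected setup (the clients list and the teardown loop are illustrative additions, not from the original report):

List<CoapClient> clients = new ArrayList<>();
for (int i = 0; i < 1000; i++) {
    CoapClient client = new CoapClient();
    // A separate endpoint gives each client its own UDP port and thus its
    // own 16-bit MID space, instead of 1000 senders sharing the single
    // default endpoint and exhausting its message IDs.
    client.setEndpoint(new CoapEndpoint.Builder().build());
    clients.add(client);
    new LocalCoapTest(client).start();
}

// On teardown: destroy each separate endpoint, then shut down its client.
for (CoapClient client : clients) {
    client.getEndpoint().destroy();
    client.shutdown();
}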
haegThe commented 1 year ago

It works! Thank you!

But I found another serious problem: the server is constantly doing full GC!

When I stopped the clients, the JVM old generation returned to normal after a few minutes.

The server does not have any business logic. Server JVM: -Xmx1500m

root@master-1: jstat -gcutil 18303 1000
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT   
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    14   17.702   19.435
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    15   18.868   20.601
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    16   20.006   21.739
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    16   20.006   21.739
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    17   21.145   22.879
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    18   22.227   23.961
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    19   23.309   25.043
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    20   24.417   26.151
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    21   25.542   27.275
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    22   26.691   28.424
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    23   27.805   29.538
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    24   28.862   30.596
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    25   30.001   31.734
  0.00   0.00  99.99  99.99  97.31  92.90     18    1.733    25   31.087   32.820
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    26   31.087   32.820
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    27   32.187   33.920
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    28   33.254   34.987
  0.00   0.00 100.00  99.99  97.31  92.90     18    1.733    29   34.364   36.097
root@master-1:/opt/iotda/holli-iot-transport-1.6.0# jmap -histo:live 18303 | head -30

 num     #instances         #bytes  class name
----------------------------------------------
   1:       1764039      262568256  [B
   2:       2480236      144676760  [C
   3:        705378       73359312  org.eclipse.californium.core.coap.OptionSet
   4:       2480209       59525016  java.lang.String
   5:       2468893       59253432  java.util.LinkedList$Node
   6:        705747       56469880  [Ljava.util.HashMap$Node;
   7:        352698       56431680  org.eclipse.californium.core.network.Exchange
   8:       1410853       45147296  java.util.concurrent.locks.ReentrantLock$NonfairSync
   9:        352698       45145344  org.eclipse.californium.core.coap.Request
  10:        352680       42321600  org.eclipse.californium.core.coap.Response
  11:        705642       33870816  java.util.HashMap
  12:       1763480       28215680  java.util.concurrent.atomic.AtomicReference
  13:       1058105       25394520  java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject
  14:        706828       22618496  java.util.HashMap$Node
  15:        705442       22574144  java.util.Collections$UnmodifiableMap
  16:        705440       22574080  org.eclipse.californium.elements.UdpEndpointContext
  17:       1410853       22573648  java.util.concurrent.locks.ReentrantLock
  18:        705402       22572864  java.util.LinkedList
  19:        352701       16929648  java.util.concurrent.LinkedBlockingQueue
  20:        352698       16929504  org.eclipse.californium.elements.util.SerialExecutor
  21:       1058078       16929248  java.util.concurrent.atomic.AtomicBoolean
  22:        356578       11410496  java.util.concurrent.ConcurrentHashMap$Node
  23:        705446       11287136  java.util.HashMap$KeySet
  24:        705441       11287056  java.util.Collections$UnmodifiableSet
  25:        352699       11286368  org.eclipse.californium.core.coap.Token
  26:        354774        8575032  [Ljava.lang.Object;
  27:        352762        8466288  java.net.InetSocketAddress$InetSocketAddressHolder
boaks commented 1 year ago

It works! Thank you!

You're welcome.

But I found another serious problem: the server is constantly doing full GC! When I stopped the clients, the JVM old generation returned to normal after a few minutes.

You found something which has been discussed for years ;-). CoAP defines "deduplication" for CON messages and therefore, per the standard, keeps requests for about 247 s (EXCHANGE_LIFETIME). That definition is based on a couple of assumptions about the network and processing power and doesn't always fit. Though the discussion spans a long time, Californium has implemented several means to reduce that heap footprint:

  • message offloading, which removes the parts of the messages that are not needed for deduplication
  • a smaller EXCHANGE_LIFETIME (configure that in "Californium3.properties")
  • a different deduplication algorithm (see "Californium3.properties"):

# Deduplicator algorithm.
# [MARK_AND_SWEEP, PEERS_MARK_AND_SWEEP, CROP_ROTATION, NO_DEDUPLICATOR].
# Default: MARK_AND_SWEEP
COAP.DEDUPLICATOR=PEERS_MARK_AND_SWEEP
...
# Maximum messages kept per peer for PEERS_MARK_AND_SWEEP.
# Default: 64
COAP.PEERS_MARK_AND_SWEEP_MESSAGES=16
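
For reference, a minimal sketch of applying the same settings programmatically, assuming the Californium 3 Configuration/CoapConfig definitions:

// Sketch: set the deduplicator in code instead of editing
// Californium3.properties (assumed Californium 3.x API).
Configuration config = Configuration.getStandard();
config.set(CoapConfig.DEDUPLICATOR, CoapConfig.DEDUPLICATOR_PEERS_MARK_AND_SWEEP);
config.set(CoapConfig.PEERS_MARK_AND_SWEEP_MESSAGES, 16);
// Optionally shorten the deduplication window (standard: 247 s).
config.set(CoapConfig.EXCHANGE_LIFETIME, 30, TimeUnit.SECONDS);
CoapServer server = new CoapServer(config, 5683);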

Server JVM: -Xmx1500m

For Java, a server load test with 1500m doesn't make much sense. Java speeds up with more heap; I use at least 10 GB for servers. With 16 GB RAM and a 4x3.5 GHz CPU, processing 70,000 requests/s including encryption (CoAP/DTLS 1.2 CID) works pretty stably.

To help you further, you may start by telling me what you are really doing.

haegThe commented 1 year ago

Thank you very much for your prompt reply! My problem is solved!

haegThe commented 1 year ago

COAP.DEDUPLICATOR=PEERS_MARK_AND_SWEEP
COAP.PEERS_MARK_AND_SWEEP_MESSAGES=16

I used the above configuration for a performance test: the Linux VM has 16 cores, the JVM heap is 6 GB, and each message is 700+ bytes. TPS is only about 30,000/s and not stable. What should I do? Use a custom thread pool?

Please point me to the problem, thank you!

haegThe commented 1 year ago

coap.peers.mark.and.sweep.messages = 16
coap.deduplicator = PEERS_MARK_AND_SWEEP

Using the above configuration in Docker with 6 cores/6 GB and messages of about 700+ bytes, TPS is about 15,000/s, and it goes OOM after a few minutes (screenshot attached).

boaks commented 1 year ago

Californium doesn't throttle traffic based on the available heap; that's not implemented. The experience of many years is that the ratio between cores and RAM must provide enough RAM. My setup is 4 cores and 12 GB RAM.

In your case, you may additionally use:

USE_MESSAGE_OFFLOADING=true
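
Programmatically, assuming a matching Californium 3 CoapConfig definition exists under that name, a sketch would be:

// Sketch (assumed constant name): enable message off-loading, which strips
// the parts of a completed exchange that are only kept for deduplication.
config.set(CoapConfig.USE_MESSAGE_OFFLOADING, true);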
boaks commented 1 year ago

Any news on this?

As long as CoAP needs to keep track of the MIDs and the related responses, the number of messages processed per unit of time is limited by the available heap. Californium is in many cases not used on its own, so the heap constraints of the other code must also be obeyed. That was the point where it became clear that Californium will not be able to sort out such OOMs on its own. The easiest and most robust approach was to test the ratio of CPU vs. RAM and use more RAM. The other topic is then a Java topic: which JRE is used to run it and which GC variants are available for it. My own usage is mainly with Java 17 and G1.

boaks commented 10 months ago

Don't hesitate to comment on this issue if you have news.