eclipse / paho.mqtt.java

Eclipse Paho Java MQTT client library. Paho is an Eclipse IoT project.
https://eclipse.org/paho
Other
2.13k stars 886 forks source link

Significant increase of CPU usage after 1.2.0 to 1.2.1 update #754

Open FrVaBe opened 4 years ago

FrVaBe commented 4 years ago

Please fill out the form below before submitting, thank you!

This bug did not exist in Release Version 1.2.0.


I observe a significant increase of the CPU usage after upgrading the paho driver from version 1.2.0 to 1.2.1.

Initial Situation

The observations are made in a system with Java applications that publish and subscribe messages with a rate of about 1200 msg/sec. The application setup looks like this:

The applications are build with Java 11 (which is also used as runtime). The 'G1' Garbage Collector is used without any special configuration.

The hardware used to run all application instances is

The MQTT Broker (HiveMQ 3.3.4) and MongoDB (3.4.24) are installed on different machines.

Using paho 1.2.1 (or 1.2.2)

If the applications use version 1.2.1 (or 1.2.2) of the paho Java driver an extreme utilization of the system ressource (especially CPU) can be observed. The following screenshot of the top command illustrates the load:

grafik

The load average is about 45!

Maybe this is becaus of heavy Garbage Collecting. The GC-log of one example process is like this:

[2020-02-20T09:15:00.590+0000] GC(1348) Pause Full (System.gc()) 134M->16M(256M) 569.747ms
[2020-02-20T09:15:03.123+0000] GC(1349) Pause Young (Normal) (G1 Evacuation Pause) 169M->16M(256M) 6.368ms
[2020-02-20T09:15:06.221+0000] GC(1350) Pause Young (Normal) (G1 Evacuation Pause) 168M->16M(256M) 5.140ms
[2020-02-20T09:15:08.918+0000] GC(1351) Pause Young (Normal) (G1 Evacuation Pause) 168M->16M(256M) 7.980ms
[2020-02-20T09:15:11.823+0000] GC(1352) Pause Young (Normal) (G1 Evacuation Pause) 168M->16M(256M) 2.535ms
[2020-02-20T09:15:14.998+0000] GC(1353) Pause Young (Normal) (G1 Evacuation Pause) 168M->16M(256M) 3.370ms

Using paho 1.2.0

The average load of the system when using the paho client 1.2.0 in all applications is much nicer:

grafik

The average load is about 1.3 and also Garbage collecting does not take that long:

[2020-02-20T10:45:00.545+0000] GC(1614) Pause Full (System.gc()) 43M->17M(256M) 537.494ms
[2020-02-20T10:45:03.061+0000] GC(1615) Pause Young (Normal) (G1 Evacuation Pause) 170M->17M(256M) 1.849ms
[2020-02-20T10:45:06.048+0000] GC(1616) Pause Young (Normal) (G1 Evacuation Pause) 169M->17M(256M) 1.602ms
[2020-02-20T10:45:08.889+0000] GC(1617) Pause Young (Normal) (G1 Evacuation Pause) 169M->17M(256M) 1.418ms
[2020-02-20T10:45:11.678+0000] GC(1618) Pause Young (Normal) (G1 Evacuation Pause) 169M->17M(256M) 1.628ms
[2020-02-20T10:45:14.540+0000] GC(1619) Pause Young (Normal) (G1 Evacuation Pause) 169M->17M(256M) 2.673ms

Conclusions

The update of the paho java client from version 1.2.0 to 1.2.1 has a major impact on the performance of our overall system. Therefore we are not able to migrate to a current paho version, although relevant bug fixes have been applied.

Why the system behaves this way with the new paho version could not be clarified at this point. It would be great if there was a solution for our problem.

Otherwise the upgrade path will be blocked for us.

markroos commented 4 years ago

We found that with 1.2.2 the tcpNoDelay state was changed from true to false. This caused our app (1k-10K msg / s dos = 1) to back up and generate lots of memory usage for the un processed messages. We patched the code and things went back to as it was in 1.2.1. We guess that it may be due to the qos handshakes being delayed.

rdasgupt commented 4 years ago

@markroos Thanks for finding the root cause of the problem. We will evaluate this and fix the code accordingly.

FrVaBe commented 4 years ago

@markroos That is interesting but I observed the CPU usage increase when migrating from 1.2.0 to 1.2.1. Not sure if this here is the same issue that you were able to fix when migration code parts back to 1.2.1. Nevertheless thanks for your investigantion.

markroos commented 4 years ago

Our transition went from 1.2.0 directly to 1.2.3 so I have no data on 1.2.1. At the same time we went from qos=0 to qos=1 for most of our messages. We ran into memory use issues similar to what you mention. In addition to the mentioned client change we also had to modify the broker configuration ( mosquitto ) and moved client processing, especially message sends, off the paho client receive thread. We were monitoring the inFlightQueue and saw it growing quickly. We were doing a mqtt send in the messageArrived callback (async) which seemed to be a bad idea.