Kafka compilation failure on ubuntu 16 and throughput issue towards kafka

ResearchIntern98 commented 4 years ago

We can't compile ipfixcol2 with kafka on ubuntu 16 as it shows librdkafka version incompatibility.

We could compile ipfixcol2 with kafka on ubuntu 18, however on running,it produces 17-20K kafka record per second though input ipfix records were much higher.

With previous ipfixcol, on ubuntu 16, it produced 0.1 million kafka record per second, while running the same produced 17-20k kafka record per second on ubuntu 18.

Lukas955 commented 4 years ago

Since I considered librdkafka provided by Ubuntu 16.04 as outdated I decided to require at least version 0.9.3 (usually available almost everywhere). However, if you want to try compile ipfixcol2 on Ubuntu 16.04, just try to remove version number here: https://github.com/CESNET/ipfixcol2/blob/5f8f72e3650d11a8a38c22324e9d4a0d7b918982/src/plugins/output/json/CMakeLists.txt#L20 (nevertheless, it is possible that some API used by the plugin are not available there)

Can you provide more information how are you testing the performance?

Is Kafka running on the same server as the collector?
What version of Kafka broker are you using?
Do you replay IPFIX data to the IPFIXcol2 using ipfixsend2 tool?
How does your configuration look like?

By the way, on Fedora 31, librdkafka 0.11.6 and the latest Kafka broker (2.4.0) running on the same desktop system (i7-7700T), the collector managed to produce >350k records/s.

Lukas955 commented 4 years ago

Note: if you are running JSON plugin with "info" verbosity level, the plugin will print every second addition information about record delivery performance.

<output>
    <name>JSON output</name>
    <plugin>json</plugin>
    <verbosity>info</verbosity>      <!-- only increase verbosity of this plugin -->
    <params>
        ...
    </params>
</output>

ResearchIntern98 commented 4 years ago

Kafka and collector are on different machines. Kafka version 2.11-2.3.0 We are getting some production data towards ipfixcol2. We collect IPFIX data at ipfixcol2 on udp and send it to kafka with json format.

Lukas955 commented 4 years ago

When you describe your issues, one would expect more verbose answers and step-by-step guide. You must be more proactive and provide additional information, which can have impact on performance. I really don't want to spend my valuable time to try all possible situations!

Few more questions:

How do you measure throughput? In my case, I run the plugin with "info" verbosity and see the following output: INFO: JSON output: STATS: successful deliveries: 530338, failures: 0 which means I successfully sent (with confirmation from cluster) > 530k records per second. Please try the same and send results.
For how long did you let IPFIXcol2 to run? I assume that you are fully aware that sending NetFlow/IPFIX over UDP has some issues with templates. If the collector started after exporters it will take some time to get all templates and interpret all flow data (depends on `"template refresh interval" configuration of your exporters).

ResearchIntern98 commented 4 years ago

Hi, Info shows zero failures. But throughput towards Kafka remains 25K records per second We let the ipfixcol run for 2 hours. Template refresh interval was 30 minutes. We calculate number of records by kafka console consumer.

For ipfixcol1, there was discrepancy for throughput on ubuntu 16 & 18.

Lukas955 commented 4 years ago

Hi, I need much more information to help you with this issue:

Can you send me exactly how do you measure performance, so I can try it too. "Copy-paste" command is what I'm looking for.
Can you also provide the collector log with enabled "info" verbosity for the plugin?
Can you send me your XML configuration of the collector?

By the way, 30 minutes template refresh interval is quite long. Since the UDP plugin of the collector consider templates as invalid (by default) after 30 minutes, it is possible that some flow records might be lost here too. I recommend you to change "template refresh interval" settings of your exporters to approximately 10 minutes. However, if you cannot change that adjust parameters of the UDP input plugin, so lifetime values are at least 3x higher than values on your exporters. In your case 5400 seconds (90 minutes):

   <templateLifeTime>5400</templateLifeTime>
   <optionsTemplateLifeTime>5400</optionsTemplateLifeTime>

I'm also curious if you get any warnings/errors (except missing templates + unexpected sequence numbers) when you run the collector with increased verbosity mode: ipfixcol2 -vv -c <config.xml>:

ipfixcol2 -vv -c ~/config.xml | grep -v -e "due to missing (Options) Template" -e "Unexpected Sequence number"

Please, send also this log, if you see any warnings.

Stanzinkdl commented 4 years ago

Hello, In reply to the above query, the collector log with enabled "info" verbosity is provided and also the XML configuration of the collector. Performance is measured by calculating number of flows per second received in the kafka messaging system. I have provided the screenshot of the same with the calculations done. We are receiving approx 20k only in kafka as well as in ipfixcol2 by running it with grep INFO. IPfixcol2 is running on ubuntu 18. Screenshot from 2020-02-12 12-07-22 Screenshot from 2020-02-12 11-58-10

The contents of startup.xml is as follows:

startup xml

Also after running ipfixcol2 -vv -c ~/startup.xml |grep -v -e -e , there occurs exceptions and warnings. Therefore the log file is attached. [

logsofTemplate.log

](url)

This log file shows a UDP warning: The maximum socket receive buffer size is too small (212992 bytes). I have enlarged the size of the buffer and the output log is given below: output1.log

Lukas955 commented 4 years ago

Hi, thank you for the logs, it helps a lot.

The first log (logsofTemplate.log) shows typical situation after collector startup. Templates are not available so the collector is not able to parse any flow records. This situation should be resolved after receiving Templates from exporters. Nevertheless, I think I should add some log aggregation feature for warnings to make output like this one more readable.

The second log (output1.log) is more interesting. In this case, the exporter received some templates and started to parse flow records. If you filter out warnings about missing templates and unexpected sequence numbers, you can see that the collector was successfully sending ~40k records/s to Kafka brokers:

...
INFO: JSON output: STATS: successful deliveries: 21862, failures: 0
...
INFO: JSON output: STATS: successful deliveries: 38517, failures: 0
INFO: JSON output: STATS: successful deliveries: 40104, failures: 0
ERROR: JSON output: rd_kafka_produce() failed: Local: Queue full (1x)
ERROR: JSON output: rd_kafka_produce() failed: Local: Queue full (73784x)
INFO: JSON output: STATS: successful deliveries: 34944, failures: 0
...

Based on sequence number errors I discovered that IPFIX Message are being received out-of-order, which is not problem for the collector at all. On the other hand, it can signalize some kind of network issue. Moreover, in the excerpt of the log above you can see the following error: ERROR: JSON output: rd_kafka_produce() failed: Local: Queue full (73784x), which means that the librdkafka output buffer is full i.e. transfer of kafka message from your server with the collector to kafka brokers is not fast enough.

This gave me an idea that there is a network bottleneck between your infrastructure and the server with IPFIXcol2. You should try to measure throughput from and to the server, packet drop rate, etc.

By the way, am i right that you also tried IPFIXcol (1.generation) on the same machine with very similar results? If so, the bottleneck would explain that too.

Stanzinkdl commented 4 years ago

So why this bottleneck is only on ubuntu18. Why with the same configuration on ubuntu 16 , ipfixcol1 saturates around 120000 kafka record per second. On ubuntu 18, ipfixcol1 saturates at 20-25K record per second.

Lukas955 commented 4 years ago

Are you testing on the same physical server? I think there should not be any difference between Ubuntu 16 and 18, therefore, I recommend you to check physical connectivity of the server to your infrastructure...

Stanzinkdl commented 4 years ago

Yes, i have checked ipfixcol1 and ipfixcol2 on the same server. However, the bottleneck is there is ipfixcol1 as well. While running ipfixcol1 on the same server i got this error message "WARNING: json kafka: maximum number of outstanding messages (100000) has been reached: 'queue.buffering.max.messages' "

Lukas955 commented 4 years ago

The issue is obviously caused by your server or network. It has nothing to do with IPFIXcol or IPFIXcol2.

As I previously said, I recommend you to check network connectivity and throughput between your servers (exporter -> collector and collector -> any broker). For example, try to use tools such as iperf3 (https://www.tecmint.com/test-network-throughput-in-linux/)

ResearchIntern98 commented 4 years ago

Since I considered librdkafka provided by Ubuntu 16.04 as outdated I decided to require at least version 0.9.3 (usually available almost everywhere). However, if you want to try compile ipfixcol2 on Ubuntu 16.04, just try to remove version number here: https://github.com/CESNET/ipfixcol2/blob/5f8f72e3650d11a8a38c22324e9d4a0d7b918982/src/plugins/output/json/CMakeLists.txt#L20

(nevertheless, it is possible that some API used by the plugin are not available there) Can you provide more information how are you testing the performance?

Is Kafka running on the same server as the collector?

What version of Kafka broker are you using?

Do you replay IPFIX data to the IPFIXcol2 using ipfixsend2 tool?

How does your configuration look like?

By the way, on Fedora 31, librdkafka 0.11.6 and the latest Kafka broker (2.4.0) running on the same desktop system (i7-7700T), the collector managed to produce >350k records/s.

As you told ISSUE with compilation with ubuntu 16 In constructor ‘Kafka::Kafka(const cfg_kafka&, ipx_ctx_t)’: /home/ccsm/ipfixcol2-devel/src/plugins/output/json/src/Kafka.cpp:67:28: error: ‘RD_KAFKA_MSG_F_BLOCK’ was not declared in this scope m_produce_flags |= RD_KAFKA_MSG_F_BLOCK; ^ /home/ccsm/ipfixcol2-devel/src/plugins/output/json/src/Kafka.cpp:108:60: error: ‘rd_kafka_last_error’ was not declared in this scope rd_kafka_resp_err_t err_code = rd_kafka_last_error(); ^ /home/ccsm/ipfixcol2-devel/src/plugins/output/json/src/Kafka.cpp: In destructor ‘virtual Kafka::~Kafka()’: /home/ccsm/ipfixcol2-devel/src/plugins/output/json/src/Kafka.cpp:139:52: error: ‘rd_kafka_flush’ was not declared in this scope if (rd_kafka_flush(m_kafka.get(), FLUSH_TIMEOUT) == RD_KAFKA_RESP_ERR__TIMED_OUT) { ^ /home/ccsm/ipfixcol2-devel/src/plugins/output/json/src/Kafka.cpp: In member function ‘virtual int Kafka::process(const char, size_t)’: /home/ccsm/ipfixcol2-devel/src/plugins/output/json/src/Kafka.cpp:170:56: error: ‘rd_kafka_last_error’ was not declared in this scope rd_kafka_resp_err_t err_code = rd_kafka_last_error(); ^ src/plugins/output/json/CMakeFiles/json-output.dir/build.make:182: recipe for target 'src/plugins/output/json/CMakeFiles/json-output.dir/src/Kafka.cpp.o' failed

Lukas955 commented 4 years ago

As I also said, Ubuntu 16.04 contains really outdated version of librdkafka library and some features are missing (in this case, it is support for blocking mode).

Moreover, Ubuntu 16.04 LTS has reached end of its community support in 2019. Current LTS version 18.04, which will be replaced by 20.04 LTS version this April, is fully supported. Therefore, i don't have any plains to add support for outdated Ubuntu version.

Stanzinkdl commented 4 years ago

There seems to be no bottleneck in our server. I have checked the throughput between collector and broker. Its approx 3360k KB ~ 3GB Screenshot from 2020-02-19 10-29-42

thorgrin commented 4 years ago

I've decided to test the issue. Here are my steps and results:

Installation

Using Ubuntu 18.04 LTS, I've created a VM with 16GB RAM and 4 cores.
I've installed Kafka following this howto: https://www.digitalocean.com/community/tutorials/how-to-install-apache-kafka-on-ubuntu-18-04
I've installed libfds from master
I've installed ipfixcol2 from devel

Configuration

I'm using this IPFIXcol2 configuration

<ipfixcol2>
  <inputPlugins>
    <input>
      <name>UDP input</name>
      <plugin>udp</plugin>
      <params>
        <localPort>4739</localPort>
        <localIPAddress>
        </localIPAddress>
      </params>
    </input>
  </inputPlugins>
  <outputPlugins>
    <output>
      <name>JSON output</name>
      <plugin>json</plugin>
      <params>
        <tcpFlags>formatted</tcpFlags>
        <timestamp>formatted</timestamp>
        <protocol>formatted</protocol>
        <ignoreUnknown>true</ignoreUnknown>
        <ignoreOptions>true</ignoreOptions>
        <nonPrintableChar>true</nonPrintableChar>
        <octetArrayAsUint>true</octetArrayAsUint>
        <numericNames>false</numericNames>
        <splitBiflow>false</splitBiflow>
        <detailedInfo>false</detailedInfo>
        <templateInfo>false</templateInfo>
        <outputs>
            <kafka>
                <name>Send to Kafka</name>
                <brokers>127.0.0.1</brokers>
                <topic>ipfix</topic>
                <blocking>false</blocking>
                <partition>unassigned</partition>

                <property>
                    <key>compression.codec</key>
                    <value>lz4</value>
                </property>
            </kafka>
        </outputs>
      </params>
    </output>
  </outputPlugins>
</ipfixcol2>

Testing

To test, I run the following processes, all on the same VM:

ipfixcol2 -c startup.xml
ipfixsend2 -i dump.ipfix <- this contains some ipfix data, there are quite a lot of sequence number warning messages, but that should not be important here
/home/kafka/kafka/bin/kafka-consumer-perf-test.sh --topic ipfix --broker-list localhost:9092 --messages 1000000 --from-latest --threads 2 <- this actually reads the data from kafka and gives me a reading throughput

Results

So, this is what I found:

$ /home/kafka/kafka/bin/kafka-consumer-perf-test.sh --topic ipfix --broker-list localhost:9092 --messages 1000000 --from-latest --threads 2     
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2020-02-19 16:32:39:290, 2020-02-19 16:32:49:682, 579.4562, 55.7598, 1000033, 96231.0431, 1582129959764, -1582129949372, -0.0000, -0.0006

$ /home/kafka/kafka/bin/kafka-consumer-perf-test.sh --topic ipfix --broker-list localhost:9092 --messages 1000000 --from-latest --threads 2
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2020-02-19 16:33:02:996, 2020-02-19 16:33:12:145, 579.1685, 63.3040, 1000017, 109303.4211, 1582129983470, -1582129974321, -0.0000, -0.0006

$ /home/kafka/kafka/bin/kafka-consumer-perf-test.sh --topic ipfix --broker-list localhost:9092 --messages 1000000 --from-latest --threads 2
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2020-02-19 16:35:38:768, 2020-02-19 16:35:49:965, 543.4454, 48.5349, 1000062, 89315.1737, 1582130139363, -1582130128166, -0.0000, -0.0006

What it means is that I can read roughly 100k msg/s from the kafka (min 89315, max 109303).

Sounds good, but I want to check whether it can be improved by speeding up the IPFIXcol2. So, lets update the config a bit and remove some formatting:

18,20c18,20
<         <tcpFlags>formatted</tcpFlags>
<         <timestamp>formatted</timestamp>
<         <protocol>formatted</protocol>
---
>         <tcpFlags>raw</tcpFlags>
>         <timestamp>unix</timestamp>
>         <protocol>raw</protocol>

Rerun the test:

$ /home/kafka/kafka/bin/kafka-consumer-perf-test.sh --topic ipfix --broker-list localhost:9092 --messages 1000000 --from-latest --threads 2
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2020-02-19 16:35:59:772, 2020-02-19 16:36:06:134, 545.8337, 85.7959, 1000324, 157234.2031, 1582130160212, -1582130153850, -0.0000, -0.0006

$ /home/kafka/kafka/bin/kafka-consumer-perf-test.sh --topic ipfix --broker-list localhost:9092 --messages 1000000 --from-latest --threads 2
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2020-02-19 16:36:13:442, 2020-02-19 16:36:19:471, 546.6914, 90.6770, 1000298, 165914.4137, 1582130173825, -1582130167796, -0.0000, -0.0006

$ /home/kafka/kafka/bin/kafka-consumer-perf-test.sh --topic ipfix --broker-list localhost:9092 --messages 1000000 --from-latest --threads 2
start.time, end.time, data.consumed.in.MB, MB.sec, data.consumed.in.nMsg, nMsg.sec, rebalance.time.ms, fetch.time.ms, fetch.MB.sec, fetch.nMsg.sec
2020-02-19 16:36:34:069, 2020-02-19 16:36:40:583, 546.6848, 83.9246, 1000086, 153528.7074, 1582130194510, -1582130187996, -0.0000, -0.0006

OK, so now we get more than 150k msg/s, which is a considerable improvement. So, although it seems that IPFIXcol2 is the bottleneck here (increasing throughput is noticeable), it is nowhere near the reported 20k msg/s.

Therefore, if you really want to resolve this issue, you'll have to help us a bit.

Repeat the test that I described and confirm the numbers on your IPFIX data. It may be that you have a really special data that are taking forever to process in IPFIXcol2.
Try replace the /home/kafka/kafka/config/server.properties with a configuration that is more like your production one. Does that change the throughput?

If the above steps do not yield any results, you'll have to provide a way for us to reproduce your issue.

Stanzinkdl commented 4 years ago

On ubuntu 18.04.4 LTS, VM with 16GB RAM. Tried kafka and ipfixcol2 on the same server as well as different servers. Using "ipfixcol2 -vv 3 | grep INFO" on the collector, the following screenshot shows the output in kafka broker. By changing the configuration file (startup.xml) there is not much of a difference as you can see in the image. Only command i did'nt used was "ipfixcol2 -c startup.xml" and "ipfixsend2 -i dumpfile.ipfix". Can you tell me what is the use of these two commands and how to produce the ipfix file? However, this is the result i found. It is similar to the tests before i.e., 20k msgs/sec. Screenshot from 2020-02-20 15-22-37

thorgrin commented 4 years ago

I use the ipfixsend2 to replay ipfix data file at high speed so that I can test the throughput without depending on the source. You can generate your own ipfix data file using the ipfix output plugin: https://github.com/CESNET/ipfixcol2/tree/master/src/plugins/output/ipfix#example-configuration

If you could share the data file, that would probably help. Anyway, please try to use it with ipfixsend2 to generate as much data as possible and check again using the kafka-consumer-perf-test script.

Lukas955 commented 4 years ago

By the way, example IPFIX file can be found here.

If possible, increase only verbosity of the JSON plugin. Otherwise, it is possible that printing log messages on the standard output might slowdown the collector.

Stanzinkdl commented 4 years ago

https://drive.google.com/open?id=1RBLd2v3rk7vZ07NU-rjriuMZHMVq7O2W this is the link of the ipfix file

Lukas955 commented 4 years ago

I possibly find the solution. I added extra librdkafka parameters to the plugin configuration: batch.num.messages and linger.ms. In case of my VM (which i created based on thorgin guide), it significantly improved performance from 30k/s to 160k/s.

<kafka>
  <name>Send to Kafka</name>
  <brokers>127.0.0.1</brokers>
  <topic>ipfix</topic>
  <blocking>false</blocking>
  <partition>unassigned</partition>

  <!-- Zero or more additional properties -->
  <property>
      <key>compression.codec</key>
      <value>lz4</value>
  </property>
  <property>
      <key>linger.ms</key>
      <value>100</value>
  </property>
  <property>
      <key>batch.num.messages</key>
      <value>200000</value>
  </property>
</kafka>

thorgrin commented 4 years ago

This improved my speed from 100k msg/s -> 160k msg/s in case of the first configuration and 150k msg/s -> 190k msg/s in the second case.

@Stanzinkdl I cannot read the ipfix file, it seems to be corrupted. It should be possible to read in in wireshark as well, but it fails for me.

thorgrin commented 4 years ago

Ok, so @Lukas955 preprocessed the data for me and I was able to use it in the test. It gives me about 140k msg/s using startup2.xml with the added properties.

Stanzinkdl commented 4 years ago

Thank you @Lukas955 @thorgrin this worked and throughput has increased from 20k to 110k per sec on our system settings. While using ipfixsend2 with the data file provided by you https://github.com/CESNET/ipfixcol2/tree/master/doc/data/ipfix throughput is approx 125k. As you previously said, "on Fedora 31, librdkafka 0.11.6 and the latest Kafka broker (2.4.0) running on the same desktop system (i7-7700T), the collector managed to produce >350k records/s" Is it possible to achieve ~350k records/s on Ubuntu 18 with kafka broker (2.3.0) or the latest 2.4.0?

Lukas955 commented 4 years ago

Great news! If you want to increase throughput, I recommend you few things to do:

Try to experiment with more librdkafka parameters. As you can see, the kafka connector is probably the bottleneck here. You can find all possible configuration parameters here - see table "Global configuration properties" and parameters for the producer (column C/P is "*" or "P").
The JSON plugin also supports optional parameter <brokerVersion> (see the description of the plugin), which can enable feature negotiation between the producer and brokers and might also improve performance. It changes parameters api.version.request and broker.version.fallback based on the library recommendation.
If none of the above works, try to distribute processing of flows. For example, you can run two instances of IPFIXcol2. Each instance is listening on different port and some of your exporters send data to the first collector and the rest to the second one. However, if you don't want to run multiple collectors and your exporters are using different Source ID/ODID, you can try to create multiple output instances of the JSON plugin with ODID filter to internally distribute flows on different kafka connectors, for example:

  <outputPlugins>
    <output>
      <name>First JSON output</name>
      <plugin>json</plugin>
      <odidOnly>0-10</odidOnly>      <!-- ODID filter -->
      <params>
         ...
      </params>
    </output>
    <output>
      <name>Second JSON output</name>
      <plugin>json</plugin>
      <odidExcept>0-10</odidExcept>  <!-- ODID filter -->
      <params>
        ...
      </params>
    </output>
  </outputPlugins>

You can find more information about the ODID filter here. By the way, ipfixsend2 allows you to rewrite ODID of IPFIX Messages (parameter -O num), so you can even try to test throughput in this case.

ResearchIntern98 commented 4 years ago

We could not achieve throughput on fedora31 with above settings. However, we could achieve 250K of throughput on each instance of ipfixcol2 on Centos8.

Lukas955 commented 4 years ago

Thank you for the feedback. Since we solved the throughput problem, I'm closing this issue.

CESNET / ipfixcol2