CESNET / ipfixcol

IPFIXcol is an implementation of an IPFIX (RFC 7011) collector

Forwarder: missing templates on SIGUSR1 queues up messages until it hangs #198

Closed morph027 closed 6 years ago

morph027 commented 6 years ago

Hi!

Thanks for creating this IPFIX tools collection! I just wanted to test some things with the forwarder to multiple destinations (TCP), which works really well as long as they are alive. But I've noticed that if one destination goes away (possibly forever, or just for a very long time), the queues fill up and the whole process stalls once they are full. No other destination receives data anymore. I'm not sure if this needs some kind of dead man's switch or watchdog that keeps trying to reconnect but discards messages while the destination is gone.

Did I just overlook a configuration key, or is this by design?
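The stall described above is consistent with a bounded blocking queue feeding all destinations from one producer: once the slowest (dead) consumer stops draining, the producer blocks on a full queue and every other destination starves. A minimal sketch of that head-of-line blocking pattern (plain Python, not IPFIXcol code):

```python
import queue

# A small bounded queue stands in for the per-destination output queue.
out_q = queue.Queue(maxsize=3)

# The dead destination never drains its queue, so the producer fills it...
for msg in range(3):
    out_q.put_nowait(msg)

# ...and the next put would block forever (put_nowait raises instead,
# so we can observe the condition without hanging).
try:
    out_q.put_nowait("one more")
    blocked = False
except queue.Full:
    blocked = True  # with a blocking put(), the forwarder would hang here

print(blocked)  # → True
```

With a blocking `put()`, the producer thread that serves all destinations would stop here, which matches the "no other destination receives data anymore" symptom.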

Lukas955 commented 6 years ago

Hi,

thank you for your feedback! The behavior you describe is really unusual. When one or more destinations are disconnected, the forwarding plugin does not send them any IPFIX Messages until a separate thread re-establishes the connection. I looked quickly into the code, but I didn't find any obvious bug.

To find the bug, I need you to answer a few questions:

1) Are there any other plugins in your startup configuration besides an input plugin (TCP, UDP, ...) and the forwarding plugin? If so, those plugins can slow down other output plugins.
2) Which <distribution> mode of the forwarding plugin do you use ("RoundRobin" or "all")?

Can you also try to run ipfixcol with a higher verbosity mode (ipfixcol -v 2)? The plugin logs disconnection and reconnection events, which can be really helpful for debugging.

Lukas

morph027 commented 6 years ago

Hi! Thanks for answering that quickly ;)

startup.xml

<?xml version="1.0" encoding="utf-8"?>
<ipfix xmlns="urn:ietf:params:xml:ns:yang:ietf-ipfix-psamp">
    <collectingProcess>
        <tcpCollector>
            <localIPAddress>127.0.0.1</localIPAddress>
            <localPort>4739</localPort>
            <name>TCP-4739</name>
        </tcpCollector>
        <name>TCP-4739</name>
        <exportingProcess>EXPORT-FORWARD</exportingProcess>
    </collectingProcess>
    <exportingProcess>
        <destination>
            <fileWriter>
                <defaultPort>4739</defaultPort>
                <distribution>all</distribution>
                <destination>
                    <ip>10.10.10.1</ip>
                    <protocol>tcp</protocol>
                    <port>4739</port>
                </destination>
                <destination>
                    <ip>10.10.10.2</ip>
                    <protocol>tcp</protocol>
                    <port>4739</port>
                </destination>
                <destination>
                    <ip>10.10.10.3</ip>
                    <protocol>tcp</protocol>
                    <port>4739</port>
                </destination>
                <fileFormat>forwarding</fileFormat>
            </fileWriter>
            <name>FORWARD</name>
        </destination>
        <name>EXPORT-FORWARD</name>
    </exportingProcess>
</ipfix>

When starting up ipfixcol, I see messages like:

WARNING: forwarding(dst): Connection to '10.10.10.1:4739' failed.
INFO: output manager: [0] Data Manager created
WARNING: forwarding(dst): Connection to '10.10.10.2:4739' established.
WARNING: forwarding(dst): Connection to '10.10.10.3:4739' established.
morph027 commented 6 years ago

OK... properly shutting down my receiving side seems to be fine:

WARNING: forwarding(sender): Connection to "10.10.10.1:4739" closed (Broken pipe).
...
WARNING: forwarding(dst): Connection to '10.10.10.1:4739' established.
morph027 commented 6 years ago

OK... I think I'm getting closer.

Starting with 2 receivers, everything is OK; it detects states properly. When I add a new destination to the XML and send a SIGUSR1, it does not work (even though the destination exists and is listening).

 | Queue utilization:
 |     Preprocessor output queue: 832 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0      8191 / 8192
morph027 commented 6 years ago

Hm... now all destinations are running and processing IPFIX, yet it still fills up the queue:

Statistics
 | Time: 1523456777, run time: 90
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            3226            3216               0              29              29                    0
 |
 |        TID   state  cpu usage     thread name
 |      13215       S     0.21 %    ipfixcol:TCP
 |      13216       S     0.00 %    ipfixcol:TCP
 |      13217       S     0.11 %     ipfixcol OM
 |      13218       R     0.00 %  ipfixcol:stats
 |      13219       S     0.00 %     ipfixcol OM
 |      13289       S     0.00 %    ipfixcol:TCP
 |      13399       S     0.00 %    ipfixcol:TCP
 |      13400       S     0.21 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0      2350 / 8192
Statistics
 | Time: 1523456787, run time: 100
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            3522            3512               0              29              29                    0
 |
 |        TID   state  cpu usage     thread name
 |      13215       S     0.32 %    ipfixcol:TCP
 |      13216       S     0.00 %    ipfixcol:TCP
 |      13217       S     0.00 %     ipfixcol OM
 |      13218       R     0.00 %  ipfixcol:stats
 |      13219       S     0.00 %     ipfixcol OM
 |      13289       S     0.00 %    ipfixcol:TCP
 |      13399       S     0.00 %    ipfixcol:TCP
 |      13400       S     0.42 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0      2646 / 8192
Statistics
 | Time: 1523456797, run time: 110
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            3897            3887               0              37              37                    0
 |
 |        TID   state  cpu usage     thread name
 |      13215       S     0.42 %    ipfixcol:TCP
 |      13216       S     0.00 %    ipfixcol:TCP
 |      13217       S     0.00 %     ipfixcol OM
 |      13218       R     0.00 %  ipfixcol:stats
 |      13219       S     0.00 %     ipfixcol OM
 |      13289       S     0.00 %    ipfixcol:TCP
 |      13399       S     0.00 %    ipfixcol:TCP
 |      13400       S     0.21 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0      3021 / 8192
Statistics
 | Time: 1523456807, run time: 120
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            4245            4235               0              34              34                    0
 |
 |        TID   state  cpu usage     thread name
 |      13215       S     0.32 %    ipfixcol:TCP
 |      13216       S     0.00 %    ipfixcol:TCP
 |      13217       S     0.00 %     ipfixcol OM
 |      13218       R     0.00 %  ipfixcol:stats
 |      13219       S     0.00 %     ipfixcol OM
 |      13289       S     0.00 %    ipfixcol:TCP
 |      13399       S     0.00 %    ipfixcol:TCP
 |      13400       S     0.21 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0      3369 / 8192
thorgrin commented 6 years ago

This is strange; the JSON plugin should be able to process a few hundred thousand flows per second. Are you sure that the destinations are consuming the data fast enough? Could you run ss -p -t | grep ipfix to see whether the Send-Q is getting full?

morph027 commented 6 years ago

Looks like a different problem; the TCP session handling seems to cope with this pretty well.

Current steps to reproduce:

Statistics
 | Time: 1523513902, run time: 70
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            2086            2078               0              30              29                    0
 |
 |        TID   state  cpu usage     thread name
 |       8246       S     0.21 %    ipfixcol:TCP
 |       8247       S     0.00 %    ipfixcol:TCP
 |       8248       S     0.00 %     ipfixcol OM
 |       8249       R     0.00 %  ipfixcol:stats
 |       8250       S     0.00 %     ipfixcol OM
 |       8251       S     0.11 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0         0 / 8192
INFO: main: [8246] Starting reconfiguration process
INFO: config: No intermediate plugin configured in user configuration
INFO: configurator: [8246] Closing storage plugin 0 (forwarding)
INFO: storage plugin thread: [0] Closing storage plugin thread
INFO: configurator: [8246] Opening storage xml_conf: /usr/share/ipfixcol/plugins/ipfixcol-forwarding-output.so
INFO: configurator: No profile configuration
WARNING: forwarding(dst): Connection to '10.10.10.1:4737' established.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
 | Time: 1523513912, run time: 80
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            2380            2371               0              29              29                    0
 |
 |        TID   state  cpu usage     thread name
 |       8246       S     0.21 %    ipfixcol:TCP
 |       8247       S     0.00 %    ipfixcol:TCP
 |       8248       S     0.00 %     ipfixcol OM
 |       8249       R     0.00 %  ipfixcol:stats
 |       8250       S     0.00 %     ipfixcol OM
 |       8494       S     0.00 %    ipfixcol:TCP
 |       8495       S     0.00 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0       159 / 8192
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
 | Time: 1523513922, run time: 90
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            2640            2631               0              26              26                    0
 |
 |        TID   state  cpu usage     thread name
 |       8246       S     0.32 %    ipfixcol:TCP
 |       8247       S     0.00 %    ipfixcol:TCP
 |       8248       S     0.11 %     ipfixcol OM
 |       8249       R     0.00 %  ipfixcol:stats
 |       8250       S     0.00 %     ipfixcol OM
 |       8494       S     0.00 %    ipfixcol:TCP
 |       8495       S     0.00 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0       419 / 8192
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
 | Time: 1523513932, run time: 100
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            2920            2911               0              28              28                    0
 |
 |        TID   state  cpu usage     thread name
 |       8246       S     0.32 %    ipfixcol:TCP
 |       8247       S     0.00 %    ipfixcol:TCP
 |       8248       R     0.00 %     ipfixcol OM
 |       8249       R     0.00 %  ipfixcol:stats
 |       8250       S     0.00 %     ipfixcol OM
 |       8494       S     0.00 %    ipfixcol:TCP
 |       8495       S     0.00 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 1 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0       698 / 8192
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
 | Time: 1523513942, run time: 110
 |       ODID         packets       data rec.  lost data rec.       packets/s  data records/s  lost data records/s
 |          0            3203            3194               0              28              28                    0
 |
 |        TID   state  cpu usage     thread name
 |       8246       S     0.21 %    ipfixcol:TCP
 |       8247       S     0.00 %    ipfixcol:TCP
 |       8248       S     0.00 %     ipfixcol OM
 |       8249       R     0.00 %  ipfixcol:stats
 |       8250       S     0.00 %     ipfixcol OM
 |       8494       S     0.00 %    ipfixcol:TCP
 |       8495       S     0.11 % out:forwarding 
 |
 | Queue utilization:
 |     Preprocessor output queue: 0 / 8192
 |     Output Manager output queues:
 |         ODID | waiting / total size
 |            0       982 / 8192

Looks like it somehow keeps the messages with unmatched templates?

Note: My sender sends and re-sends all templates when a connection is established (and then periodically).

Note: Restarting instead of reloading is working properly.

Note: When messages are queuing up, it even fails to stop in a sane amount of time; I need to hard-kill it.
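The template re-sending behavior mentioned in the first note above can be sketched as a toy exporter (all names here are made up, not a real exporter API) that re-emits its templates on connect and again whenever a refresh interval elapses:

```python
import time

class TemplateRefreshingExporter:
    """Toy exporter: re-sends its templates on connect and every
    `refresh_interval` seconds thereafter (hypothetical names)."""

    def __init__(self, templates, refresh_interval=600.0, clock=time.monotonic):
        self.templates = templates          # {template_id: field list}
        self.refresh_interval = refresh_interval
        self.clock = clock                  # injectable for testing
        self.last_refresh = None
        self.sent = []                      # stand-in for the wire

    def on_connect(self):
        # Templates go out first on every (re)connection.
        self._send_templates()

    def send_record(self, template_id, record):
        # Re-send all templates if the refresh interval has elapsed.
        if (self.last_refresh is None
                or self.clock() - self.last_refresh >= self.refresh_interval):
            self._send_templates()
        self.sent.append(("data", template_id, record))

    def _send_templates(self):
        for tid, fields in self.templates.items():
            self.sent.append(("template", tid, fields))
        self.last_refresh = self.clock()
```

Over TCP, RFC 7011 only requires templates once per transport session; the periodic refresh here is the UDP-style safety net the reporter's sender uses, which is why a plain collector restart (new session, fresh templates) recovers.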

morph027 commented 6 years ago
# ss -p -t | grep ipfixcol
ESTAB      0      0      127.0.0.1:53812                127.0.0.1:ipdr-sp               users:(("ipfixcol:TCP",pid=9162,fd=6))
ESTAB      0      0      127.0.0.1:54034                127.0.0.1:ipdr-sp               users:(("ipfixcol:TCP",pid=9162,fd=8))
SYN-SENT   0      1      10.10.0.1:56524                10.10.10.3:ipfix                users:(("ipfixcol:TCP",pid=9162,fd=9))
ESTAB      0      0      10.10.0.1:42192                10.10.10.2:ipfix                users:(("ipfixcol:TCP",pid=9162,fd=7))
ESTAB      0      0      ::ffff:127.0.0.1:ipfix         ::ffff:127.0.0.1:39126          users:(("ipfixcol:TCP",pid=9162,fd=4))
Lukas955 commented 6 years ago

Now I know what your problem is. The reconfiguration process of the collector allows you to add and remove plugin instances at runtime. However, if the configuration of a running instance changes, the collector destroys the instance and starts it over. Because the new instance missed the templates, it is not able to forward flows. Therefore, a restart of the collector works correctly (templates are sent by a TCP exporter during initialization of the input plugin), while reconfiguration doesn't. Unfortunately, I'm afraid there is no easy solution right now other than restarting the collector. However, the next generation of IPFIXcol (coming probably later this summer) will support runtime reconfiguration of instances, too.
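This failure mode (a freshly restarted plugin instance dropping Data Sets until templates arrive again) can be illustrated with a toy template manager; the names are hypothetical, not the plugin's actual API:

```python
class ToyTemplateManager:
    """Toy model of a per-ODID template store: Data Sets whose
    template is unknown cannot be decoded and are dropped."""

    def __init__(self):
        self.templates = {}   # (odid, template_id) -> field spec
        self.dropped = 0

    def add_template(self, odid, template_id, fields):
        self.templates[(odid, template_id)] = fields

    def forward(self, odid, template_id, data_set):
        if (odid, template_id) not in self.templates:
            self.dropped += 1   # "Some records will be definitely lost."
            return None
        return (template_id, data_set)

# After an in-place reconfiguration, the new instance starts empty:
# Data Sets referencing template 441 are dropped until the exporter
# re-sends that template.
tm = ToyTemplateManager()
print(tm.forward(0, 441, b"..."))   # → None (template unknown, record lost)
tm.add_template(0, 441, ["octets", "packets"])
print(tm.forward(0, 441, b"..."))   # → (441, b'...')
```

Since TCP exporters normally send templates only at session start, nothing re-populates this store after the plugin instance is recreated, which matches the flood of "Template ID: 441" warnings above.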

I also tried to reproduce your steps so I can confirm the behavior, but even after reconfiguration, messages in my queue are processed normally.

morph027 commented 6 years ago

Great, thank you for the explanation. It's OK for me to restart the service, so I will go this way for now.

I'll try to double-check whether the messages inside the queue are processed once all templates are known again.