Closed morph027 closed 6 years ago
Hi,
thank you for your feedback! The behavior you describe is really unusual. When one or more destinations are disconnected, the forwarding plugin does not send them any IPFIX Messages until a separate thread re-establishes a new connection. I looked quickly into the code, but I didn't find any obvious bug.
To find the bug, I need you to answer a few questions:
1) Are there any other plugins in your startup configuration besides an input plugin (TCP, UDP, ...) and the forwarding plugin? If so, these plugins can slow down other output plugins.
2) Which <distribution> mode of the forwarding plugin do you use ("RoundRobin" or "all")?
Can you also try to run ipfixcol with a higher verbosity mode, i.e. ipfixcol -v 2? The plugin logs disconnection and reconnection events that can be really helpful for debugging.
Lukas
Hi! Thanks for answering so quickly ;)
startup.xml
<?xml version="1.0" encoding="utf-8"?>
<ipfix xmlns="urn:ietf:params:xml:ns:yang:ietf-ipfix-psamp">
  <collectingProcess>
    <tcpCollector>
      <localIPAddress>127.0.0.1</localIPAddress>
      <localPort>4739</localPort>
      <name>TCP-4739</name>
    </tcpCollector>
    <name>TCP-4739</name>
    <exportingProcess>EXPORT-FORWARD</exportingProcess>
  </collectingProcess>
  <exportingProcess>
    <destination>
      <fileWriter>
        <defaultPort>4739</defaultPort>
        <distribution>all</distribution>
        <destination>
          <ip>10.10.10.1</ip>
          <protocol>tcp</protocol>
          <port>4739</port>
        </destination>
        <destination>
          <ip>10.10.10.2</ip>
          <protocol>tcp</protocol>
          <port>4739</port>
        </destination>
        <destination>
          <ip>10.10.10.3</ip>
          <protocol>tcp</protocol>
          <port>4739</port>
        </destination>
        <fileFormat>forwarding</fileFormat>
      </fileWriter>
      <name>FORWARD</name>
    </destination>
    <name>EXPORT-FORWARD</name>
  </exportingProcess>
</ipfix>
When starting up ipfixcol, I see messages like:
WARNING: forwarding(dst): Connection to '10.10.10.1:4739' failed.
INFO: output manager: [0] Data Manager created
WARNING: forwarding(dst): Connection to '10.10.10.2:4739' established.
WARNING: forwarding(dst): Connection to '10.10.10.3:4739' established.
OK... properly shutting down my receiving side seems to be handled fine:
WARNING: forwarding(sender): Connection to "10.10.10.1:4739" closed (Broken pipe).
...
WARNING: forwarding(dst): Connection to '10.10.10.1:4739' established.
OK... I think I'm getting closer.
Starting with 2 receivers, everything is OK; it detects the states properly. But when I add a new destination to the XML and send a SIGUSR1, it does not work (even when the destination exists and is listening).
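The SIGUSR1-triggered reload used here can be demonstrated with a small stand-in process (a minimal sketch, POSIX only; the child script below is a placeholder, not ipfixcol itself — for the real daemon it would simply be `kill -USR1 $(pidof ipfixcol)`):

```python
import signal
import subprocess
import sys
import time

# Stand-in "collector": installs a SIGUSR1 handler and waits, the way a
# daemon waits for a reload signal. (Hypothetical demo code, not ipfixcol.)
CHILD = (
    "import signal, sys, time\n"
    "def on_reload(*args):\n"
    "    print('reload', flush=True)\n"
    "    sys.exit(0)\n"
    "signal.signal(signal.SIGUSR1, on_reload)\n"
    "time.sleep(30)\n"
)

child = subprocess.Popen([sys.executable, "-c", CHILD],
                         stdout=subprocess.PIPE, text=True)
time.sleep(0.5)                    # give the child time to install its handler
child.send_signal(signal.SIGUSR1)  # same signal that triggers reconfiguration
out, _ = child.communicate(timeout=5)
assert out.strip() == "reload"
```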
| Queue utilization:
| Preprocessor output queue: 832 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 8191 / 8192
Hm... now all destinations are running and processing IPFIX, but it still fills up the queue:
Statistics
| Time: 1523456777, run time: 90
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 3226 3216 0 29 29 0
|
| TID state cpu usage thread name
| 13215 S 0.21 % ipfixcol:TCP
| 13216 S 0.00 % ipfixcol:TCP
| 13217 S 0.11 % ipfixcol OM
| 13218 R 0.00 % ipfixcol:stats
| 13219 S 0.00 % ipfixcol OM
| 13289 S 0.00 % ipfixcol:TCP
| 13399 S 0.00 % ipfixcol:TCP
| 13400 S 0.21 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 2350 / 8192
Statistics
| Time: 1523456787, run time: 100
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 3522 3512 0 29 29 0
|
| TID state cpu usage thread name
| 13215 S 0.32 % ipfixcol:TCP
| 13216 S 0.00 % ipfixcol:TCP
| 13217 S 0.00 % ipfixcol OM
| 13218 R 0.00 % ipfixcol:stats
| 13219 S 0.00 % ipfixcol OM
| 13289 S 0.00 % ipfixcol:TCP
| 13399 S 0.00 % ipfixcol:TCP
| 13400 S 0.42 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 2646 / 8192
Statistics
| Time: 1523456797, run time: 110
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 3897 3887 0 37 37 0
|
| TID state cpu usage thread name
| 13215 S 0.42 % ipfixcol:TCP
| 13216 S 0.00 % ipfixcol:TCP
| 13217 S 0.00 % ipfixcol OM
| 13218 R 0.00 % ipfixcol:stats
| 13219 S 0.00 % ipfixcol OM
| 13289 S 0.00 % ipfixcol:TCP
| 13399 S 0.00 % ipfixcol:TCP
| 13400 S 0.21 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 3021 / 8192
Statistics
| Time: 1523456807, run time: 120
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 4245 4235 0 34 34 0
|
| TID state cpu usage thread name
| 13215 S 0.32 % ipfixcol:TCP
| 13216 S 0.00 % ipfixcol:TCP
| 13217 S 0.00 % ipfixcol OM
| 13218 R 0.00 % ipfixcol:stats
| 13219 S 0.00 % ipfixcol OM
| 13289 S 0.00 % ipfixcol:TCP
| 13399 S 0.00 % ipfixcol:TCP
| 13400 S 0.21 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 3369 / 8192
This is strange; the JSON plugin should be able to process a few hundred thousand flows per second. Are you sure that the destinations are consuming the data fast enough? Could you run ss -p -t | grep ipfix to see whether the Send-Q is getting full?
Looks like another problem; the TCP session handling seems to cope with this pretty well.
Current steps to reproduce:
Statistics
| Time: 1523513902, run time: 70
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 2086 2078 0 30 29 0
|
| TID state cpu usage thread name
| 8246 S 0.21 % ipfixcol:TCP
| 8247 S 0.00 % ipfixcol:TCP
| 8248 S 0.00 % ipfixcol OM
| 8249 R 0.00 % ipfixcol:stats
| 8250 S 0.00 % ipfixcol OM
| 8251 S 0.11 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 0 / 8192
INFO: main: [8246] Starting reconfiguration process
INFO: config: No intermediate plugin configured in user configuration
INFO: configurator: [8246] Closing storage plugin 0 (forwarding)
INFO: storage plugin thread: [0] Closing storage plugin thread
INFO: configurator: [8246] Opening storage xml_conf: /usr/share/ipfixcol/plugins/ipfixcol-forwarding-output.so
INFO: configurator: No profile configuration
WARNING: forwarding(dst): Connection to '10.10.10.1:4737' established.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
| Time: 1523513912, run time: 80
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 2380 2371 0 29 29 0
|
| TID state cpu usage thread name
| 8246 S 0.21 % ipfixcol:TCP
| 8247 S 0.00 % ipfixcol:TCP
| 8248 S 0.00 % ipfixcol OM
| 8249 R 0.00 % ipfixcol:stats
| 8250 S 0.00 % ipfixcol OM
| 8494 S 0.00 % ipfixcol:TCP
| 8495 S 0.00 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 159 / 8192
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
| Time: 1523513922, run time: 90
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 2640 2631 0 26 26 0
|
| TID state cpu usage thread name
| 8246 S 0.32 % ipfixcol:TCP
| 8247 S 0.00 % ipfixcol:TCP
| 8248 S 0.11 % ipfixcol OM
| 8249 R 0.00 % ipfixcol:stats
| 8250 S 0.00 % ipfixcol OM
| 8494 S 0.00 % ipfixcol:TCP
| 8495 S 0.00 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 419 / 8192
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
| Time: 1523513932, run time: 100
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 2920 2911 0 28 28 0
|
| TID state cpu usage thread name
| 8246 S 0.32 % ipfixcol:TCP
| 8247 S 0.00 % ipfixcol:TCP
| 8248 R 0.00 % ipfixcol OM
| 8249 R 0.00 % ipfixcol:stats
| 8250 S 0.00 % ipfixcol OM
| 8494 S 0.00 % ipfixcol:TCP
| 8495 S 0.00 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 1 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 698 / 8192
WARNING: forwarding: Template manager of the plugin doesn't have a template (Template ID: 441) for a Data Set from a source with the ODID 0. Some records will be definitely lost.
... huge amount of warnings
Statistics
| Time: 1523513942, run time: 110
| ODID packets data rec. lost data rec. packets/s data records/s lost data records/s
| 0 3203 3194 0 28 28 0
|
| TID state cpu usage thread name
| 8246 S 0.21 % ipfixcol:TCP
| 8247 S 0.00 % ipfixcol:TCP
| 8248 S 0.00 % ipfixcol OM
| 8249 R 0.00 % ipfixcol:stats
| 8250 S 0.00 % ipfixcol OM
| 8494 S 0.00 % ipfixcol:TCP
| 8495 S 0.11 % out:forwarding
|
| Queue utilization:
| Preprocessor output queue: 0 / 8192
| Output Manager output queues:
| ODID | waiting / total size
| 0 982 / 8192
Looks like it keeps the messages with unmatched templates somehow?
Note: My sender sends and re-sends all templates when a connection is established (and then every x amount of time).
Note: Restarting instead of reloading works properly.
Note: When messages are queuing up, it even fails to stop in a sane amount of time; I need to hard-kill it.
# ss -p -t | grep ipfixcol
ESTAB 0 0 127.0.0.1:53812 127.0.0.1:ipdr-sp users:(("ipfixcol:TCP",pid=9162,fd=6))
ESTAB 0 0 127.0.0.1:54034 127.0.0.1:ipdr-sp users:(("ipfixcol:TCP",pid=9162,fd=8))
SYN-SENT 0 1 10.10.0.1:56524 10.10.10.3:ipfix users:(("ipfixcol:TCP",pid=9162,fd=9))
ESTAB 0 0 10.10.0.1:42192 10.10.10.2:ipfix users:(("ipfixcol:TCP",pid=9162,fd=7))
ESTAB 0 0 ::ffff:127.0.0.1:ipfix ::ffff:127.0.0.1:39126 users:(("ipfixcol:TCP",pid=9162,fd=4))
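The sender-side behavior noted above (re-send all templates when a connection is established, and then periodically) can be sketched roughly like this (hypothetical names and an assumed 30-second interval; the real exporter's interval is unspecified):

```python
RESEND_INTERVAL = 30.0  # assumed value; "every x amount of time" in the note


class Exporter:
    """Minimal sketch of an exporter that re-announces its templates."""

    def __init__(self, templates):
        self.templates = templates
        self.last_resend = 0.0

    def on_connect(self, now):
        # Send every known template first, so a freshly (re)connected
        # collector can interpret the data records that follow.
        self.last_resend = now
        return list(self.templates)

    def on_tick(self, now):
        # Periodic re-announcement, so late joiners eventually recover.
        if now - self.last_resend >= RESEND_INTERVAL:
            self.last_resend = now
            return list(self.templates)
        return []


exp = Exporter(["T256", "T441"])
assert exp.on_connect(0.0) == ["T256", "T441"]  # templates sent on connect
assert exp.on_tick(10.0) == []                  # too early for a resend
assert exp.on_tick(30.0) == ["T256", "T441"]    # periodic resend
```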
Now I know what your problem is. The reconfiguration process of the collector allows you to add and remove plugin instances at runtime. However, if the configuration of a running instance is changed, the collector destroys the instance and starts it over. Because the new instance has missed the templates, it is not able to forward flows. That is why a restart of the collector works correctly (templates are sent by a TCP exporter during the initialization of an input plugin) while reconfiguration doesn't. Unfortunately, I'm afraid there is no easy solution right now other than restarting the collector. However, the next generation of IPFIXcol (coming probably later this summer) will support runtime reconfiguration of instances too.
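The template dependency described here can be illustrated with a minimal sketch (illustrative class and method names, not the actual plugin code): IPFIX data sets reference templates only by numeric ID, so a freshly started instance with an empty template cache cannot interpret records until the exporter re-sends the definitions.

```python
class TemplateManager:
    """Minimal sketch of a per-ODID template cache."""

    def __init__(self):
        # (odid, template_id) -> template definition
        self.templates = {}

    def add_template(self, odid, template_id, fields):
        self.templates[(odid, template_id)] = fields

    def forward(self, odid, template_id, record):
        tmpl = self.templates.get((odid, template_id))
        if tmpl is None:
            # This is the situation behind the repeated warning:
            # "Template manager ... doesn't have a template (Template ID: 441)"
            return None  # record cannot be interpreted and is lost
        return (tmpl, record)


mgr = TemplateManager()
mgr.add_template(0, 441, ["srcIP", "dstIP", "bytes"])
assert mgr.forward(0, 441, b"...") is not None  # template known: forwarded

# After reconfiguration a *new* instance starts with an empty cache, so
# records are dropped until template 441 is re-sent by the exporter.
fresh = TemplateManager()
assert fresh.forward(0, 441, b"...") is None
```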
I also tried to reproduce your steps so I could confirm the behavior, but even after reconfiguration the messages in my queue are processed normally.
Great, thank you for the explanation. It's OK for me to restart the service, so I will go this way for now.
I'll try to double-check whether the messages inside the queue are being processed once all templates are known again.
Hi!
Thanks for creating this IPFIX tools collection! I just wanted to test some things with the forwarder to multiple destinations (TCP), which works really well as long as they are alive. But I've noticed that if one destination is gone (probably forever, or just for a very long time), the queues fill up and the whole process stalls once they are full. No other destination receives data anymore. I'm not sure if this needs some kind of dead man's switch or watchdog that keeps trying to reconnect but discards messages while the destination is gone.
Did I just overlook a configuration key, or is this by design?
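The stall described here is the classic bounded-queue backpressure problem: with blocking puts, one dead destination eventually fills the output queue and stalls the whole pipeline, while a drop-on-full policy would keep the live destinations fed. A rough sketch of the difference (illustrative only, not the actual ipfixcol queue implementation):

```python
from queue import Full, Queue

QUEUE_SIZE = 4
q = Queue(maxsize=QUEUE_SIZE)  # consumer is "dead": nothing ever drains q

# With q.put(msg) (blocking), the fifth message would block forever and
# stall every destination. Drop-on-full keeps the pipeline moving:
dropped = 0
for msg in range(10):
    try:
        q.put_nowait(msg)  # raises Full instead of blocking forever
    except Full:
        dropped += 1       # message discarded for the dead destination

assert q.qsize() == QUEUE_SIZE  # queue capped at its bound
assert dropped == 6             # the overflow was discarded, not queued
```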