@dreamtalen do you think you could take a look?
Sure, I will try. @robcowart Could you please share the steps you used to capture this packet with the wrong start and stop times?
The environment is a new lab, consisting of...
kubeadm init --pod-network-cidr=10.244.0.0/16
Antrea was deployed by...
https://github.com/vmware-tanzu/antrea/releases/download/v0.10.1/antrea.yml
antrea-agent.conf
changing the following values...
AntreaProxy: true
FlowExporter: true
flowCollectorAddr: "192.168.9.21:4738:udp"   <--- the port is intentionally different from the standard 4739
flowPollInterval: "5s"
flowExportFrequency: 12
kubectl apply -f antrea.yml
That's it. Just a really simple starting point. When setup in this manner, the dates I am receiving are as shown above.
Interestingly, ExportTime: 1603209328 is correct, which shows that it isn't the nodes themselves. I suspect that the issue is related to a missed or improper conversion happening somewhere. flowStartSeconds and flowEndSeconds should be unsigned values according to the IPFIX specifications. However, Golang's time library uses signed values. Personally I would consider converting the timestamps to uint64 values when they are first fetched and then handling them as uint64 throughout the rest of the code rather than as Time types. Just my $.02
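For illustration only, here is a minimal Go sketch of that suggestion (not Antrea code): convert the timestamp to an unsigned seconds value at the point where it is fetched, guarding against unset values.

```go
package main

import (
	"fmt"
	"time"
)

// toDateTimeSeconds returns the IPFIX dateTimeSeconds representation
// (unsigned seconds since the UNIX epoch) for t.
func toDateTimeSeconds(t time.Time) uint32 {
	s := t.Unix() // int64; negative for the zero/unset time
	if s < 0 {
		return 0 // treat unset or pre-epoch timestamps as 0 rather than letting them wrap
	}
	return uint32(s)
}

func main() {
	fmt.Println(toDateTimeSeconds(time.Now()))
	fmt.Println(toDateTimeSeconds(time.Time{})) // zero value -> 0, not a huge wrapped number
}
```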
It may be completely unrelated to this issue, but I have also noticed that when flow export is enabled, the antrea-agent will consume 100% of a CPU core. It returns to normal after disabling flow export.
I deployed a cluster with Antrea and the same config as you (except for flowCollectorAddr of course), but didn't observe high CPU usage. I even deployed some Pods to generate a lot of iperf traffic, but Antrea resource usage stayed pretty low:
110208 systemd+ 20 0 13076 584 544 S 62.8 0.0 0:08.93 iperf
110326 systemd+ 20 0 13076 604 564 S 55.8 0.0 0:05.23 iperf
108606 systemd+ 20 0 24064 3548 608 S 52.8 0.0 0:43.18 iperf
110099 systemd+ 20 0 13076 524 484 S 15.6 0.0 0:06.36 iperf
6092 root 20 0 1340004 98204 67632 S 4.0 1.2 4:19.20 kubelet
6117 root 20 0 1345216 111984 40696 S 3.0 1.4 2:55.19 dockerd
16 root 20 0 0 0 0 S 1.7 0.0 0:25.35 ksoftirqd/1
108006 root 20 0 970844 52520 35500 S 1.0 0.7 0:01.97 antrea-agent
6171 root 20 0 1391676 48896 25156 S 0.7 0.6 0:32.51 containerd
95107 ubuntu 20 0 4825808 1.2g 28968 S 0.7 16.1 0:37.64 java
I generated a lot of traffic (but not that many connections). I left it running for a while but things were stable. Is there anything useful in the antrea-agent logs? Anything specific I should be doing to reproduce?
On a side note, we should probably have a mechanism to enable Go profiling and the pprof HTTP server in the Antrea components, so that when a user runs into an issue like this one it is easy to collect information (goroutine backtrace, CPU profile, etc.). I'll open an issue to this effect.
@antoninbas I am going to walk through the whole process again and will document each step for repeatability. If the CPU utilization issue remains, I feel I should open a separate issue, unless you think it should stay here.
Hi Rob, I failed to reproduce the wrong start and stop timestamps on my side. I'm not sure if it's related to the different IPFIX collectors we are using. Would you please try our ELK Flow Collector at https://github.com/vmware-tanzu/antrea/blob/master/docs/network-flow-visibility.md? It provides a Kibana dashboard which is easy to use and is recommended for Antrea flow visualization. Also, please tell me which tool you used to capture this IPFIX record so I can try it too. Thanks!
@robcowart Thanks! Yes please open a separate issue when you get the steps documented.
@dreamtalen @zyiou could we take a deeper look at the timestamp thing? I know that the timestamps look correct (at least flowStartSeconds does) when looking at the records using the ELK stack configuration we provide. But there is something fishy when looking at the records in Wireshark. https://tools.ietf.org/html/rfc7011#section-6.1.7 does state:
The dateTimeSeconds data type is an unsigned 32-bit integer in network byte order containing the number of seconds since the UNIX epoch, 1 January 1970 at 00:00 UTC, as defined in [POSIX.1]. dateTimeSeconds is encoded identically to the IPFIX Message Header Export Time field. It can represent dates between 1 January 1970 and 7 February 2106 without wraparound; see Section 5.2 for wraparound considerations.
Whereas in our case we are using a signed 64-bit integer. The RFC says that the encoding should be the same as the ExportTime field, for which we use a uint32 (https://github.com/vmware/go-ipfix/blob/9ed57c1a11c1df64715bc23ee9ff4d7730d148a1/pkg/entities/message.go#L29). So I think there is definitely something wrong here. Please take a look. And thanks @robcowart for finding this.
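To make the encoding difference concrete, here is a small standalone Go sketch (not go-ipfix code) showing the 4-byte big-endian encoding the RFC asks for, next to an 8-byte encoding of the same timestamp, which a collector expecting dateTimeSeconds will misread.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

func main() {
	now := time.Now().Unix()

	// Correct per RFC 7011: 4 bytes, unsigned, network byte order,
	// identical to the message header Export Time field.
	var ok [4]byte
	binary.BigEndian.PutUint32(ok[:], uint32(now))

	// What an 8-byte (int64-sized) field looks like on the wire.
	var wrong [8]byte
	binary.BigEndian.PutUint64(wrong[:], uint64(now))

	fmt.Printf("4-byte dateTimeSeconds: % x\n", ok)
	fmt.Printf("8-byte field:           % x\n", wrong)
}
```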
The collector is irrelevant in this case as I was seeing this on the wire via a PCAP before it reached the collector. Seeing it also in the collector just confirms what was seen on the wire.
Can you share anything else about your environment? As I mentioned above, I am on Ubuntu 20.04, K8S 1.19.3 via kubeadm init. How does this compare to what you are testing?
I will walk through my environment setup from scratch again, but it will be a few days.
@antoninbas I remember that flowStartSeconds and flowEndSeconds were uint32 initially, but we manually modified the data type to int64 when matching the Logstash fields and exporter fields. @srikartati I will change them back to uint32 if there is no further concern. Thanks!
Thanks @robcowart for raising this issue.
Currently, we use an unsigned 64-bit data type for dateTimeSeconds, which is a bug. https://github.com/vmware/go-ipfix/blob/master/pkg/entities/ie.go#L370
As correctly pointed out by @antoninbas, according to the RFC, dateTimeSeconds should be unsigned 32-bit, while dateTimeMilliseconds should still be unsigned 64-bit. The bug is in the format for dateTimeSeconds: it should be changed to the uint32 type and the size should be 4 bytes. I think the confusion arose because of the int64 format for Unix seconds in the Go time package. @zyiou Agreed. Let's fix this in go-ipfix and do a patch release. Then we should fix this in Antrea. Please change the type for the Logstash field accordingly too. Thanks.
In addition, I concur with Rob's suggestion that the go-ipfix library should let the user provide the value directly as uint32 for dateTimeSeconds and uint64 for dateTimeMilliseconds, rather than expecting the time.Unix() format provided by the Go time package.
An orthogonal point: right now, flowEndTime is not properly encoded in the flow record. Conntrack flows do not get a stopTime until the conntrack flow is deleted. https://github.com/torvalds/linux/blob/9ff9b0d392ea08090cd1780fb196f36dbb586529/net/netfilter/nf_conntrack_core.c#L664
Currently, we use the conntrack stopTime as flowEndTime for every flow record of a conntrack flow, so we see some arbitrary values. One option is to use the current time as flowEndTime for every flow record. A second option is to encode it as 0 in each flow record while the flow is active, and put the timestamp in the last flow record when the flow becomes inactive, that is, when the flow is deleted from the conntrack table.
@antoninbas @robcowart Any suggestions?
I would suggest using 0 to make it clear that the value is not valid yet. I couldn't find any guidance for flowEnd* fields in the IPFIX RFCs, though I may have missed it. On the one hand, it makes you wonder whether you should only export data after a flow "terminates". On the other hand, the existence of fields like octetDeltaCount would imply that this is not the case and that exporting information half-way through a flow's lifetime is OK.
BTW, a related question: when a flow ends, are we guaranteed to export correct final packet / byte count values? Or does this guarantee only hold if the polling interval is less than the time-wait delay (120s by default IIRC)? Have we considered listening to conntrack events as an alternative / complementary implementation to polling, at least for Linux?
Most network devices send flow records based on one or more configurable timeout settings. While they will track a flow for a longer period of time, they will send flows more frequently. This is necessary for a number of reasons.
Many server-to-server TCP connections can live for weeks, months or longer (e.g. the connection between Kibana or Logstash to Elasticsearch). They essentially never end unless stopped or there is some error. Without a periodic update the user could literally wait months for the information. What good is the rest of their data until the picture is complete? Not much.
Now consider that the above long-lived flow was a hacker exfiltrating data from your database. The flow record that could reveal their activity is never sent until the damage is already done, and you lose the opportunity to shut them down. Not good.
If all of the information about a flow is included in a single record at the end of the flow, how do you distribute it over the life of the flow? All at the beginning? All at the end? Unfortunately most visualization tools, like Kibana, and even many purpose-built flow applications, do not provide the ability to smooth the data into buckets over the life of the flow. Even if they did, each query would still have to look into the future to determine whether any flows had start times within the currently visualized window. This is very tricky at best, if not close to impossible. However, when the device sends regular updates about the flows it is much easier.
If I explained those well, you should see that it is very desirable to send information about a long-lived flow over the life of the flow. Most network devices will provide various options to control this. Some are simple:
inactive_timeout = 15 - export a flow after it has been inactive for 15 seconds.
active_timeout = 60 - export a flow every 60 seconds while it is active.
Others provide more granular control:
expiry-interval 10 - the interval between expiry checks, i.e. every 10 seconds it is checked which flows are ready for export
icmp 60 - export ICMP flows every 60 seconds
tcp-generic 60 - export TCP flows every 60 seconds
tcp-fin 20 - export a flow 20 seconds after observing a TCP FIN flag
tcp-rst 20 - export a flow 20 seconds after observing a TCP RST flag
udp 60 - export UDP flows every 60 seconds
flow-generic 60 - applies to any IP protocol not covered by the other timeouts
max-active-life 604800 - track no flow longer than 1 week (although tracking can pick up again if the session is still active, usually with a new flow ID)
I explain this difference just to show the options that are common. The expectation is that it should be possible to at least provide the former. The only detail I would add is that the inactive and active periods should be tracked per flow, not globally. For example, if the 60-second export period were global, i.e. every 60 seconds all active flows are exported, you could end up with 59 seconds of no data followed by a flood of records from across the whole infrastructure arriving simultaneously. This can easily overwhelm some collection systems, especially Linux systems with default networking kernel parameters. Tracking the inactive and active timeouts per flow better distributes the load on the collecting system and provides more timely and accurate reporting. In 2020 I would say it is generally accepted that 1 minute is a good compromise between data granularity and the overall volume of records generated, with short-lived flows being exported quickly due to the inactive timeout.
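To make the per-flow idea concrete, here is a minimal Go sketch of the decision described above; the type and field names (flowState, lastActivity, lastExport) are hypothetical and not Antrea's implementation.

```go
package main

import "time"

type flowState struct {
	lastActivity time.Time // last time packet/byte counts changed
	lastExport   time.Time // last time a record was sent for this flow
}

const (
	inactiveTimeout = 15 * time.Second
	activeTimeout   = 60 * time.Second
)

// shouldExport is evaluated per flow on every poll, so exports spread out
// over time instead of all firing on a single global timer.
func shouldExport(f flowState, now time.Time) bool {
	if now.Sub(f.lastActivity) >= inactiveTimeout {
		return true // flow went quiet: flush it
	}
	if now.Sub(f.lastExport) >= activeTimeout {
		return true // flow still active: send a periodic update
	}
	return false
}

func main() {
	now := time.Now()
	f := flowState{lastActivity: now.Add(-5 * time.Second), lastExport: now.Add(-70 * time.Second)}
	_ = shouldExport(f, now) // true here: the active timeout has elapsed
}
```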
Now that we have established that records should be sent periodically over the lifetime of a network flow, the next question is which fields, or "information elements" (IEs), should be included. While there are both delta and total IEs for things like bytes and packets, almost every vendor sends only the delta values, where the delta is the quantity since the previous record for this flow was exported (or since the flow started, for the initial record).
Sending only delta values is usually not an issue, as it is easy enough at query time to sum the deltas when a total is needed. I have seen a few examples of vendors sending the total values as well, but it isn't really important. What should be avoided is sending only total values. That makes it very challenging to work with the data later, especially in combination with records from other sources that provide only delta values. There is nothing wrong with including the total values, but I would simply remove them and gain a bit of efficiency.
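For illustration, a small sketch of the delta bookkeeping described above (hypothetical names, not tied to any particular exporter): keep the running totals internally and export the difference since the previous record.

```go
package main

import "fmt"

type counters struct {
	octetsTotal, packetsTotal                 uint64 // running totals from the datapath
	octetsAtLastExport, packetsAtLastExport   uint64
}

// deltas returns octetDeltaCount / packetDeltaCount for the next record and
// remembers the totals so the next export starts from here.
func (c *counters) deltas() (octetDelta, packetDelta uint64) {
	octetDelta = c.octetsTotal - c.octetsAtLastExport
	packetDelta = c.packetsTotal - c.packetsAtLastExport
	c.octetsAtLastExport = c.octetsTotal
	c.packetsAtLastExport = c.packetsTotal
	return
}

func main() {
	c := counters{octetsTotal: 6565, packetsTotal: 85}
	o, p := c.deltas()
	fmt.Println(o, p) // the first record carries everything seen so far
}
```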
Regarding the start and end timestamps: I have to admit that the RFCs aren't really clear here. My feeling is that for active flows, flowEndSeconds should be set to the time that the last measurement was taken. However, I am going to spend some time going through my collection of PCAPs from different vendors and try to determine whether there is any consensus best practice. I will get back to you on this.
The last thing I will mention is UDP vs. TCP. I noticed in the Antrea repo examples that the flow exporter was configured to use TCP. However, I set mine up to use UDP. Like many other network-related data sources (SNMP, syslog, etc.), NetFlow was designed to be sent via UDP. The logic is that establishing a TCP session has much more overhead and requires much more network traffic (especially in the reverse direction for ACKs and such) than UDP. During an outage, it is generally undesirable for the overhead of management traffic re-establishing sessions to compete with applications trying to do the same. This is so embedded in the networking mindset that many devices, as well as collectors, do not even support TCP. In fact, IPFIX is the only flow standard to even specify optional support for TCP. Since the Antrea exporter supports both, this isn't an issue. However, you should be aware that if people follow the documented example config exactly, some of them might encounter issues because their collector doesn't support TCP.
Hopefully this was helpful. Let me know if there are any questions that I can help answer.
@srikartati @zyiou @dreamtalen could we investigate the possibility of implementing inactive_timeout and active_timeout as @robcowart suggested, to limit burstiness and maybe have a more compliant implementation? As part of this it may be interesting to add flowEndReason to the records (https://www.iana.org/assignments/ipfix/ipfix.xhtml#ipfix-flow-end-reason), to indicate whether the record was sent because of the active_timeout, the inactive_timeout, or something else. Looking at flowEndReason, it seems that flowEndSeconds should indeed be the time at which the flow was last measured to be active.
Thanks @robcowart for detailed comments.
@antoninbas Yes, an IPFIX field based on the status flag in polled connections was planned to be implemented, which is similar to flowEndReason. The active_timeout and inactive_timeout suggestion is definitely interesting; currently we have only an export frequency for all flows and do not differentiate active and inactive flows.
Considering flowEndSeconds as the time of the last poll done before exporting the flow makes sense.
BTW, a related question: when a flow ends, are we guaranteed to export correct final packet / byte count values? Or does this guarantee only hold if the polling interval is less than the time-wait delay (120s by default IIRC)? Have we considered listening to conntrack events as an alternative / complementary implementation to polling, at least for Linux?
This is a good point. Currently, we do not have any maximum for the polling interval, so the final counts are accurate only if it is less than the time-wait delay. Regarding event-based tracking of connections, there are some ideas to use eBPF programs to get the flow information in addition to polling. We do not have any concrete proposal for now.
Closing this issue (fixed in PR #1479).
@zyiou, @srikartati this issue should be reopened. I have tested v0.11.1, and while it is partially fixed, there is still a problem. The flow start time is now correct. However, the flow end time is still way in the future. It is not as far in the future as when I originally reported this issue, but it is still wrong.
Cisco NetFlow/IPFIX
Version: 10
Length: 184
Timestamp: Nov 22, 2020 12:27:29.000000000 CET
FlowSequence: 2
Observation Domain Id: 4241174506
Set 1 [id=256] (1 flows)
FlowSet Id: (Data) (256)
FlowSet Length: 168
[Template Frame: 36]
Flow 1
[Duration: 682868247.000000000 seconds (seconds)]
StartTime: Nov 22, 2020 12:26:33.000000000 CET
EndTime: Jul 14, 2042 03:04:00.000000000 CEST
SrcAddr: 10.244.2.2
DstAddr: 192.168.8.11
SrcPort: 43058
DstPort: 6443
Protocol: TCP (6)
Permanent Packets: 85
Permanent Octets: 6565
Packets: 0
Octets: 0
Permanent Packets: 88 (Reverse Type 86 PACKETS_TOTAL)
Permanent Octets: 49145 (Reverse Type 85 BYTES_TOTAL)
Packets: 0 (Reverse Type 2 PKTS)
Octets: 0 (Reverse Type 1 BYTES)
Enterprise Private entry: ((Unknown)) Type 101: Value (hex bytes): 63 6f 72 65 64 6e 73 2d 66 39 66 64 39 37 39 64 36 2d 72 62 67 6c 78
Enterprise Private entry: ((Unknown)) Type 100: Value (hex bytes): 6b 75 62 65 2d 73 79 73 74 65 6d
Enterprise Private entry: ((Unknown)) Type 104: Value (hex bytes): 6b 6e 6f 64 65 32
Enterprise Private entry: ((Unknown)) Type 103: Value (hex bytes):
Enterprise Private entry: ((Unknown)) Type 102: Value (hex bytes):
Enterprise Private entry: ((Unknown)) Type 105: Value (hex bytes):
Enterprise Private entry: ((Unknown)) Type 106: Value (hex bytes): 0a 60 00 01
Enterprise Private entry: ((Unknown)) Type 109: Value (hex bytes): 64 65 66 61 75 6c 74 2f 6b 75 62 65 72 6e 65 74 65 73 3a 68 74 74 70 73
Enterprise Private entry: ((Unknown)) Type 110: Value (hex bytes):
Enterprise Private entry: ((Unknown)) Type 111: Value (hex bytes):
Enterprise Private entry: ((Unknown)) Type 112: Value (hex bytes):
Enterprise Private entry: ((Unknown)) Type 113: Value (hex bytes):
@robcowart You are right. I just checked the code. We are getting the stopTime from the conntrack table; instead, per the discussion in this issue, we should record the stopTime as the time when the flow record is created (time.Now). https://github.com/vmware-tanzu/antrea/blob/master/pkg/agent/flowexporter/exporter/exporter.go#L252
@zyiou Is it possible for you to add this in one of the PRs you have already opened? Thanks.
Sure. I can add this in #1582 when I finish the upgrade to go-ipfix v0.3.1.
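Something along these lines, as a rough sketch only (hypothetical types and field names, not the actual patch): stamp flowEndSeconds with the record creation time for connections still present in conntrack, instead of the unset conntrack stop time.

```go
package main

import (
	"fmt"
	"time"
)

type connection struct {
	StopTime time.Time // filled by conntrack only when the entry is deleted
	IsActive bool
}

// flowEndSeconds picks the record creation time for live flows and the
// conntrack stop time once the flow has actually ended.
func flowEndSeconds(conn connection, now time.Time) uint32 {
	if conn.IsActive || conn.StopTime.IsZero() {
		return uint32(now.Unix())
	}
	return uint32(conn.StopTime.Unix())
}

func main() {
	fmt.Println(flowEndSeconds(connection{IsActive: true}, time.Now()))
}
```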
I have tested v0.12.0 and there are still issues. Previously (see my comment from 22 Nov 2020) the "start" time was fixed, but the "end" time was wrong. Now the opposite is true. As you can see in the PCAP below, "end" is now correct, but "start" is again wrong. If one of the two had to be wrong, it is most important that the end time is correct. However, it would be best for both to be correct.
Cisco NetFlow/IPFIX
Version: 10
Length: 186
Timestamp: Jan 14, 2021 05:35:33.000000000 CET
FlowSequence: 6
Observation Domain Id: 4241174507
Set 1 [id=256] (1 flows)
FlowSet Id: (Data) (256)
FlowSet Length: 170
[Template Frame: 407]
Flow 1
[Duration: -678313707.000000000 seconds (seconds)]
StartTime: Jul 14, 2042 03:04:00.000000000 CEST
EndTime: Jan 14, 2021 05:35:33.000000000 CET
SrcPort: 57442
DstPort: 6443
Protocol: TCP (6)
Permanent Packets: 0
Permanent Octets: 0
Packets: 0
Octets: 0
SrcAddr: 10.244.3.2
DstAddr: 192.168.8.11
Permanent Packets: 0 (Reverse Type 86 PACKETS_TOTAL)
Permanent Octets: 0 (Reverse Type 85 BYTES_TOTAL)
Packets: 0 (Reverse Type 2 PKTS)
Octets: 0 (Reverse Type 1 BYTES)
Enterprise Private entry: (Project Antrea) Type 101: Value (hex bytes): 63 6f 72 65 64 6e 73 2d 66 39 66 64 39 37 39 64 36 2d 63 70 36 66 62
Enterprise Private entry: (Project Antrea) Type 100: Value (hex bytes): 6b 75 62 65 2d 73 79 73 74 65 6d
Enterprise Private entry: (Project Antrea) Type 104: Value (hex bytes): 6b 6e 6f 64 65 33
Enterprise Private entry: (Project Antrea) Type 103: Value (hex bytes):
Enterprise Private entry: (Project Antrea) Type 102: Value (hex bytes):
Enterprise Private entry: (Project Antrea) Type 105: Value (hex bytes):
Enterprise Private entry: (Project Antrea) Type 108: Value (hex bytes): 01 bb
Enterprise Private entry: (Project Antrea) Type 109: Value (hex bytes): 64 65 66 61 75 6c 74 2f 6b 75 62 65 72 6e 65 74 65 73 3a 68 74 74 70 73
Enterprise Private entry: (Project Antrea) Type 110: Value (hex bytes):
Enterprise Private entry: (Project Antrea) Type 111: Value (hex bytes):
Enterprise Private entry: (Project Antrea) Type 112: Value (hex bytes):
Enterprise Private entry: (Project Antrea) Type 113: Value (hex bytes):
Enterprise Private entry: (Project Antrea) Type 106: Value (hex bytes): 0a 60 00 01
@robcowart Thanks for reporting this. Did you use the Flow Aggregator service when you tested, or did you export flow records directly from the Flow Exporter to the flow collector?
Directly exported from flow exporter.
@robcowart Does it happen every time you export flow records, or does it happen occasionally? @srikartati I checked all the recently exported flow records and the flowStartSeconds values are correct. Since the startTime in the case above seems to be the default value for an empty time, is it possible that the startTime from the conntrack table is not filled? If so, do we need to check and set a value when startTime is invalid?
@zyiou Yes, that is my follow-up comment. If this case is not happening for every flow, then we need to look at the conntrack dumping API from the github.com/ti-mo/conntrack package, from which we get the startTime timestamp. It could be that the start timestamp is not filled; in that case we should check whether it is invalid and set it the first time we convert the netlink connection to an Antrea connection here.
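A sketch of that guard (hypothetical names, not the actual Antrea code): fall back to the current time when the kernel did not supply a start timestamp.

```go
package main

import (
	"fmt"
	"time"
)

type antreaConnection struct {
	StartTime time.Time
}

// convert builds the Antrea-side connection, filling in a start time if the
// conntrack dump did not provide one.
func convert(kernelStart time.Time) antreaConnection {
	if kernelStart.IsZero() {
		kernelStart = time.Now() // first time we see the flow
	}
	return antreaConnection{StartTime: kernelStart}
}

func main() {
	fmt.Println(convert(time.Time{}).StartTime)
}
```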
@robcowart I just want to make sure the Antrea ConfigMap parameter is at the default value, ovsDatapathType: system, and not netdev. Could you confirm that?
@zyiou I checked the PCAP of records that I made. It isn't all of the records, but it is at least 80% or more. If it is relevant... this environment is not very busy. It has only a few example deployments to test the IPFIX exporter.
@srikartati ovsDatapathType is commented out, so I assume that the default of system is in effect.
@robcowart Thanks for confirming. Could you please check if the following sysctl config is set in your system? /proc/sys/net/netfilter/nf_conntrack_timestamp
Every node in the cluster has the same value...
knode2:~# cat /proc/sys/net/netfilter/nf_conntrack_timestamp
1
Describe the bug: IPFIX start and stop timestamps are wrong/too large.
To Reproduce: I simply installed a new Kubernetes lab environment and installed Antrea.
Expected: Start and stop timestamps sent in the IPFIX record should be the actual start and stop times.
Actual behavior: As seen in the PCAP output below, the timestamps are ~85 1/2 years in the future.
The hex value is...
ff ff ff f1 88 6e 09 00
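If I am decoding that value correctly, it lines up with the symptoms: ff ff ff f1 88 6e 09 00 read as a big-endian signed 64-bit integer is -62135596800, which is exactly what time.Time{}.Unix() returns in Go for the zero (unset) time; and a collector that reads only the leading four bytes as an unsigned 32-bit dateTimeSeconds gets 0xFFFFFFF1 = 4,294,967,281 seconds since the epoch, i.e. early February 2106, matching the ~85 1/2 year offset.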
Versions:
Additional context:
And the hex dump...