envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

downstream_rx_datagram_dropped shows big number with big jump when running UDP traffic. #35142

Open nev888 opened 1 month ago

nev888 commented 1 month ago

listener..udp.downstream_rx_datagram_dropped shows a very large value and jumps by huge amounts while UDP traffic is running. Stats collected over a few minutes:

listener..udp.downstream_rx_datagram_dropped: 68763983128
listener..udp.downstream_rx_datagram_dropped: 210235537380
listener..udp.downstream_rx_datagram_dropped: 210235537380
listener..udp.downstream_rx_datagram_dropped: 210235537380
listener..udp.downstream_rx_datagram_dropped: 211333341100
listener..udp.downstream_rx_datagram_dropped: 212764708258
listener..udp.downstream_rx_datagram_dropped: 215293879136
listener..udp.downstream_rx_datagram_dropped: 215973672978 (last one after stopping UDP traffic)
listener..udp.downstream_rx_datagram_dropped: 215973672978
listener..udp.downstream_rx_datagram_dropped: 215973672978
listener..udp.downstream_rx_datagram_dropped: 215973672978

Does this counter show the number of datagrams dropped for a specific listener? These numbers don't look realistic from a traffic perspective; there is not that much traffic at all.

We are using Envoy as an L7 load balancer for SIP traffic. On the client (downstream) side, traffic is received over TCP/UDP and is load balanced to the application Pod (upstream) over gRPC.

stats.txt server_info.txt clsuters.txt

Our Envoy build is extended with our own code for our specific use case.

nezdolik commented 1 month ago

cc @mattklein123 @danzh2010

danzh2010 commented 1 month ago

Please adjust your listen socket's receive buffer size; https://www.envoyproxy.io/docs/envoy/latest/intro/arch_overview/http/http3#downstream-stats

nev888 commented 1 month ago

Thanks @danzh2010. If the kernel's UDP listen socket receive buffer isn't large enough, could that cause other issues? We came across this counter value while investigating a memory leak.

danzh2010 commented 1 month ago

> Thanks @danzh2010. If the kernel's UDP listen socket receive buffer isn't large enough, could that cause other issues? We came across this counter value while investigating a memory leak.

It would become a bandwidth limitation, but it would not cause a memory leak.

nev888 commented 1 month ago

Any recommendation for how big it should be? Currently it's ~9.5 MB.

danzh2010 commented 1 month ago

> Any recommendation for how big it should be? Currently it's ~9.5 MB.

This depends on your bandwidth. Note that the Linux kernel doubles the number you supply via setsockopt.

Also, please keep in mind that the stat is cumulative.
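The doubling mentioned above can be observed directly. The following is a minimal sketch, not Envoy code: the helper name `effective_rcvbuf` and the 64 KiB request are illustrative assumptions. It sets `SO_RCVBUF` on a plain UDP socket and reads back what the kernel reports. On Linux the reported value is typically twice the requested one, and the request is capped by `net.core.rmem_max`; other platforms may behave differently.

```python
import socket


def effective_rcvbuf(requested_bytes: int) -> int:
    """Request a UDP receive buffer size and return the size the kernel reports.

    Hypothetical helper for illustration only; it is not part of Envoy.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


requested = 64 * 1024  # assumed example value, unrelated to the ~9.5 MB above
effective = effective_rcvbuf(requested)
print(f"requested={requested} bytes, kernel reports={effective} bytes")
```

If the reported value is lower than expected, the system-wide limit `net.core.rmem_max` is the likely cap and would need to be raised before a larger per-socket buffer takes effect.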

nev888 commented 1 month ago

One more question: do you have any idea why the counter is not in sync with the traffic rate? The first measurement was 68763983128; a few minutes later it was 210235537380. I had traffic running for a few hours at a rate of ~40 calls/sec. I also generated truncated UDP traffic for a few minutes, but nowhere near the numbers I see in the counter.
All the generated traffic could have reached ~2 million packets at most, and only half of it was UDP.

danzh2010 commented 1 month ago

I don't know the ingress rate of your service. Assuming it's a 5-minute range, you have ~400M packet drops per second. You can check netstat -p udp to confirm the numbers are consistent with what the kernel sees.
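The estimate above can be reproduced from the first two samples in the report. This is a small sketch of the arithmetic only; the 5-minute interval is the assumption stated in the comment, not a measured value, and the ~2 million total is the reporter's own upper bound on generated traffic.

```python
def drops_per_second(first: int, last: int, elapsed_seconds: float) -> float:
    """Average increase per second of a cumulative counter between two samples."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (last - first) / elapsed_seconds


first_sample = 68_763_983_128    # first reported counter value
second_sample = 210_235_537_380  # value reported a few minutes later
assumed_interval_s = 5 * 60      # assumption: samples taken 5 minutes apart

delta = second_sample - first_sample
rate = drops_per_second(first_sample, second_sample, assumed_interval_s)

reported_total_packets = 2_000_000  # reporter's estimate of all generated traffic
print(f"counter increase: {delta}")
print(f"implied drop rate: {rate:.0f} datagrams/sec")
print(f"increase vs. total generated traffic: {delta / reported_total_packets:.0f}x")
```

Under that assumption the implied rate is about 470 million drops per second, and the counter increase alone is tens of thousands of times larger than all the traffic the reporter says was generated, which is why the value cannot be a real datagram count.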

nev888 commented 1 month ago

The previous stats were from a Pod that is no longer running. I have a different Pod with the stats below. Traffic was generated with a script, and the rate is a rough estimate.

Packets per second: 38873

downstream_rx_datagram_dropped jumped to this large number from 0 after the script had been running for 1.5 hours:

listener.IPv4_PORT.udp.downstream_rx_datagram_dropped: 61610884148229

bash-4.4$ nstat
IpInReceives 3021072
IpInDelivers 3021072
IpOutRequests 3020920
TcpActiveOpens 17
TcpPassiveOpens 93
TcpEstabResets 7
TcpInSegs 1023
TcpOutSegs 871
TcpOutRsts 32
UdpInDatagrams 690
UdpInErrors 3019359
UdpOutDatagrams 690
UdpInCsumErrors 3019359
TcpExtTCPHPHits 174
TcpExtTCPPureAcks 203
TcpExtTCPHPAcks 317
TcpExtTCPAbortOnData 10
TcpExtTCPAbortOnClose 7
TcpExtTCPRcvCoalesce 9
TcpExtTCPOrigDataSent 437
TcpExtTCPDelivered 454
IpExtInOctets 779239070
IpExtOutOctets 779149461
IpExtInNoECTPkts 3021072

danzh2010 commented 1 month ago

nstat only shows values incremented since the last run; please use 'nstat -a'. Also, I didn't see a dropped-packet count in the result. Can you use netstat -p udp?
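When netstat is not available, the kernel's absolute UDP counters can also be read from /proc/net/snmp on Linux. The sketch below is one hedged way to do that; `parse_udp_counters` is a hypothetical helper, and the sample text is illustrative: it reuses the four UDP values from the nstat output above and assumes zero for the fields nstat did not print. Receive-buffer overflows appear as RcvbufErrors, whereas in the output above every UdpInErrors is also counted as a UdpInCsumErrors.

```python
def parse_udp_counters(snmp_text: str) -> dict:
    """Parse the 'Udp:' header/value line pair from /proc/net/snmp-style text."""
    header = None
    values = None
    for line in snmp_text.splitlines():
        # "UdpLite:" lines do not match because of the colon in the prefix.
        if not line.startswith("Udp:"):
            continue
        fields = line.split()[1:]
        if header is None:
            header = fields   # first Udp: line holds the field names
        else:
            values = fields   # second Udp: line holds the numbers
            break
    if header is None or values is None:
        raise ValueError("no complete Udp section found")
    return dict(zip(header, (int(v) for v in values)))


# Illustrative sample: zeros are assumed for fields missing from the nstat output.
SAMPLE = """\
Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors InCsumErrors IgnoredMulti MemErrors
Udp: 690 0 3019359 690 0 0 3019359 0 0
"""

counters = parse_udp_counters(SAMPLE)
print("RcvbufErrors:", counters["RcvbufErrors"])
print("InErrors:", counters["InErrors"], "InCsumErrors:", counters["InCsumErrors"])
```

On a live Linux host the same function could be fed the contents of /proc/net/snmp, for example `parse_udp_counters(open("/proc/net/snmp").read())`, and the result compared with the Envoy counter.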

danzh2010 commented 1 month ago

Which UDP extension are you using?

nev888 commented 1 month ago

> nstat only shows incremented values since last run, please use 'nstat -a'. And I didn't see the result had dropped packets count. Can you use netstat -p udp?

Tue Jul 16 07:39:46 CEST 2024
listener.IPv4:Port.udp.downstream_rx_datagram_dropped: 89237114699
nstat-2024-07-16_07-39.txt

I don't have netstat, but I can use ss: udp-sockets-2024-07-16_10-59.txt

nev888 commented 1 month ago

> Which UDP extension are you using?

We don't use any UDP extension.

danzh2010 commented 1 month ago

> > Which UDP extension are you using?
>
> We don't use any UDP extension.

Are you using UDP Proxy?

nev888 commented 1 month ago

No.

danzh2010 commented 1 month ago

Can you share your UDP listener config?

nev888 commented 1 month ago

Here is the config for the listener: udp_listener.txt

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

danzh2010 commented 3 weeks ago

If you are using raw UDP, why do you need this?

       "connection_balance_config": {
        "exact_balance": {}
       },

nev888 commented 3 weeks ago

This config is intended for TCP listeners. On the controller side, UDP and TCP listeners are both generated from the same config, which is why it appears here.

danzh2010 commented 3 weeks ago

A UDP listener other than QUIC is connectionless, so you probably don't need that.

nev888 commented 3 weeks ago

Yep, in the UDP case we have no use for it. Do you think this might have anything to do with the counter problem?

danzh2010 commented 3 weeks ago

Not sure. I'm not familiar with how a raw UDP listener interacts with connection_balance_config. If it is the cause, you may see a warning in the Envoy log about packets being dropped in only some of the threads (not all).

shakedm commented 1 week ago

I see a similar problem with a setup for testing UDP in Envoy. The dropped datagrams are in the tens of billions per second, while the incoming traffic is only a couple hundred thousand. Could this be a counter issue? Any idea whether it comes from how Envoy collects the metric or from something deeper down?