COVESA / vsomeip

An implementation of Scalable service-Oriented MiddlewarE over IP

[BUG]: vsomeip slow to restart with lots of EventGroup #690

Open joeyoravec opened 2 weeks ago

joeyoravec commented 2 weeks ago

vSomeip Version

v3.4.10

Boost Version

1.82

Environment

Android and QNX

Describe the bug

My automotive system has a *.fidl with ~3500 attributes, one per CAN signal. My *.fdepl maps each attribute to a unique EventGroup.
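For context, the client-side effect of this deployment looks roughly like the sketch below (plain vsomeip API instead of the generated CommonAPI code; service/instance/event IDs are hypothetical): one EventGroup per attribute means one subscription per attribute, i.e. thousands of individual SOME/IP-SD Subscribe entries.

```cpp
// Minimal sketch (hypothetical IDs, plain vsomeip API instead of the
// generated CommonAPI code): one EventGroup per attribute means one
// request_event()/subscribe() pair per attribute, so ~3500 SOME/IP-SD
// Subscribe entries end up on the wire.
#include <vsomeip/vsomeip.hpp>

int main() {
    auto app = vsomeip::runtime::get()->create_application("signal_client");
    app->init();

    const vsomeip::service_t  service  = 0x1234;   // hypothetical
    const vsomeip::instance_t instance = 0x0001;   // hypothetical
    app->request_service(service, instance);

    // One unique eventgroup per attribute (IDs are placeholders;
    // availability handling omitted for brevity).
    for (vsomeip::event_t event = 0x8001; event < 0x8001 + 3500; ++event) {
        const vsomeip::eventgroup_t group = event;
        app->request_event(service, instance, event, {group},
                           vsomeip::event_type_e::ET_FIELD);
        app->subscribe(service, instance, group);
    }

    app->start();  // blocks; runs the io context
}
```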

Especially when resuming from suspend-to-ram, it's possible that UDP SOME/IP-SD is operational while the TCP socket is broken. This leads to the tcp_client_endpoint (tce) calling restart(), and during this window every Subscribe receives a SubscribeNack in response:

4191    105.781314  10.6.0.10   10.6.0.3    SOME/IP-SD  1408    SOME/IP Service Discovery Protocol [Subscribe]
4192    105.790868  10.6.0.3    10.6.0.10   SOME/IP-SD  1396    SOME/IP Service Discovery Protocol [SubscribeNack]
4193    105.792094  10.6.0.10   10.6.0.3    SOME/IP-SD  1410    SOME/IP Service Discovery Protocol [Subscribe]
4194    105.801525  10.6.0.10   10.6.0.3    SOME/IP-SD  1410    SOME/IP Service Discovery Protocol [Subscribe]
4195    105.802118  10.6.0.3    10.6.0.10   SOME/IP-SD  1398    SOME/IP Service Discovery Protocol [SubscribeNack]
4196    105.819610  10.6.0.3    10.6.0.10   SOME/IP-SD  1398    SOME/IP Service Discovery Protocol [SubscribeNack]

As the number of EventGroups scales up, this becomes catastrophic for performance.

In service_discovery_impl::handle_eventgroup_subscription_nack(), each nacked EventGroup calls restart(): https://github.com/COVESA/vsomeip/blob/cf497232adf84f55947f7a24e1b64e04b49f1f38/implementation/service_discovery/src/service_discovery_impl.cpp#L2517-L2521

In tcp_client_endpoint_impl::restart(), while the endpoint is ::CONNECTING the code will "early terminate" for a maximum of 5 restarts: https://github.com/COVESA/vsomeip/blob/cf497232adf84f55947f7a24e1b64e04b49f1f38/implementation/endpoints/src/tcp_client_endpoint_impl.cpp#L77-L85

After that limit the code falls through, calling shutdown_and_close_socket_unlocked() and performing the full restart even while a connection attempt is still in progress.
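To make the scaling concrete, here is a small self-contained model of the behaviour described above (a toy model, not vsomeip code, and it ignores the connect-time window the real implementation also checks): every SubscribeNack calls restart(); while a connect is in progress at most five consecutive calls are skipped before the next one falls through to a full shutdown-and-reconnect.

```cpp
// Toy model (not vsomeip code) of the restart() behaviour described above.
#include <cstdio>

struct endpoint_model {
    static constexpr int max_aborts = 5;  // "early terminate" limit from the report
    int aborted_restarts = 0;
    long full_restarts = 0;
    bool connecting = false;

    void restart() {
        if (connecting && ++aborted_restarts <= max_aborts) {
            return;  // skip the restart while a connect is in progress
        }
        aborted_restarts = 0;
        connecting = true;     // shutdown_and_close_socket + new connect
        ++full_restarts;
    }
};

int main() {
    endpoint_model ep;
    const long nacks = 3500;   // roughly one SubscribeNack per EventGroup
    for (long i = 0; i < nacks; ++i) {
        ep.restart();          // as called from handle_eventgroup_subscription_nack()
    }
    // Roughly nacks / (max_aborts + 1) full restarts while the link is down.
    std::printf("full restarts triggered: %ld\n", ep.full_restarts);
    return 0;
}
```

Even with the early-terminate limit, roughly every sixth NACK still triggers a full shutdown-and-reconnect, which is why the workload takes multiple seconds to drain.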

As the system keeps processing thousands of SubscribeNacks, this becomes a tight loop at 100% CPU load that takes multiple seconds to plow through the workload. That can easily exceed a 2s ServiceDiscovery interval and cascade into further problems.

Reproduction Steps

My reproduction was the suspend-to-ram resume described above, but any use-case where the server side (tse) closes the TCP socket while UDP remains functional should be sufficient.

Expected behaviour

Recovery after a broken TCP connection should not scale this badly with the number of EventGroups: a burst of SubscribeNacks should not trigger thousands of redundant TCP restarts or pin the CPU at 100% for multiple seconds.

Logs and Screenshots

No response

joeyoravec commented 2 weeks ago

We came up with three possible solutions:

  1. Eliminate the tce restart() call from service_discovery_impl::handle_eventgroup_subscription_nack(). It's not clear why this call is required or how it helps.
  2. Modify tce restart() to "early terminate" better, perhaps an unlimited number of times within the 5-second connect timeout (see the sketch after this list).
  3. Ensure that SOME/IP-SD gets inhibited around any event, like suspend-to-ram, where network communication will be lost, and try to prevent Subscribes until the TCP socket is re-established.
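As a rough sketch of option 2 (again a toy model, not vsomeip code, with an assumed 5-second connect timeout): drop the fixed abort count and only fall through to a full restart once the in-progress connect has been pending longer than the timeout.

```cpp
// Sketch of option 2 (toy model, not vsomeip code): while a connect is in
// progress, skip restart() an unlimited number of times and only fall through
// once the attempt has been pending longer than the connect timeout.
#include <chrono>
#include <cstdio>

struct endpoint_model_v2 {
    using clock = std::chrono::steady_clock;
    static constexpr std::chrono::seconds connect_timeout{5};  // assumed limit
    bool connecting = false;
    long full_restarts = 0;
    clock::time_point connect_started{};

    void restart() {
        if (connecting && clock::now() - connect_started < connect_timeout) {
            return;  // keep waiting for the connect attempt already in flight
        }
        // Only now shut the socket down and start a fresh connect
        // (shutdown_and_close_socket_unlocked() + connect() in the real endpoint).
        connecting = true;
        connect_started = clock::now();
        ++full_restarts;
    }
};

int main() {
    endpoint_model_v2 ep;
    for (int i = 0; i < 3500; ++i) {
        ep.restart();  // a burst of SubscribeNacks within one timeout window
    }
    std::printf("full restarts triggered: %ld\n", ep.full_restarts);  // prints 1
}
```

Under the same 3500-NACK burst this variant performs a single full restart per timeout window instead of one roughly every six calls.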

Interested in feedback on which of these would be most effective.