DataDog / dd-sdk-ios

Datadog SDK for iOS - Swift and Objective-C.
Apache License 2.0

DataDog Performance Issues #956

Closed charlie-foreflight closed 1 year ago

charlie-foreflight commented 2 years ago

We have been noticing performance issues with Datadog iOS SDK. Our application has 2 different logging methods: logging to a local file & Datadog.

After running a stress test (10k logs every 2 seconds), the local-file solution shows a slight, continuous increase in memory and moderately higher CPU usage (around 40%). The Datadog iOS SDK, by contrast, shows massive memory growth (1-2 GB within a minute, far higher than the local-file solution) and extremely high CPU usage (>100%).
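For concreteness, the stress loop looks roughly like this (a sketch; `logger` stands in for either backend, and the function name is mine, not from either implementation):

```swift
import Foundation

// Sketch of the stress test described above: 10k log calls every 2 seconds.
// `logger` is whichever logging backend is under test (local file or Datadog).
func runStressTest(logger: (String) -> Void, rounds: Int) {
    for round in 0..<rounds {
        for i in 0..<10_000 {
            logger("stress log \(round)-\(i)")
        }
        // Pause 2 seconds between bursts, as in the test setup.
        Thread.sleep(forTimeInterval: 2)
    }
}
```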

What is really strange is that memory continues to increase for Datadog even after the stress test is stopped. After a while, CPU usage drops off and memory suddenly falls back to normal levels. With the local-file solution, memory usage instead declines slowly over time as it continues writing to disk.

From my understanding, Datadog writes all logs to disk first, then uploads them to the server afterwards. I would therefore expect the performance characteristics of these two options to be fairly similar.

However, the continuing memory growth after logging stops, and the much higher CPU consumption, are concerning.

I did notice in the code that this queue is user-interactive, which is strange to me - nothing about logging is user-interactive. What was the thought process behind making it user-interactive?

https://github.com/DataDog/dd-sdk-ios/blob/a7f4b50133de9243caf9d094ce9f604f8c1d60bb/Sources/Datadog/Logging/RemoteLogger.swift#L62-L65
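For readers without the link handy, the referenced lines boil down to something like this (a paraphrased sketch, not the SDK's actual source; the label string is illustrative):

```swift
import Foundation

// Sketch of the queue referenced above: a serial dispatch queue created
// with the `.userInteractive` quality-of-service class, which gives its
// work the highest scheduling priority.
let queue = DispatchQueue(
    label: "com.datadoghq.logger", // illustrative label, not the SDK's
    qos: .userInteractive
)
```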

ncreated commented 2 years ago

Hello @charlie-foreflight πŸ‘‹. Thank you for looking at our SDK πŸ™‚. Yes, you're right - the SDK uses file storage under the hood, so expecting performance characteristics similar to regular write-to-file logging is reasonable.

We are aware of the existing performance limitations and we're actively working on improvements. These, however, involve deep architectural changes and will be part of SDK 2.x, coming soon for both iOS and Android. To address your observations, let me explain the issue we've diagnosed and the actions we're taking to improve it.

The main reason for the performance bottleneck you described (and for the significant difference from simple write-to-file logging) is the context enrichment that our SDK performs automatically. All collected data gets decorated with 3 kinds of context: user context (e.g. user info), device context (e.g. network connection status) and internal context (e.g. all logs can be bundled with RUM information). Our 1.x SDK assembles the Log context by running a critical section in each partial-context provider (user info provider, network status provider, etc.). In rare cases, this requires running a blocking operation on the caller thread (e.g. this queue.sync {} for assembling user attributes in Logger - which is also why we use a prioritised "user interactive" queue: to reduce the impact).
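A minimal sketch of the 1.x pattern described above - the type and property names are hypothetical, chosen only to illustrate the blocking critical section:

```swift
import Foundation

// Hypothetical partial-context provider illustrating the 1.x design:
// reads on the logging hot path block the caller's thread.
final class UserInfoProvider {
    private let queue = DispatchQueue(label: "user-info-provider")
    private var attributes: [String: String] = [:]

    func update(_ new: [String: String]) {
        queue.async { self.attributes = new }
    }

    // `queue.sync` is the blocking critical section: the calling thread
    // waits until the queue drains, which becomes a bottleneck when many
    // threads log at high volume.
    var current: [String: String] {
        queue.sync { attributes }
    }
}
```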

To address this 1.x performance problem, we have changed the way the SDK context will be managed in 2.x (#916). Instead of using distinct providers, each requiring its own critical section, we're centralising context mutation in a single provider backed by a single queue. With this change, the internal data flow becomes asynchronous, non-blocking and more performant (eventually allowing us to scale up the number of internal queues in stress situations). This ongoing work can be tracked on the feature/v2 branch and should be available later this year in 2.0.0.
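The centralised design could be sketched as follows (an assumed shape based on the description above and #916, not the actual 2.x implementation; all names are illustrative):

```swift
import Foundation

// Illustrative single source of truth for SDK context.
struct SDKContext {
    var userInfo: [String: String] = [:]
    var networkStatus: String = "unknown"
}

// One provider, one queue: writes mutate state serially, and readers
// receive the context asynchronously instead of blocking their thread.
final class ContextProvider {
    private let queue = DispatchQueue(label: "sdk-context")
    private var context = SDKContext()

    func write(_ mutate: @escaping (inout SDKContext) -> Void) {
        queue.async { mutate(&self.context) }
    }

    // Non-blocking read: the caller's thread is never paused.
    func read(_ receive: @escaping (SDKContext) -> Void) {
        queue.async { receive(self.context) }
    }
}
```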

I hope this brings some answers to your observations @charlie-foreflight - let me know πŸ™‚.

charlie-foreflight commented 2 years ago

@ncreated Thanks so much for your detailed writeup. We have lowered the log volume on our end to reduce the issue in the meantime. We will keep an eye out for version 2.0.0 and will work towards upgrading when it comes out. I will close this issue for now, since the work seems to be tracked elsewhere and is in progress. If that is not the case, feel free to reopen.

charlie-foreflight commented 1 year ago

Reopening, so that we can track it internally.

ncreated commented 1 year ago

Hello @charlie-foreflight πŸ‘‹. I want to refresh this thread and share the current status of SDK performance. As I mentioned earlier, we're on course to 2.0.0, and our work lands incrementally in each successive release (including perf improvements).

To measure the improvements, I ran the following benchmark comparing 3 versions of the SDK: 1.11.1 (the version originally referenced in this thread), 1.14.0 (the most recent release) and develop https://github.com/DataDog/dd-sdk-ios/commit/d19f328d27574dead3ed9439141b6f54d38c575d (the upcoming one).

1.11.1

MEM peak: 242MB CPU peak: 184%

1.14.0

MEM peak: 130MB CPU peak: 241%

develop https://github.com/DataDog/dd-sdk-ios/commit/d19f328d27574dead3ed9439141b6f54d38c575d

MEM peak: 67MB CPU peak: 177%

With this, we hope that both issues originally reported in this thread are mitigated (as of 1.14.0) and will be fully solved in the upcoming 1.15.0 release. In particular:

We're still concerned that memory grows during the test, but this could be a matter of hitting the SDK's throughput limit for a given device - e.g. on iPhone 13 Pro Max the upward trend appears at 2K events/s, while the allocation graph stays flat at 100 events/s. This is the part we're still figuring out (we're constantly iterating on the core components) - do you have a particular throughput target in mind that could help us set the bar?

On top of that, thank you again for bringing this issue to the surface πŸ™‚ - it is extremely helpful to understand the performance expectations people have of our SDK.

maxep commented 1 year ago

2.0 is now released and addresses the performance issues as described in @ncreated's comment.