Azure / data-api-builder

Data API builder provides modern REST and GraphQL endpoints to your Azure Databases and on-prem stores.
https://aka.ms/dab/docs

[Feature Request]: Configurable Telemetry Throttling for High Throughput Scenarios #2287

Open yhsparrow opened 2 weeks ago

yhsparrow commented 2 weeks ago

What happened?

Issue Description: While conducting load testing on an API built with Data API builder (DAB), it was observed that the telemetry available in Live Metrics and other Application Insights tools reflects only a fraction of the actual traffic. Specifically, during a test simulating 1000 requests per second, Live Metrics reported processing only around 4 requests per second. This discrepancy appears to be caused by the current telemetry throttling mechanism, which samples or throttles the telemetry data so aggressively that it does not provide a true picture of the system's performance under load.

Impact: This issue makes it challenging to accurately monitor and assess the application's performance and health during high-load scenarios, which is critical for capacity planning, performance tuning, and ensuring the reliability of the service.

Steps to Reproduce:

  1. Set up a basic API using DataApiBuilder.
  2. Configure Application Insights for telemetry collection (see the config sketch after this list).
  3. Conduct a load test simulating 1000 requests per second to the API.
  4. Observe the telemetry data in Application Insights Live Metrics.
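
For reference, step 2 normally amounts to a small telemetry block in the DAB configuration file. A minimal sketch, assuming the layout described in the DAB Application Insights docs (the connection string is a placeholder read from an environment variable):

```json
{
  "runtime": {
    "telemetry": {
      "application-insights": {
        "enabled": true,
        "connection-string": "@env('APP_INSIGHTS_CONNECTION_STRING')"
      }
    }
  }
}
```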

Expected Behavior: The telemetry data in Application Insights should accurately reflect the load generated by the test, allowing for real-time monitoring and analysis of the application's performance under stress.

Actual Behavior: The telemetry data significantly underreports the actual traffic, indicating only about 4 requests per second in Live Metrics, due to aggressive telemetry sampling or throttling.

Feature Request: I propose the introduction of a feature or configuration option within DAB that allows users to adjust the level of telemetry throttling or sampling, especially for scenarios requiring precise monitoring and diagnostics, such as load testing or performance benchmarking. This feature would enable developers and system administrators to get a more accurate picture of the application's behavior under various load conditions, improving the observability and manageability of applications built with DAB.
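
To make the ask concrete, here is a purely hypothetical sketch of what such a knob could look like in the DAB config. The `sampling` block and its keys do not exist today; they only illustrate the kind of control being requested (the `//` comment is an annotation and would need to be removed from real JSON):

```json
{
  "runtime": {
    "telemetry": {
      "application-insights": {
        "enabled": true,
        "connection-string": "@env('APP_INSIGHTS_CONNECTION_STRING')",
        // hypothetical block: not a real DAB option today
        "sampling": {
          "adaptive": false,
          "fixed-rate-percentage": 100
        }
      }
    }
  }
}
```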

Potential Benefits:

  1. Enhanced monitoring and diagnostics capabilities for high-load scenarios.
  2. Improved accuracy of performance metrics, aiding in more effective capacity planning and performance tuning.
  3. Increased flexibility in telemetry management, allowing for tailored configurations based on the specific needs of the application and environment.

Thank you for considering this feature request. I believe it would significantly enhance the utility and flexibility of monitoring, performance testing and benchmarking for DAB, especially for applications with high throughput demands.

Version

1.1.7

What database are you using?

Azure SQL

What hosting model are you using?

Container Apps

Which API approach are you accessing DAB through?

REST

Relevant log output

No response


abhishekkumams commented 1 week ago

@yhsparrow, Thanks for flagging this issue and diving deep into the challenges you're hitting with telemetry data in Application Insights during your load tests. Really appreciate the detailed rundown!

I wanted to highlight that Azure Application Insights implements throttling and sampling mechanisms to manage the volume of telemetry sent, keeping the service performant and cost-effective. Adaptive sampling is enabled by default in all recent versions of Application Insights. ref: https://learn.microsoft.com/azure/azure-monitor/app/sampling-classic-api
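
For context on what "adjustable" could look like: in ASP.NET Core apps that wire up the Application Insights SDK themselves, adaptive sampling is typically controlled through ApplicationInsightsServiceOptions, which (for SDK 2.15.0 and later) can also be bound from the ApplicationInsights section of appsettings.json, roughly as sketched below. This reflects my understanding of the App Insights SDK, not an existing DAB option; exposing an equivalent knob in DAB is essentially what this issue is asking for. The `//` comment is only an annotation.

```json
{
  // appsettings.json sketch, assuming Microsoft.ApplicationInsights.AspNetCore 2.15+
  "ApplicationInsights": {
    "ConnectionString": "<your-connection-string>",
    "EnableAdaptiveSampling": false
  }
}
```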

But as you highlighted, I understand the need for more fine-grained control over telemetry throttling and sampling, especially for something like load testing or performance benchmarking.

We will review this. While I cannot provide an immediate timeline for this feature's potential release, we will keep this thread updated with any new changes.

JerryNixon commented 1 week ago

Even if this were configurable in DAB, throttling at the Application Insights level could still produce the same result. This is worth investigating to see whether a configuration setting (or settings) could mitigate it, or whether we should address load-testing and high-volume scenarios through different telemetry sinks instead (just one idea). We don't want to solve one issue by introducing two more.

abhishekkumams commented 1 week ago

@yhsparrow, it would really help us if you could add some screenshots as well. Were you seeing around 4 req/sec on the Performance tab? Could you also check the Metrics tab under Monitoring and tell us what you see?