OpenLiberty / open-liberty

Open Liberty is a highly composable, fast to start, dynamic application server runtime environment
https://openliberty.io
Eclipse Public License 2.0
1.14k stars 587 forks source link

SVT: Bulkhead waiting duration metrics appear to be calculated incorrectly #4477

Closed hanczaryk closed 6 years ago

hanczaryk commented 6 years ago

Using Liberty build from 8/6, I've setup ICp with Prometheus to leverage the new metrics for FT 1.1. I was testing the changes made as a result of issue #4378

I'm confused as to the new values reported.

For example, after my 10 min SVT ProShop app run I encountered the following reported metrics

Element - application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_mean_seconds{instance="10.1.129.132:9080",job="deb-liberty"}
Value - 11790.49718973848

Element - application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_max_seconds{instance="10.1.129.132:9080",job="deb-liberty"} Value - 240320.739806828

I may not be understanding the metrics correctly, but anything reported in seconds with a value that large appears to be incorrect.

FYI, here are the following different quantile reported values application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.129.132:9080",job="deb-liberty",quantile="0.5"} | 0.011559017000000001 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.129.132:9080",job="deb-liberty",quantile="0.75"} | 0.015268878000000001 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.129.132:9080",job="deb-liberty",quantile="0.95"} | 0.591072783 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.129.132:9080",job="deb-liberty",quantile="0.98"} | 240305.144802424 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.129.132:9080",job="deb-liberty",quantile="0.99"} | 240310.07520493903 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.129.132:9080",job="deb-liberty",quantile="0.999"} | 240320.739806828

I didn't observe any functional errors for my app load with the jmeter results uploaded to http://rtpgsa.ibm.com/home/h/a/hanczar/web/public/ICPresults/ProShop/daily/jmeterResults/

hanczaryk commented 6 years ago

This has been opened for 7 days and no one has yet been assigned. Should someone be assigned with the release closing shortly ?

hanczaryk commented 6 years ago

FYI, I've run 4+days of app load and the following are the reported values.

application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.3.6:9080",job="deb-liberty",quantile="0.5"} | 0.011705990000000001 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.3.6:9080",job="deb-liberty",quantile="0.75"} | 0.015756683 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.3.6:9080",job="deb-liberty",quantile="0.95"} | 0.629619695 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.3.6:9080",job="deb-liberty",quantile="0.98"} | 925700.118886474 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.3.6:9080",job="deb-liberty",quantile="0.99"} | 925711.7505301931 application:ft_myservlets_beans_bulkhead_bean_exec_sql_query_bulkhead_waiting_duration_seconds{instance="10.1.3.6:9080",job="deb-liberty",quantile="0.999"} | 925750.6278539441

The drastic difference between 95th -> 98th quantile doesn't seem to match my experience. I've uploaded the latest 24 hours of jmeter app load results to https://rtpgsa.ibm.com/projects/w/wassvt/TestResults/18.0.0.3/API-1/APSC_FT1.1_ProShop_3Day/jmeterResults/

The jmeter results show the responseTime (ms) for my CDIBulkhead servlet as follows 90th pct | 95th pct | 99th pct 115.00 | 414.95 | 1213.99

Azquelt commented 6 years ago

Ok, I think I see where this is going wrong.

We're submitting the task to be run asynchronously, then we're recording the start time. When the task is dequeued, we compare the current time with the start time to see how long it spent in the queue.

I suspect it's possible that the task gets dequeued before we record the start time. I will change the ordering so that we record the start time before we queue the task.

Azquelt commented 6 years ago

The fix for this is included in https://wasrtc.hursley.ibm.com:9443/jazz/web/projects/WS-CD#action=com.ibm.team.build.viewResult&id=_BjGgQKE-EeiH1sOX9zTQgw