amachanic / sp_whoisactive

sp_whoisactive
GNU General Public License v3.0
1.17k stars 289 forks source link

Split out wait_info into XML fields #35

Closed jzabroski closed 3 years ago

jzabroski commented 3 years ago

I am planning to write a parser to handle what is currently explained in the documentation at:

http://whoisactive.com/docs/16_morewaits/

Depending on how many tasks are waiting for each wait type, Who is Active breaks out the times as follows:

  • One waiting task: (1x: MINms)[wait_type] where MINms is the number of milliseconds that the task has been waiting
  • Two waiting tasks: (2x: MINms/MAXms)[wait_type] where MINms is the shorter wait duration between the two tasks, and MAXms is the longer wait duration between the two tasks
  • Three or more waiting tasks: (Nx: MINms/AVGms/MAXms)[wait_type] where Nx is the number of tasks, MINms is the shortest wait duration of the tasks, AVGms is the average wait duration of the tasks, and MAXms is the longest wait duration of the tasks

I propose instead of returning wait_info as a string, we return wait_info as an XML object, something like:

<wait_info type="OneWaitingTask">
  <task wait_type="ASYNC_NETWORK_IO" milliseconds="MINms" />
</wait_info>

<wait_info type="TwoWaitingTasks">
  <taskSummary wait_type="ASYNC_NETWORK_IO" minMilliseconds="MINms" maxMilliseconds="MAXms" />
</wait_info>

<wait_info type="ThreeOrMoreWaitingTasks">
  <taskSummary wait_type="ASYNC_NETWORK_IO" minMilliseconds="MINms" avgMilliseconds="AVGms" maxMilliseconds="MAXms" />
</wait_info>

Also... It is not clear to me what Two waiting tasks does when the wait_types are different (I think this is possible but likely a ultra rare scenario and thus has never been reported as a bug). I think this is probably a documentation descriptive issue more than a logic issue, but I plan to dive into the logic to see if it's also logically buggy.

amachanic commented 3 years ago

@jzabroski Just to clarify, the "tasks" as written in the docs refer to tasks as grouped by wait type. So there's no possibility, even remote, of their being different.

As for your XML approach, it's an interesting idea, but it's not a good default option. Reason being, I the current setup is optimized for quick on-screen reading, and XML will require someone to click to see the results. Waits are the kind of thing that I look for right upfront, and I really don't want to mess with that general workflow.

We could potentially make an option for it. What are the use cases here?

Regarding the suggested XML, I think I'd rather either make the "[n]Milliseconds" attributes sparse, or just always populate all three, rather than create three node "types." (minMilliseconds could do double-duty as milliseconds, in the sparse case.)

jzabroski commented 3 years ago

@amachanic Thanks for clarifying tasks = wait_type.

We could potentially make an option for it. What are the use cases here?

I want to create a report that emails me when certain conditions are met. In particular, when there is a long ASYNC_NETWORK_IO wait that is the blocking session leader. I have not nailed down the exact parameters, but I imagine being able to derive "long" wait type as parsing the MINms field and say, alerting if it's active for more than 5 minutes.

The back story for this is that a customer has some Excel spreadsheets that issue raw database queries (have tried to convince them several times this is a bad idea on many levels, am getting closer to getting them to agree, but for now they are happy with their terrible solution as compared with me working on other projects vs. cleaning up TENS OF THOUSANDS of Excel spreadsheets), and rarely (about ~5 times a year), under certain conditions, an ASYNC_NETWORK_IO wait can cause a livelock chain when a user locks there computer screen in the middle of an Excel database connection update. The livelock occurs because Excel uses immediate mode to draw to the screen, but locking the computer causes the user session to be locked, and so the Excel background database thread has nowhere to spool results to. It is a live lock because while ordinarily ASYNC_NETWORK_IO should not cause lower level locks, there are some cases where the lock promotion involves a table lock, which in turn means until ASYNC_NETWORK_IO lock is released, the other locks don't get released, either. (Most commercial SQL performance analytics tools will tell you ASYNC_NETWORK_IO cannot cause database contention issues, and they are wrong. I have seen it with my own eyes.)

I thought about just writing my own query to check for this, but given how amazing of a job you've done documenting sp_WhoIsActive, it just seems like a dumb thing to roll my own.

jzabroski commented 3 years ago

Regarding the suggested XML, I think I'd rather either make the "[n]Milliseconds" attributes sparse, or just always populate all three, rather than create three node "types." (minMilliseconds could do double-duty as milliseconds, in the sparse case

The proposed XML was just a strawman proposal to flesh out what I roughly want. I agree the strawman was not a good API data contract. It was merely transliterating how I read the docs. It is bare minimum acceptable.

jzabroski commented 3 years ago

For what it's worth, there is a StackOverflow thread I created ~4 years ago about this "back story": https://dba.stackexchange.com/questions/179585/high-async-network-io-when-user-locks-computer-with-excel-open/181886

The funny thing is, the answer to the problem is literally in the thread title, I just didn't know how to piece it all together at the time.

jzabroski commented 3 years ago

Found an interesting behavior in sp_WhoIsActive on the version I am running locally.

wait_info column can output (54ms)CXCONSUMER, which is undocumented. My parser was expecting (1x: 54ms)CXCONSUMER). There is no version number in the stored procedure text, so I am not sure how to validate if this behavior exists with the latest version other than updating to latest (my version is within the last two years).

amachanic commented 3 years ago

@jzabroski That format has been around since the beginning and is very much documented: http://whoisactive.com/docs/15_waits/

I'm not sure why you don't have a version number in your stored procedure text, but all official releases for the last decade have had it in the header.

jzabroski commented 3 years ago

@amachanic I figured out why just now - there is a "feature" in our deployment tool that strips all comments . Sorry for that.

erikdarlingdata commented 3 years ago

@jzabroski Since you haven't reported back on this, I'm gonna assume that you're all set here. If you're still gonna work on this, let us know and we'll re-open it.