billgraziano / xelogstash

Send SQL Server Extended Events to Logstash, Elastic Search, or JSON

Extended Event Writer for SQL Server

sqlxewriter.exe is an application to pull SQL Server Extended Events and SQL Server Agent job results and write them to various sinks. It can write to Logstash, Elastic Search, or JSON files. It runs on the command-line or as a service. It supports SQL Server 2012 and higher. It has limited support for SQL Server 2008 (R2). This application replaces xelogstash.exe. If you were running xelogstash.exe, please read this carefully.

  1. Getting started
  2. What's New
  3. Sources and Defaults
  4. Controlling JSON
  5. Adds, Moves, Copies
  6. Prefixes and keeping your place
  7. Application Settings
  8. Derived Fields
  9. Sinks
  10. Beta Features
  11. Linux Support
  12. Other Notes
  13. Building

Getting Started

I've found Visual Studio Code to be a very good tool for editing TOML files and viewing log files. Plus it installs for only the local user by default. If you use VSCode, copy the .vscode directory from the samples directory to the root of the project directory.

The application uses a TOML file for configuration. Two sample files are included.

  1. Extract the ZIP contents to a directory. The sample configuration file (sqlxewriter.toml) reads the system_health session and writes events to a file.
  2. If you have a local SQL Server installed, no changes are necessary.
  3. Otherwise, edit sqlxewriter.toml and change the fqdn under [[source]] to point to a SQL Server. If you put in a named instance, use double-backslashes instead of a single backslash (see the sketch after these steps). You will need VIEW SERVER STATE permission on the instance.
  4. From a command-prompt, run "sqlxewriter.exe -debug". (This doesn't send anything to Logstash or Elastic Search yet). This will pull one set of events and then exit.
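
As a rough sketch, a minimal configuration with an edited fqdn might look like the following. The server and instance names are placeholders; rows and the [defaults] section are covered later in this document.

[defaults]
rows = 0                      # 0 = unlimited; the shipped sample reads only 1 event

[[source]]
fqdn = "MYSERVER\\SQL2016"    # a named instance needs double-backslashes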

This should create an events folder and an xestate folder. The events folder holds the JSON generated for each event. The xestate folder is used to keep track of the read position in each Extended Event session.

NOTE: The permissions on the xestate directory are limited. When switching to a service account, be prepared to reset the permissions to grant the service account full control of that directory and the files in it.

Writing events to a file

The files in the events directory are newline-delimited JSON (NDJSON). The files are rotated every hour. These files can be written to Elastic Search using FileBeat. A sample FileBeat configuration file is included.

Sending to Logstash

To send events directly to Logstash, specify the logstash section with a host value. The host should be in host:port format. After that you can run the executable with the same parameters.

[logstash]
host = "localhost:8888"

Now it is writing to both the file and Logstash.

Command Line Options

Running sqlxewriter -? will display the help for the options.

Running as a Windows service

In order to run this as a service in Windows, complete the following steps

  1. From an Administrative command prompt, run sqlxewriter -service install. This will install as a service named sqlxewriter. You can find it in the list of services as "XEvent Writer for SQL Server".
  2. Configure the service to auto-start (a command-line sketch follows these steps).
  3. Update the service to run as a service account. This service account should have VIEW SERVER STATE permission on each SQL Server it polls.
  4. Reset the permissions on the xestate directory and ALL files in that directory to grant the service account write permission.
  5. When it comes time to update the service, just stop the service and replace the executable.
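
A rough command-prompt sketch of steps 2 through 4 is shown below, assuming the service was installed with the default name sqlxewriter. The account name and password are placeholders.

:: set the service to start automatically
sc config sqlxewriter start= auto

:: run the service as a specific account
sc config sqlxewriter obj= "DOMAIN\svc-xewriter" password= "********"

:: grant that account full control of the xestate directory and the files in it
icacls xestate /grant "DOMAIN\svc-xewriter:(OI)(CI)F" /T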

A similar process should work for Linux. This uses github.com/kardianos/service. Additional documentation may be found there.

Running as Linux Service or Container

My experience here is very limited. Feedback is appreciated.

Scaling up

  1. Changes to the .toml require a service restart to take effect unless you set watch_config = true. This includes adding sources.
  2. The sample sqlxewriter.toml only reads 1 event per server per minute. This should be set to unlimited (rows = 0) or some high number like 20,000 (rows = 20000).
  3. There are two sample Extended Event session scripts. One captures logins and the other captures interesting events like errors, slow SQL, blocked processes, mirroring events, and writes to the error log.
  4. I suggest capturing the system_health, AlwaysOn_health, and these two sessions.
  5. The sources can be broken out into a separate file named sqlxewriter_sources.toml. This makes editing easier and allows the file to be generated from Puppet or some other tool (a sketch follows this list).
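
A sketch of a separate sources file is shown below. The server names are placeholders; any defaults stay in sqlxewriter.toml.

- - - sqlxewriter_sources.toml - - -

[[source]]
fqdn = "SQL01"

[[source]]
fqdn = "SQL02\\PROD"    # a named instance uses double-backslashes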

What's New

Release 1.8

Tag 1.7.9

This is a tag only.

Release 1.7.8

Release 1.7.6

Release 1.7.4

Release 1.7.1

Release 1.5.1

Release 1.5

Release 1.4.3

Release 1.4

Release 1.3.2

Release 1.3.1

Release 1.3.0 (Pre-Release)

Release 1.2.X (Pre-Release)

Release 1.0

Release 0.92

Release 0.91

Release 0.9

Sources and Defaults

Each SQL Server you want to extract events from is called a "source". You can specify each source using the [[source]] header in the TOML file. (This is an array of sources.) The only required field for a source is the fqdn which is how you connect to the server.

The [defaults] section applies to all sources (but can be overridden by the source). I find it's easiest to set most of your values in this section and then override them for each source. The "default" section is just a source itself. Note: The source values override the default values except for adds, copies and moves which are discussed below. Those are merged together.

You can set the following Source fields (or Default fields)
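
As a rough sketch, a source that overrides several of the settings described in this document might look like this (the server name and values are illustrative only):

[[source]]
fqdn = "SQL01"
rows = 20000
timestamp_field_name = "@timestamp"
payload_field_name = "mssql"
adds = ["environment:PROD"]
lowercase = ["global.host.name"]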

Controlling the JSON

The two fields timestamp_field_name and payload_field_name are available in the Source and Default sections. The following examples best illustrate how they work.

Example 1

All the event fields are at the root level.

- - - sqlxewriter.toml - - - 

timestamp_field_name = "@timestamp"
payload_field_name = ""

 - - - generates - - - - - 

{
  "@timestamp": "2018-05-08T01:23:45.691Z",
  "client_app_name": "sqlxewriter.exe",
  "client_hostname": "D30",
  ". . . lots of json fields...": "go here",
  "xe_severity_value": 6
}

Example 2

All the event fields are nested inside an "mssql" field. This is the most common way to run the application.

- - - sqlxewriter.toml - - - 

timestamp_field_name = "event_time"
payload_field_name = "mssql"

- - - generates - - - 

{
  "event_time": "2018-05-08T01:23:49.928Z",
  "mssql": {
    "client_app_name": "sqlxewriter.exe",
    "client_hostname": "D30",
    ". . . lots of json fields...": "go here",
    "xe_severity_value": 6
  }
}

Adds, Moves, and Copies

Adds, moves, and copies give you the opportunity to modify the generated JSON. All three are arrays with a format of "string1:string2".

Note: For these values, any Source overwrites the Default at the individual key level. If both the default and source try to add a key for "environment", it will use the value from the Source. But if the Default adds a key for "datacenter" and the Source adds a key for "environment", it will add both keys. Copies and Moves are handled the same way.

Further, the keys can be nested using a dotted notation. For example, setting a key at global.host.name will nest the value three levels deep.

Consider the following settings:

timestamp_field_name = "@timestamp"
payload_field_name = "event"

adds =  [   "global.log.vendor:Microsoft",
            "global.log.type:Application",
            "global.log.collector.application:sqlxewriter.exe",
            "global.log.collector.version:'$(VERSION)'",
        ] 

copies = [  "event.mssql_computer:global.host.name",
            "event.mssql_domain:global.host.domain",
            "event.mssql_version:global.log.version"
        ]

That results in this event:

{
  "global": {
    "host": {
      "domain": "WORKGROUP",
      "name": "D30"
    },
    "log": {
      "version": "SQL Server 2016 RTM",
      "collector": {
        "version": "0.12",
        "application": "sqlxewriter.exe"
      },
      "type": "Application",
      "vendor": "Microsoft"
    }
  },
  "@timestamp": "2018-05-08T01:24:47.368Z",
  "event": {
    "client_app_name": "xecap.exe",
    "client_hostname": "D30",

    "mssql_computer": "D30",
    "mssql_domain": "WORKGROUP",
    "mssql_server_name": "D30\\SQL2016",
    "mssql_version": "SQL Server 2016 RTM",
    "name": "login",

    "server_instance_name": "D30\\SQL2016",
    "server_principal_name": "graz",
    "timestamp": "2018-05-08T01:24:47.368Z",
    "xe_acct_app": "graz - xecap.exe",

    "xe_session_name": "Demo-Logins2",
    "xe_severity_keyword": "info",
    "xe_severity_value": 6
  }
}

Note how the copies are able to "lift" values out of the event and into other parts of the JSON document. This helps conform to whatever standards are in your ELK environment.

The values that are added can be strings, integers, floats, booleans, or dates. Putting a value in single ticks forces it to be a string.
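
The example above only shows adds and copies. A moves entry uses the same "from:to" format; the hypothetical sketch below assumes a move relocates the value so it appears only at the destination.

moves = [ "event.mssql_computer:global.host.name" ]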

Replacements

The adds, moves, and copies also support a few "replacement" values.

See the section below on derived fields for a description of the "mssql" and "xe" fields.

Upper and Lower Case Fields

SQL Server generally returns fields in a consistent case. However, I've started to see @@SERVERNAME return lower case on some servers. Since Elastic Search is case-sensitive, this can be challenging. Additionally, certain fields may read better in upper or lower case. Field case can be controlled using these fields:

uppercase = ["mssql.server_name", "mssql.server_principal_name"]
lowercase = ["global.host.name"]

These can be set in the defaults section of the TOML file and in each individual source. The source specific entries are processed after the defaults.

These are processed after adds, moves, and copies.

Prefixes and keeping your place

The application keeps track of how far it has read into the extended event file target using a state file. This file holds the file name and offset of each read for that session. The file is named Domain_ServerName_Session.state. There is also a ".0" file that is used while the application is running. You can tell the application to start all over by deleting the state file. The "ServerName" above is populated by @@SERVERNAME from the instance.

Application Settings

These are the fields you can set in the [app] section of the configuration file.

[app] section

This controls the overall application. All these fields are optional.
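
As a small sketch, two [app] settings that are mentioned elsewhere in this document look like this:

[app]
watch_config = true     # pick up changes to the .toml files without a service restart
beta_features = false   # see the Beta Features section below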

Filters

A series of filters can be added to the TOML configuration file. That looks like this:

[[filter]]
filter_action = "exclude"
error_number = 15151

[[filter]]
filter_action = "include"
error_number = 15151
server_instance_name = "server01"

The filter_action must be include or exclude. Each filter that matches ALL of its fields sets the action to that value. After all filters have been processed, the event is either included or excluded.

For example, this means that a broad filter could exclude an event but a later more specific filter could change the action to include it.

In the example above, all 15151 errors are excluded except for "server01".

Derived Fields

Based on a particular event, the application computes a number of calculated fields and adds those to the event. Most of them have an "xe" prefix to separate them. It also returns a few SQL Server level settings with an "mssql" prefix.

Sinks

XEWriter can write to multiple targets called "sinks". It can write to files, to Logstash, or directly to Elastic Search. It can write to all three sinks at the same time if they are all specified. They are written serially, so the performance isn't that great.

File Sink

This is configured with a filesink section:

[filesink]
retain_hours = 24

The files are named for the server, instance, session name, date, and hour and are written to an events directory with a .json extension. The files are rotated every hour. Any files older than retain_hours are removed.

These files should be imported into Logstash or Elastic Search using FileBeat. This is probably the simplest way to import.

Logstash Sink

This is configured with a logstash section:

[logstash]
host = "localhost:8888"

This writes the events directly to the specified logstash server.

Elastic Sink

This is configured using the elastic section. This is the most complicated to configure.

[elastic]
addresses = ["https://host.domain.com:1234"]
username = "dev-user"
password = "horsebatterysomethingsomething"

auto_create_indexes = true
default_index = "dev-sql"

event_index_map = [
    "login:dev-login"
]

Sampler Sink

This is configured using the sampler section. This writes sample events for review. It is primarily used in development. It writes one file per extended event type. The files are located in ./sinks/sampler.

[sampler]
duration = "1m"

This will write a sample of each event type every minute. The duration is a sequence of decimal numbers, each with a unit suffix, without spaces. Valid units are "h", "m", "s". Examples include "10m", "1h30m", "1m", "10s".

Beta Features

Enabling beta_features in the app section will enable certain beta features:

[app]
beta_features = true

As of 1.7.8, this adds the following calculated fields: cpu_time_sec, logical_reads_mb, physical_reads_mb, writes_mb, and duration_sec. These fields are only added if the quantity is at or above the threshold. For example, cpu_time_sec is only added if the CPU time is at least 1 second (1 million microseconds).
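
As an illustration, an event reporting 2,500,000 microseconds of CPU time would gain a derived field roughly like this (field placement and values are illustrative only):

{
  "cpu_time": 2500000,
  "cpu_time_sec": 2.5
}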

Linux Support

Experimental support is included for Linux. Please be aware of the following issues:

Other Notes

  1. I find that setting rows = 20000 in the [defaults] section works well. It's enough rows that it catches up quickly if I pause the job.

  2. The sources are processed in the order they are listed. Each server is polled every minute. It spreads out the servers evenly over the minute.

  3. I make some decisions around setting the severity level. Failed jobs and job steps are errors. SQL Server errors are errors. I haven't gone much beyond that yet.

  4. I haven't done much in the way of optimizations yet. It will process between 2,000 and 3,000 events per second on my aging desktop with SQL Server running on the same box. A properly scaled Logstash doesn't slow it down much. I have a few servers that keep 1 GB of login events in 50 MB files. It takes roughly 20 minutes to get through it the first time.

  5. If it gets behind and the offset becomes invalid, it will log an error. It will also set a flag to try to catch up next time. That flag is a third column in the state file that says "reset". If it finds that, it will start at the beginning of the extended event file target and read until it gets to an event after the last one it saw. It will also log an error that events were probably skipped.

Building the Application