emiller42 / splunk-statsd-backend

A backend plugin for Statsd to output metrics to the Splunk HTTP Event Collector (HEC)
MIT License

Unsuccessful sending JSON events to Splunk #4

Closed lewismc closed 3 years ago

lewismc commented 3 years ago

Hi @emiller42, we recently picked up this backend and have been experimenting with sending events from Apache Airflow --> StatsD --> splunk-statsd-backend --> Splunk. The JSON events we are sending look like

{"rate":0.2,"count":2,"metricType":"counter","metricName":"airflow.scheduler_heartbeat"}

... whereas the splunk-statsd-backend documentation states that the JSON should look like

{ "time": <timestamp>, "source": "my_source", "sourcetype": "my_sourcetype", "index": "my_index", "event": {...event payload...} }

Questions

  1. My first question: is this project still alive? Please don't take this the wrong way, I just noticed that there hasn't been a commit for quite some time.
  2. If it is alive, can you comment on the discrepancy in the JSON event above?

If this backend needs to be augmented to accommodate new versions of Splunk, I am happy to do that with a PR. I just want to know if this project is alive though! Thank you

emiller42 commented 3 years ago

Hi @lewismc! Thanks for reaching out! To your questions:

  1. This project isn't exactly under active development, as I haven't been using either Splunk or Statsd for a few years now. However, I'm more than happy to try to address any issues that may have arisen and/or review PRs.
  2. Can you clarify the behavior you're seeing? Is the example JSON what you're seeing in the Splunk UI as search results, or is it the POST payload that Splunk HEC is receiving from Statsd? When searching the data, are the relevant metadata fields (time/index/source/sourcetype) set correctly? Can you also tell me which versions of Statsd and Splunk you're seeing this behavior on?

The behavior I would expect to see is that the full JSON would be sent to Splunk, with the example you provided included as the event property. That event field becomes the raw text of the event, with the rest of the fields controlling event metadata. Then when searching the data, you would see correct event metadata (time/source/sourcetype/index) but only the event itself would be included in the raw text.
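To make the expected wrapping concrete, here is a minimal sketch (the function and config field names are illustrative, not the backend's actual API) of how a StatsD backend would place the metric body under the event key while the outer keys carry metadata:

```javascript
// Hypothetical helper: wrap a metric body in the HEC envelope.
// `event` becomes the raw event text; the outer keys set metadata.
function buildHecPayload(config, timestampSec, event) {
  return {
    time: timestampSec,        // event timestamp, epoch seconds
    source: config.source,     // metadata taken from backend config
    sourcetype: config.sourcetype,
    index: config.index,
    event: event               // the metric body reported by StatsD
  };
}

const payload = buildHecPayload(
  { source: 'statsd', sourcetype: '_json', index: 'main' },
  1485314310,
  { rate: 0.2, count: 2, metricType: 'counter', metricName: 'airflow.scheduler_heartbeat' }
);
```

If only the inner object appears as _raw with the metadata fields still set correctly, the envelope is likely being consumed as intended and the issue lies elsewhere in the pipeline.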

lewismc commented 3 years ago

Thanks for the quick response @emiller42. I'm collecting all of the info to respond thoroughly.

lewismc commented 3 years ago

Hi @emiller42 here's an update. It's important for me to state that we are now on Splunk v8, in which metrics data has changed a bit. With that in mind, it may be the case that the splunk-statsd-backend needs to be updated to reflect those changes.

In response to the actual question, that is the _raw payload after the data has been received and processed by a heavy forwarder. Metadata fields are being processed just fine, it’s the index-time metrics field extraction that is the issue. We are seeing the following error:

INDEXER has the following message: The metric event is not properly structured, source=statsd, sourcetype=_json, host=OMITTED, index=metrics_data. Metric event data without a metric name and preoperly formated numerical values are invalid and cannot be indexed. Ensure the input metric data is not malformed, have one or more keys of the form "metric_name:" (e.g..."metric_name:cpu.idle") with corresponding floating point values.

Let me know if this helps.
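The error suggests the metrics index wants flat "metric_name:<name>" keys with numeric values rather than a nested JSON event. A hedged sketch of that conversion for a counter event (function and field names are assumptions, not the backend's code):

```javascript
// Hypothetical conversion: flatten a counter event into the
// "metric_name:<name>" keys a Splunk metrics index expects.
function toMetricFields(event) {
  const fields = { metric_type: event.metricType };
  fields['metric_name:' + event.metricName + '.count'] = event.count;
  fields['metric_name:' + event.metricName + '.rate'] = event.rate;
  return fields;
}

const fields = toMetricFields(
  { rate: 0.2, count: 2, metricType: 'counter', metricName: 'airflow.scheduler_heartbeat' }
);
```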

emiller42 commented 3 years ago

That detail definitely helps! It looks like you're attempting to ingest the data as Metrics, which IIRC wasn't yet available when I wrote this backend. I'll see if I can get this playing nicely with Splunk Metrics over the weekend. (likely behind a config toggle so it isn't breaking for anyone else who might be using this backend)

I also notice that Splunk appears to directly support StatsD. Is this something you've looked into already?

lewismc commented 3 years ago

It looks like you're attempting to ingest the data as Metrics,

Yes

I also notice that Splunk appears to directly support StatsD. Is this something you've looked into already?

Actually no it isn't. I'm going to have a look at that tomorrow.

We only started work on this at the beginning of the week so still learning more. Also, we are the first team (in work) to attempt to ingest StatsD metrics into Splunk... so we are kinda breaking ground so to speak.

lewismc commented 3 years ago

Hi @emiller42 according to my Splunk admin, the reason we haven’t gone down the route of the direct input for statsd is that deploying an entire cluster-wide change to receive data like that is a heavy effort. We are attempting to get this input working using Cribl. For now though, deploying the cluster-wide change for Splunk to receive direct StatsD output is not an option. Quick question, have you started working on anything yet? If not, I am happy to dive in and get my hands dirty :) Thank you

emiller42 commented 3 years ago

I've been catching up on the implementation of metrics via HEC.

Based on the above, I think I would send a single payload for all metrics of a given metric type, with a field indicating the metric_type. This means that if all four supported metric types are used, there would be four POST payloads per flush interval. (Example payloads at the bottom)

Does this look like it would suit your needs?

Gauges

{
    "time": 1485314310.000,
    "event": "metric",
    "source": "<from_config>",
    "sourcetype": "<from_config>",
    "host": "<from_config>",
    "fields": {
        "metric_type": "gauge",
        // Metric names would be passthrough from input
        "metric_name:foo.cpu.user": 0.0,
        "metric_name:foo.cpu.system": 0.0,
        "metric_name:foo.cpu.idle": 0.0,
        "metric_name:bar.cpu.user": 0.0,
        "metric_name:bar.cpu.system": 0.0,
        "metric_name:bar.cpu.idle": 0.0,
        // ... etc
    }
}

Sets

{
    "time": 1485314310.000,
    "event": "metric",
    "source": "<from_config>",
    "sourcetype": "<from_config>",
    "host": "<from_config>",
    "fields": {
        "metric_type": "set",
        // Metric names would be passthrough from input
        "metric_name:foo.uniques": 98,
        "metric_name:bar.uniques": 127,
        // ... etc. 
    }
}

Counters

Each counter would result in two metrics: <metric_name>.count and <metric_name>.rate

{
    "time": 1485314310.000,
    "event": "metric",
    "source": "<from_config>",
    "sourcetype": "<from_config>",
    "host": "<from_config>",
    "fields": {
        "metric_type": "counter",
        "metric_name:foo.requests.count": 17046,
        "metric_name:foo.requests.rate": 1704.6,
        "metric_name:bar.requests.count": 12544,
        "metric_name:bar.requests.rate": 1254.4,
        // ... etc.
    }
}

Timers

These would output a full set of metrics derived from the input metric name. Example using the metric name foo.duration:

{
    "time": 1485314310.000,
    "event": "metric",
    "source": "<from_config>",
    "sourcetype": "<from_config>",
    "host": "<from_config>",
    "fields": {
        "metric_type": "timer",
        "metric_name:foo.duration.count_90": 304,
        "metric_name:foo.duration.mean_90": 143.07236842105263,
        "metric_name:foo.duration.upper_90": 280,
        "metric_name:foo.duration.sum_90": 43494,
        "metric_name:foo.duration.sum_squares_90": 8083406,
        "metric_name:foo.duration.std": 86.5952973729948,
        "metric_name:foo.duration.upper": 300,
        "metric_name:foo.duration.lower": 1,
        "metric_name:foo.duration.count": 338,
        "metric_name:foo.duration.count_ps": 33.8,
        "metric_name:foo.duration.sum": 53402,
        "metric_name:foo.duration.sum_squares": 10971776,
        "metric_name:foo.duration.mean": 157.9940828402367,
        "metric_name:foo.duration.median": 157.5,
        // histogram fields
        "metric_name:foo.duration.histogram.bin_50": 49,
        "metric_name:foo.duration.histogram.bin_100": 45,
        "metric_name:foo.duration.histogram.bin_150": 66,
        "metric_name:foo.duration.histogram.bin_200": 60,
        "metric_name:foo.duration.histogram.bin_inf": 118,
        // ...etc.
    }
}
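The one-payload-per-metric-type flush proposed above could be sketched roughly like this for gauges and counters (a simplified illustration under assumed input shapes, not the actual implementation; the counter rate is taken as count divided by the flush interval, consistent with the 17046 / 1704.6 example where the interval is 10s):

```javascript
// Hypothetical envelope builder for one metric type.
function metricPayload(config, timestampSec, metricType, fields) {
  return {
    time: timestampSec,
    event: 'metric',
    source: config.source,
    sourcetype: config.sourcetype,
    host: config.host,
    fields: Object.assign({ metric_type: metricType }, fields)
  };
}

// Build one payload per metric type present in this flush.
function buildFlushPayloads(config, timestampSec, metrics, flushIntervalSec) {
  const payloads = [];

  // Gauge values pass through under their input names.
  const gaugeFields = {};
  for (const name in metrics.gauges) {
    gaugeFields['metric_name:' + name] = metrics.gauges[name];
  }
  if (Object.keys(gaugeFields).length) {
    payloads.push(metricPayload(config, timestampSec, 'gauge', gaugeFields));
  }

  // Each counter yields .count and .rate (count / flush interval).
  const counterFields = {};
  for (const name in metrics.counters) {
    const count = metrics.counters[name];
    counterFields['metric_name:' + name + '.count'] = count;
    counterFields['metric_name:' + name + '.rate'] = count / flushIntervalSec;
  }
  if (Object.keys(counterFields).length) {
    payloads.push(metricPayload(config, timestampSec, 'counter', counterFields));
  }

  return payloads;
}

const payloads = buildFlushPayloads(
  { source: 'statsd', sourcetype: 'statsd_metrics', host: 'myhost' },
  1485314310,
  { gauges: { 'foo.cpu.user': 0.0 }, counters: { 'foo.requests': 17046 } },
  10
);
```

Sets and timers would follow the same pattern, with timers expanding each input name into the derived statistics shown in the timer example.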

lewismc commented 3 years ago

@emiller42 the above looks great. I assume this is now built into the master branch? If you can confirm, we can test out the master branch with our StatsD server.

emiller42 commented 3 years ago

Not yet, I've got some local changes to implement it. Doing some cleanup and I'll have you take a look at the PR when ready.

emiller42 commented 3 years ago

This has been pushed and is available in v0.2.0.

lewismc commented 3 years ago

👍 Thank you @emiller42