Filebeat - nginx module - User agent information should also be indexed raw

praseodym commented 7 years ago

With Filebeat 6.0.0-alpha2 on Debian Stretch, the nginx module uses the Elasticsearch ingest-user-agent plugin to parse user agent strings and then remove the raw value. Unfortunately, the ingest-user-agent plugin is not capable of parsing more exotic user agent strings, causing information loss.

Because I think it'd be a pointless exercise to have ingest-user-agent parse every single user agent string in existence, I'd suggest keeping the raw user agent string around alongside the parsed fields. I've found this information useful for identifying rogue scanners: for example, we've had some cases of a foreign OpenVAS scanner hitting our server with thousands of requests in a short timespan, causing increased webserver load. Identifying the scanner's requests in the access logs indexed by Filebeat was quite hard because the user agent string had been lost.

For example, this nginx access log line:

185.9.19.118 - - [18/Jun/2017:18:25:11 +0000] "GET /user?1964161297 HTTP/1.1" 200 43913 "-" "Mozilla/5.0 [en] (X11, U; OpenVAS 8.0.9)" "-"

gets indexed as follows, with all useful user agent information stripped away:

{
  "_index": "filebeat-6.0.0-alpha2-2017.06.18",
  "_type": "doc",
  "_id": "AVy8oG-fewUvHQSJ1Omw",
  "_score": 1,
  "_source": {
    "@timestamp": "2017-06-18T18:25:11.000Z",
    "offset": 5284698,
    "nginx": {
      "access": {
        "referrer": "-",
        "response_code": "200",
        "remote_ip": "185.9.19.118",
        "geoip": {
          "continent_name": "Europe",
          "city_name": "Vienna",
          "country_iso_code": "AT",
          "region_name": "Vienna",
          "location": {
            "lon": 16.35,
            "lat": 48.3
          }
        },
        "method": "GET",
        "user_name": "-",
        "http_version": "1.1",
        "body_sent": {
          "bytes": "43913"
        },
        "url": "/user?1964161297",
        "user_agent": {
          "os": "Other",
          "name": "Other",
          "os_name": "Other",
          "device": "Other"
        }
      }
    },
    "beat": {
      "hostname": "web-01",
      "name": "web-01",
      "version": "6.0.0-alpha2"
    },
    "prospector": {
      "type": "log"
    },
    "read_timestamp": "2017-06-18T19:14:00.405Z",
    "source": "/var/log/nginx/access.log"
  },
  "fields": {
    "@timestamp": [
      "2017-06-18T18:25:11.000Z"
    ]
  }
}

Edit: I think the same case can be made for the Apache2 module, but I have not tested it.

dbuelow commented 7 years ago

+1 This would be great, need the raw value.

praseodym commented 7 years ago

For what it's worth, you can already achieve this behaviour by manually editing the ingest pipeline that is created by Filebeat.
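
A minimal sketch of that edit, assuming the pipeline parses the raw string with a `user_agent` processor and then deletes it with a separate `remove` processor. The field names in the example (`nginx.access.agent`, `nginx.access.user_agent`) and the helper name `keep_raw_user_agent` are assumptions for illustration; check the pipeline Filebeat actually created on your cluster before relying on them.

```python
import copy


def keep_raw_user_agent(pipeline):
    """Return a copy of an ingest pipeline definition in which the raw
    user agent field is no longer removed after it has been parsed."""
    patched = copy.deepcopy(pipeline)
    processors = patched.get("processors", [])

    # Fields that a user_agent processor reads from (the raw strings).
    raw_fields = {
        p["user_agent"]["field"] for p in processors if "user_agent" in p
    }

    def removes_raw_field(processor):
        # Only match remove processors that target a single raw field.
        field = processor.get("remove", {}).get("field")
        return isinstance(field, str) and field in raw_fields

    patched["processors"] = [p for p in processors if not removes_raw_field(p)]
    return patched


# Hypothetical, heavily abbreviated pipeline definition (assumed layout).
example = {
    "processors": [
        {"remove": {"field": "message"}},
        {"user_agent": {"field": "nginx.access.agent",
                        "target_field": "nginx.access.user_agent"}},
        {"remove": {"field": "nginx.access.agent"}},
    ]
}

patched = keep_raw_user_agent(example)
```

The patched definition would then have to be written back to Elasticsearch with a PUT to `_ingest/pipeline/<pipeline name>`. The sketch leaves unrelated `remove` processors (such as the one for `message`) untouched.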

dbuelow commented 7 years ago

That's what I did and it works. But after an update to e.g. 6.x, the default pipeline (e.g. "filebeat-6.0-nginx-access-default") will be created and used. Therefore it would be great to keep the raw value upstream, so the workaround doesn't have to be redone after each update.

vicmosin commented 6 years ago

Could someone provide more details about the workaround? I tried to use custom regex rules (from https://www.elastic.co/guide/en/elasticsearch/plugins/6.3/using-ingest-user-agent.html#_using_a_custom_regex_file) but they seem to be ignored.

lnxg33k commented 6 years ago

That also happens with the IIS module.

elastic / beats

Filebeat - nginx module - User agent information should also be indexed raw #4521