Koen1999 commented 9 months ago

The Lucene backend currently has several issues that this PR fixes:

Queries that match a value containing whitespace against a field are currently quoted, which Lucene interprets as a phrase. Phrases cannot contain wildcards, so this breaks other functionality. This PR removes the quotes and escapes the whitespace instead. Additionally, the pipeline is changed so that matching is performed against keyword fields whenever Elasticsearch considers a field's type to be a string. The only string value that is still quoted is the empty string.
Search queries that do not match against any field are currently quoted as well. As a result, these queries cannot contain wildcards either, which breaks other functionality. The solution is to replace the quotes with wildcards.
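
To illustrate the first point, here is a minimal sketch of escaping whitespace instead of quoting. It is not the code from this PR: the names `RESERVED`, `escape_lucene_value` and `field_query` are hypothetical, the reserved-character set is a simplified assumption, and the input is assumed to be a plain value in which `*` and `?` are meant as wildcards.

```python
# Hypothetical sketch, not the backend's actual implementation.
# Characters escaped here are a simplified subset of Lucene's reserved
# characters; the wildcards * and ? are deliberately NOT escaped.
RESERVED = set('+-=&|!(){}[]^"~:\\/')


def escape_lucene_value(value: str) -> str:
    """Escape a value for a Lucene query without turning it into a phrase."""
    if value == "":
        # The empty string is the one case that is still quoted.
        return '""'
    out = []
    for ch in value:
        if ch in RESERVED or ch.isspace():
            out.append("\\" + ch)  # backslash-escape instead of quoting
        else:
            out.append(ch)
    return "".join(out)


def field_query(field: str, value: str) -> str:
    """Build a field:value expression with an escaped, unquoted value."""
    return f"{field}:{escape_lucene_value(value)}"


print(field_query("process.command_line", "cmd.exe /c *whoami*"))
```

Because the value is never wrapped in quotes, the wildcards survive, while the escaped spaces keep the value from being split into separate terms.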

Some of these issues were introduced by commit 83afccc4a13433d8e0aaa92985664140aea825e0 in an attempt to fix some of the problems mentioned in #15. This PR should also fix the issues mentioned in #28 and #36.

Attached to this PR, you can find several examples of Sigma rules and how they are compiled to Lucene queries. You will find that (given the correct mapping of field names by the pipeline) these Lucene queries behave in accordance with the expectations set by the Sigma syntax.

sigma-rules.zip

Note: Since I do not have an Elasticsearch instance with field names similar to those commonly produced by Winlogbeat, I cannot check which fields are string fields and hence which fields should be keyword fields in the pipeline. For similar reasons, the field names in the attached Lucene queries are slightly different. Other contributors should check that field names are mapped correctly.

Edit: I realized there are more pipelines that I had never heard of. All fields marked as string fields by Elasticsearch should also be mapped to the .keyword variant for these pipelines.

andurin commented 9 months ago

Merged. Thank you!

Koen1999 commented 9 months ago

Note: Since I do not have an Elasticsearch instance with field names similar to those commonly produced by Winlogbeat, I cannot check which fields are string fields and hence which fields should be keyword fields in the pipeline. For similar reasons, the field names in the attached Lucene queries are slightly different. Other contributors should check that field names are mapped correctly.

@andurin, did you manage to check whether the field mappings were correct and complete? If you make a new release with incorrect mappings, things might break for users. The important thing is that all fields indexed as strings by Elasticsearch should use the .keyword subfield.

andurin commented 9 months ago

@Koen1999, that's my current headache. Data typing here is a little bit frustrating.

ES Mapping and extra .keyword fields

Elastic doesn't really dictate which mappings one should use, and the way their *beats do the mapping is subject to change between versions.

E.g. an index template from packetbeat 8.7.1:

            "command_line": {
              "fields": {
                "text": {
                  "norms": false,
                  "type": "text"
                }
              },
              "ignore_above": 1024,
              "type": "keyword"
            },

versus packetbeat 8.12.0:

            "command_line": {
              "fields": {
                "text": {
                  "type": "match_only_text"
                }
              },
              "type": "wildcard"
            },

After reviewing the packetbeat "default" template, I'll undo your changes to the pipeline. Those fields are already of type "keyword" or "wildcard", which is also a keyword family type (https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#wildcard-field-type).

But I guess there is still enough room for "wrong" queries in the Lucene backend, which I would like to cover with more test cases. I would like to invite you to the discussion: https://github.com/SigmaHQ/pySigma-backend-elasticsearch/discussions/46.
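
As a rough sketch of that check: given one field's entry from an index template, decide whether the field itself can be used for exact and wildcard matching, or whether a keyword-family subfield is needed. The helper name `exact_match_field` is hypothetical and nothing here is part of pySigma. The `message` example assumes a field mapped as `text` with a `keyword` subfield, which is what Elasticsearch's default dynamic mapping produces for strings.

```python
# Hypothetical helper, not part of pySigma or the backend.
# Field types that Elasticsearch groups into the keyword family.
KEYWORD_FAMILY = {"keyword", "constant_keyword", "wildcard"}


def exact_match_field(name: str, mapping: dict) -> str:
    """Return the field name to query for exact or wildcard matching."""
    if mapping.get("type") in KEYWORD_FAMILY:
        return name  # already a keyword-family field, no subfield needed
    for sub_name, sub_mapping in mapping.get("fields", {}).items():
        if sub_mapping.get("type") in KEYWORD_FAMILY:
            return f"{name}.{sub_name}"  # e.g. message.keyword
    return name  # no keyword variant found; leave the name unchanged


# The two command_line mappings quoted above.
packetbeat_8_7_1 = {
    "fields": {"text": {"norms": False, "type": "text"}},
    "ignore_above": 1024,
    "type": "keyword",
}
packetbeat_8_12_0 = {
    "fields": {"text": {"type": "match_only_text"}},
    "type": "wildcard",
}
# Assumed example: a dynamically mapped string field.
dynamic_string = {
    "type": "text",
    "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
}

print(exact_match_field("command_line", packetbeat_8_7_1))
print(exact_match_field("command_line", packetbeat_8_12_0))
print(exact_match_field("message", dynamic_string))
```

With both packetbeat templates the field name comes back unchanged, which is why appending .keyword in the pipeline is unnecessary for those fields; only the dynamically mapped text field is redirected to its subfield.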

SigmaHQ / pySigma-backend-elasticsearch

Fixed issues with query strings containing spaces and/or wildcards for Lucene Backend #43
