SigmaHQ / pySigma

Python library to parse and convert Sigma rules into queries (and whatever else you could imagine)
GNU Lesser General Public License v2.1
380 stars 95 forks source link

`SigmaString` class incorrectly escapes double backslashes #246

Closed majikman111 closed 1 month ago

majikman111 commented 1 month ago

When loading rules from yaml, the SigmaString class inconsistently escapes special characters, especially when strings are defined using single quotes. Given a rule such as

# test.yaml
title: test
logsource: 
    product: test
detection:
    selection:
        - conditionA: '\\' # String declared with single quotes containing multiple slashes.
        - conditionB: '\a\a'
    condition: selection

the loading process will generate a SigmaString for conditionA that is exactly 1 backslash long, where conditionB will generate a SigmaString that is four characters long. The expected behavior should be such that conditionA is loaded as a SigmaString containing 2 backslash characters. I believe this line of code could be changed as follows to correctly load multiple slash characters.

https://github.com/SigmaHQ/pySigma/blob/main/sigma/types.py#L127

                if (
                    c in char_mapping
                ):  # accumulate if character is special or escaping character

This is also inconsistent with the documentation for SigmaString.convert which states that only wildcard characters are escaped. https://sigmahq-pysigma.readthedocs.io/en/latest/Sigma_Rules.html#sigma.types.SigmaString.convert

The following snippet demonstrates the inconsistent character escaping.

import yaml
from sigma.collection import SigmaCollection
# Load yaml file from above
inp = yaml.safe_load(open("test.yaml"))
print(inp)
# {'title': 'test', 'logsource': {'product': 'test'}, 'detection': {'selection': [{'conditionA|contains': '\\\\'}, {'conditionB|contains': '\\a\\a'}], 'condition': 'selection'}}
# The conditionA string is loaded as two slashes (escaped in Python output)

collection = SigmaCollection.from_yaml(open("test.yaml"))

print(collection[0].detection.detections['selection'].detection_items[0].detection_items[0].value[0])
# *\*
# This should be 2 slashes

print(collection[0].detection.detections['selection'].detection_items[1].detection_items[0].value[0])
# *\a\a*

Additionally, the original correct string is copied to the original attribute, but is lost in the case of wildcarded queries (such as when using the contains modifier) when the wildcard character is added. If the original field were maintained during __add__ or other string operations, that would also be helpful.

thomaspatzke commented 1 month ago
# test.yaml
title: test
logsource: 
    product: test
detection:
    selection:
        - conditionA: '\\' # String declared with single quotes containing multiple slashes.
        - conditionB: '\a\a'
    condition: selection

the loading process will generate a SigmaString for conditionA that is exactly 1 backslash long, where conditionB will generate a SigmaString that is four characters long.

This works exactly as intended. The first backslash in conditionA escapes the second one into one literal backslash. In conditionB the backslashes have no escaping semantic because they are not followed by a wildcard or backslash and therefore are just passed as-is into the SigmaString.

The expected behavior should be such that conditionA is loaded as a SigmaString containing 2 backslash characters.

No, then the semantics get inconsistent because there's no clean possibility anymore to express a single literal \ followed by anything. \\* would then be a double backslash followed by a wildcard while \* would be a literal *.

This is also inconsistent with the documentation for SigmaString.convert which states that only wildcard characters are escaped. https://sigmahq-pysigma.readthedocs.io/en/latest/Sigma_Rules.html#sigma.types.SigmaString.convert

SigmaString.convert is for conversion of a SigmaString into the string contained in the converted query, it's not the parsing of a string into a SigmaString. Furthermore, the wildcard string escaping is the default behavior, but convert allows to add further characters to escape.

print(collection[0].detection.detections['selection'].detection_items[0].detection_items[0].value[0])

\

This should be 2 slashes

print on a SigmaString shows a representation that is not intended for any productive usage. To convert a SigmaString into something meaningful in the given query language, the aforementioned convert() method should be used.

Additionally, the original correct string is copied to the original attribute, but is lost in the case of wildcarded queries (such as when using the contains modifier) when the wildcard character is added. If the original field were maintained during __add__ or other string operations, that would also be helpful.

The main purpose of SigmaString is to maintain a parsed representation of strings where semantical wildcards and literal wildcard characters without special semantics can be clearly distinguished. In sigmac it was a huge mess because there wasn't such a possibility and each backend made its own interpretation of strings. The original attribute was initially only intended for debugging purposes and shouldn't be used. It's more likely that this attribute will be dropped in future releases due to such inconsistencies instead of implementing all SigmaString operations for it. Latter would be challenging to implement because of corner cases like SigmaString("\\").original + SigmaString("*").original.