ait-aecid / logdata-anomaly-miner

This tool parses log data and lets users define analysis pipelines for anomaly detection. It was designed to run the analysis with limited resources and the lowest possible permissions, which makes it suitable for use on production servers.
GNU General Public License v3.0

ALLOW_ALL causes some fields to be skipped #775

Closed landauermax closed 3 years ago

landauermax commented 3 years ago

I have the following sample json:

{
  "aa": "a1",
  "fields": {
    "bb": "b1",
    "cc": [
      "c1"
    ],
    "dd": "d1"
  }
}

And I use the following config:

LearnMode: True

LogResourceList:
  - "file:///home/ubuntu/sample.log"

Parser:

       - id: a
         type: VariableByteDataModelElement
         name: 'a'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: b
         type: VariableByteDataModelElement
         name: 'b'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: c
         type: VariableByteDataModelElement
         name: 'c'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: d
         type: VariableByteDataModelElement
         name: 'd'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: json
         start: True
         type: JsonModelElement
         name: 'model'
         key_parser_dict:
           "aa": a
           fields:
             "bb": b
             "cc":
              - c
             "dd": d

Input:
        timestamp_paths: None
        verbose: True
        json_format: True

EventHandlers:
        - id: stpe
          json: true
          type: StreamPrinterEventHandler

Note that each field is mapped to an element. As expected, each field is correctly represented in the output:

2021-06-30 07:08:50 New path(es) detected
NewMatchPathDetector: "DefaultNewMatchPathDetector" (1 lines)
  {
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 1,
    "AnalysisComponentType": "NewMatchPathDetector",
    "AnalysisComponentName": "DefaultNewMatchPathDetector",
    "Message": "New path(es) detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      "/model",
      "/model/a",
      "/model/fields/b",
      "/model/fields/c",
      "/model/fields/d"
    ],
    "ParsedLogAtom": {
      "/model": {
        "aa": "a1",
        "fields": {
          "bb": "b1",
          "cc": [
            "c1"
          ],
          "dd": "d1"
        }
      },
      "/model/a": "a1",
      "/model/fields/b": "b1",
      "/model/fields/c": "c1",
      "/model/fields/d": "d1"
    }
  },
  "LogData": {
    "RawLogData": [
      "{\n  \"aa\": \"a1\",\n  \"fields\": {\n    \"bb\": \"b1\",\n    \"cc\": [\n      \"c1\"\n    ],\n    \"dd\": \"d1\"\n  }\n}"
    ],
    "Timestamps": [
      1625036930.55
    ],
    "DetectionTimestamp": 1625036930.55,
    "LogLinesCount": 1,
    "AnnotatedMatchElement": "/model: {'aa': 'a1', 'fields': {'bb': 'b1', 'cc': ['c1'], 'dd': 'd1'}}\n  /model/a: a1\n  /model/fields/b: b1\n  /model/fields/c: c1\n  /model/fields/d: d1"
  }
}

However, if I change the element of field "bb" to ALLOW_ALL, i.e.,

       - id: json
         start: True
         type: JsonModelElement
         name: 'model'
         key_parser_dict:
           "aa": a
           fields:
             "bb": ALLOW_ALL
             "cc":
              - c
             "dd": d

then the output changes as follows:

2021-06-30 07:11:41 New path(es) detected
NewMatchPathDetector: "DefaultNewMatchPathDetector" (1 lines)
  {
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 1,
    "AnalysisComponentType": "NewMatchPathDetector",
    "AnalysisComponentName": "DefaultNewMatchPathDetector",
    "Message": "New path(es) detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      "/model",
      "/model/a",
      "/model/fields",
      "/model/fields/c"
    ],
    "ParsedLogAtom": {
      "/model": {
        "aa": "a1",
        "fields": {
          "bb": "b1",
          "cc": [
            "c1"
          ],
          "dd": "d1"
        }
      },
      "/model/a": "a1",
      "/model/fields": "b1",
      "/model/fields/c": "c1"
    }
  },
  "LogData": {
    "RawLogData": [
      "{\n  \"aa\": \"a1\",\n  \"fields\": {\n    \"bb\": \"b1\",\n    \"cc\": [\n      \"c1\"\n    ],\n    \"dd\": \"d1\"\n  }\n}"
    ],
    "Timestamps": [
      1625037101.33
    ],
    "DetectionTimestamp": 1625037101.33,
    "LogLinesCount": 1,
    "AnnotatedMatchElement": "/model: {'aa': 'a1', 'fields': {'bb': 'b1', 'cc': ['c1'], 'dd': 'd1'}}\n  /model/a: a1\n  /model/fields: b1\n  /model/fields/c: c1"
  }
}

Now the value "b1" is stored in "/model/fields", which is not intuitive. Since there is no user-defined element name when using ALLOW_ALL, we could store these values in "/model/fields/allow_all" followed by a counter, or maybe under the JSON key of the field, e.g., "/model/fields/bb" in this case.

Even stranger, although "c1" is still correctly stored in "/model/fields/c", the value "d1" and the path "/model/fields/d" have disappeared from the parsed element (see AffectedLogAtomPaths). I assume this is a bug that needs to be fixed.
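
To make the second naming idea concrete (storing an ALLOW_ALL value under its JSON key, e.g., "/model/fields/bb"), here is a minimal Python sketch. It is not aminer code: `collect_paths` is a hypothetical helper, the key-parser dictionary is a plain dict mirroring the YAML above, and list values are stored as a whole rather than per item.

```python
import json

ALLOW_ALL = "ALLOW_ALL"  # marker string as written in the YAML config


def collect_paths(data, key_parser_dict, prefix):
    """Return {path: value} for every JSON field covered by key_parser_dict.

    Hypothetical helper, not part of aminer: named elements contribute their
    own name to the path, ALLOW_ALL fields fall back to the JSON key.
    """
    paths = {}
    for key, parser in key_parser_dict.items():
        if key not in data:
            continue
        value = data[key]
        if isinstance(parser, dict):
            # Nested object: descend and extend the prefix with the key.
            paths.update(collect_paths(value, parser, f"{prefix}/{key}"))
            continue
        if isinstance(parser, list):
            # List syntax ("cc": [c]): use the single entry as the element.
            parser = parser[0]
        name = key if parser == ALLOW_ALL else parser
        paths[f"{prefix}/{name}"] = value
    return paths


sample = json.loads('{"aa": "a1", "fields": {"bb": "b1", "cc": ["c1"], "dd": "d1"}}')
model = {"aa": "a", "fields": {"bb": ALLOW_ALL, "cc": ["c"], "dd": "d"}}

for path, value in collect_paths(sample, model, "/model").items():
    print(path, value)
```

With this naming, "b1" ends up under "/model/fields/bb" instead of "/model/fields", and "/model/fields/d" is still reported.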

ernstleierzopf commented 3 years ago

I have reproduced the results and concluded that this is a configuration error, not a bug. ALLOW_ALL is not intended for simple strings; only full objects and lists can be parsed with ALLOW_ALL.

I have changed the example config as follows:

       - id: json
         start: True
         type: JsonModelElement
         name: 'model'
         key_parser_dict:
           "aa": a
           fields:
             "bb": b
             "cc":
              - ALLOW_ALL
             "dd": d

Using ALLOW_ALL in the cc list works as intended, with the following results (note: I have already fixed the issue with the paths):

2021-07-12 12:18:38 New path(es) detected
NewMatchPathDetector: "DefaultNewMatchPathDetector" (1 lines)
  {
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 1,
    "AnalysisComponentType": "NewMatchPathDetector",
    "AnalysisComponentName": "DefaultNewMatchPathDetector",
    "Message": "New path(es) detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      "/model",
      "/model/a",
      "/model/fields/b",
      "/model/fields/cc",
      "/model/fields/d"
    ],
    "ParsedLogAtom": {
      "/model": {
        "aa": "a1",
        "fields": {
          "bb": "b1",
          "cc": [
            "c1"
          ],
          "dd": "d1"
        }
      },
      "/model/a": "a1",
      "/model/fields/b": "b1",
      "/model/fields/cc": "c1",
      "/model/fields/d": "d1"
    }
  },
  "LogData": {
    "RawLogData": [
      "{\n  \"aa\": \"a1\",\n  \"fields\": {\n    \"bb\": \"b1\",\n    \"cc\": [\n      \"c1\"\n    ],\n    \"dd\": \"d1\"\n  }\n}"
    ],
    "Timestamps": [
      1626085118.79
    ],
    "DetectionTimestamp": 1626085118.79,
    "LogLinesCount": 1,
    "AnnotatedMatchElement": "/model: {'aa': 'a1', 'fields': {'bb': 'b1', 'cc': ['c1'], 'dd': 'd1'}}\n  /model/a: a1\n  /model/fields/b: b1\n  /model/fields/cc: c1\n  /model/fields/d: d1"
  }
}

landauermax commented 3 years ago

Okay, I was not aware of that, and I think it is not obvious that ALLOW_ALL can only be used in that way. So either we extend ALLOW_ALL to also work for strings (and any other data, i.e., give it the same functionality as an AnyByteDataModelElement), or we make sure that unparsed atoms are generated when something that is not a list or object occurs in an ALLOW_ALL field. Which do you prefer?

ernstleierzopf commented 3 years ago

I have implemented the second option: only lists and objects are allowed. There is no reason to use ALLOW_ALL on simple strings, and we should let the user know. There is always a way to parse a string with a regular element.
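
The rule described here (ALLOW_ALL accepts only lists and objects, anything else yields an unparsed atom) can be illustrated with a small Python sketch. This is not the aminer implementation: `matches` is a hypothetical function, named elements are assumed to accept any string, and every configured key is assumed to be required.

```python
ALLOW_ALL = "ALLOW_ALL"  # marker string as written in the YAML config


def matches(value, parser):
    """Return True if value fits parser, False if it would be 'unparsed'.

    Hypothetical check, not aminer code.
    """
    if parser == ALLOW_ALL:
        # Only full objects and lists are allowed, never plain strings.
        return isinstance(value, (dict, list))
    if isinstance(parser, dict):
        if not isinstance(value, dict):
            return False
        # Assumption: every configured key must be present in the data.
        return all(k in value and matches(value[k], sub) for k, sub in parser.items())
    if isinstance(parser, list):
        if not isinstance(value, list):
            return False
        if parser[0] == ALLOW_ALL:
            return True  # "cc": [ALLOW_ALL] accepts any list content
        return all(matches(item, parser[0]) for item in value)
    # Named element such as 'b': assumed here to accept any string.
    return isinstance(value, str)


sample = {"aa": "a1", "fields": {"bb": "b1", "cc": ["c1"], "dd": "d1"}}

# ALLOW_ALL on the string field "bb": rejected under the new rule.
bad = {"aa": "a", "fields": {"bb": ALLOW_ALL, "cc": ["c"], "dd": "d"}}
# ALLOW_ALL inside the "cc" list: accepted.
good = {"aa": "a", "fields": {"bb": "b", "cc": [ALLOW_ALL], "dd": "d"}}

print(matches(sample, bad))
print(matches(sample, good))
```

Under these assumptions the first configuration from this issue is reported as unparsed, while the corrected configuration with ALLOW_ALL in the cc list matches.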