ALLOW_ALL causes that some fields are skipped

landauermax commented 3 years ago

I have the following sample JSON:

{
  "aa": "a1",
  "fields": {
    "bb": "b1",
    "cc": [
      "c1"
    ],
    "dd": "d1"
  }
}

And I use the following config:

LearnMode: True

LogResourceList:
  - "file:///home/ubuntu/sample.log"

Parser:

       - id: a
         type: VariableByteDataModelElement
         name: 'a'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: b
         type: VariableByteDataModelElement
         name: 'b'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: c
         type: VariableByteDataModelElement
         name: 'c'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: d
         type: VariableByteDataModelElement
         name: 'd'
         args: '.abcdefghijklmnopqrstuvwxyz1234567890ABCDEFGGHIJKLMNOPQRSTUVWXYZ'

       - id: json
         start: True
         type: JsonModelElement
         name: 'model'
         key_parser_dict:
           "aa": a
           fields:
             "bb": b
             "cc":
              - c
             "dd": d

Input:
        timestamp_paths: None
        verbose: True
        json_format: True

EventHandlers:
        - id: stpe
          json: true
          type: StreamPrinterEventHandler

Note that each field is mapped to an element. As expected, each field is correctly represented in the output:

2021-06-30 07:08:50 New path(es) detected
NewMatchPathDetector: "DefaultNewMatchPathDetector" (1 lines)
  {
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 1,
    "AnalysisComponentType": "NewMatchPathDetector",
    "AnalysisComponentName": "DefaultNewMatchPathDetector",
    "Message": "New path(es) detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      "/model",
      "/model/a",
      "/model/fields/b",
      "/model/fields/c",
      "/model/fields/d"
    ],
    "ParsedLogAtom": {
      "/model": {
        "aa": "a1",
        "fields": {
          "bb": "b1",
          "cc": [
            "c1"
          ],
          "dd": "d1"
        }
      },
      "/model/a": "a1",
      "/model/fields/b": "b1",
      "/model/fields/c": "c1",
      "/model/fields/d": "d1"
    }
  },
  "LogData": {
    "RawLogData": [
      "{\n  \"aa\": \"a1\",\n  \"fields\": {\n    \"bb\": \"b1\",\n    \"cc\": [\n      \"c1\"\n    ],\n    \"dd\": \"d1\"\n  }\n}"
    ],
    "Timestamps": [
      1625036930.55
    ],
    "DetectionTimestamp": 1625036930.55,
    "LogLinesCount": 1,
    "AnnotatedMatchElement": "/model: {'aa': 'a1', 'fields': {'bb': 'b1', 'cc': ['c1'], 'dd': 'd1'}}\n  /model/a: a1\n  /model/fields/b: b1\n  /model/fields/c: c1\n  /model/fields/d: d1"
  }
}

However, if I change the element of field "bb" to ALLOW_ALL, i.e.,

       - id: json
         start: True
         type: JsonModelElement
         name: 'model'
         key_parser_dict:
           "aa": a
           fields:
             "bb": ALLOW_ALL
             "cc":
              - c
             "dd": d

then the output changes as follows:

2021-06-30 07:11:41 New path(es) detected
NewMatchPathDetector: "DefaultNewMatchPathDetector" (1 lines)
  {
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 1,
    "AnalysisComponentType": "NewMatchPathDetector",
    "AnalysisComponentName": "DefaultNewMatchPathDetector",
    "Message": "New path(es) detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      "/model",
      "/model/a",
      "/model/fields",
      "/model/fields/c"
    ],
    "ParsedLogAtom": {
      "/model": {
        "aa": "a1",
        "fields": {
          "bb": "b1",
          "cc": [
            "c1"
          ],
          "dd": "d1"
        }
      },
      "/model/a": "a1",
      "/model/fields": "b1",
      "/model/fields/c": "c1"
    }
  },
  "LogData": {
    "RawLogData": [
      "{\n  \"aa\": \"a1\",\n  \"fields\": {\n    \"bb\": \"b1\",\n    \"cc\": [\n      \"c1\"\n    ],\n    \"dd\": \"d1\"\n  }\n}"
    ],
    "Timestamps": [
      1625037101.33
    ],
    "DetectionTimestamp": 1625037101.33,
    "LogLinesCount": 1,
    "AnnotatedMatchElement": "/model: {'aa': 'a1', 'fields': {'bb': 'b1', 'cc': ['c1'], 'dd': 'd1'}}\n  /model/a: a1\n  /model/fields: b1\n  /model/fields/c: c1"
  }
}

Now the value "b1" is stored in "/model/fields", which is not intuitive. Since ALLOW_ALL provides no user-defined element name, we could store these values under "/model/fields/allow_all" followed by a counter, or perhaps under the name of the JSON key, e.g., "/model/fields/bb" in this case.

Even stranger, although "c1" is still correctly stored in "/model/fields/c", the value "d1" and the path "/model/fields/d" have disappeared from the parsed output (see AffectedLogAtomPaths). I assume this is a bug that needs to be fixed.
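
The two naming options suggested above for ALLOW_ALL values can be sketched in a few lines of Python. This is only an illustration of the proposal, not aminer code: the helper names are hypothetical, and starting the counter at 1 is an assumption.

```python
from itertools import count


def path_with_counter(parent_path, counter):
    """Option 1: append 'allow_all' plus a running counter to the parent path."""
    return f"{parent_path}/allow_all{next(counter)}"


def path_with_key(parent_path, json_key):
    """Option 2: reuse the JSON key itself as the last path segment."""
    return f"{parent_path}/{json_key}"


counter = count(1)  # assumption: counting starts at 1
print(path_with_counter("/model/fields", counter))  # /model/fields/allow_all1
print(path_with_key("/model/fields", "bb"))         # /model/fields/bb
```

The second variant keeps paths stable when fields are reordered, because the path depends only on the key and not on how many ALLOW_ALL fields were seen before it.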

ernstleierzopf commented 3 years ago

I have reproduced the results and concluded that this is a configuration error, not a bug. ALLOW_ALL is not intended for simple strings; it can only parse full objects and lists.
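
To spot this kind of configuration error before running the parser, a sample record can be checked against the key_parser_dict. The following is a minimal standalone sketch, not part of aminer: the function name is hypothetical, the config is written as a plain Python dict with element ids as strings, and only keys mapped directly to ALLOW_ALL are inspected (entries inside lists are skipped).

```python
import json

ALLOW_ALL = "ALLOW_ALL"  # plain string marker, as written in the YAML config


def scalar_allow_all_keys(key_parser_dict, record, prefix=""):
    """Return paths of keys mapped directly to ALLOW_ALL whose value is neither an object nor a list."""
    problems = []
    for key, spec in key_parser_dict.items():
        if key not in record:
            continue
        value = record[key]
        path = f"{prefix}/{key}"
        if spec == ALLOW_ALL and not isinstance(value, (dict, list)):
            problems.append(path)
        elif isinstance(spec, dict) and isinstance(value, dict):
            # Descend into nested objects such as "fields".
            problems.extend(scalar_allow_all_keys(spec, value, path))
    return problems


sample = json.loads('{"aa": "a1", "fields": {"bb": "b1", "cc": ["c1"], "dd": "d1"}}')
bad_config = {"aa": "a", "fields": {"bb": ALLOW_ALL, "cc": ["c"], "dd": "d"}}
good_config = {"aa": "a", "fields": {"bb": "b", "cc": [ALLOW_ALL], "dd": "d"}}

print(scalar_allow_all_keys(bad_config, sample))   # ['/fields/bb']
print(scalar_allow_all_keys(good_config, sample))  # []
```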

I have changed the example config as follows:

       - id: json
         start: True
         type: JsonModelElement
         name: 'model'
         key_parser_dict:
           "aa": a
           fields:
             "bb": b
             "cc":
              - ALLOW_ALL
             "dd": d

Using ALLOW_ALL in the cc list works as intended, with the following results (note: I have already fixed the issue with the paths):

2021-07-12 12:18:38 New path(es) detected
NewMatchPathDetector: "DefaultNewMatchPathDetector" (1 lines)
  {
  "AnalysisComponent": {
    "AnalysisComponentIdentifier": 1,
    "AnalysisComponentType": "NewMatchPathDetector",
    "AnalysisComponentName": "DefaultNewMatchPathDetector",
    "Message": "New path(es) detected",
    "PersistenceFileName": "Default",
    "AffectedLogAtomPaths": [
      "/model",
      "/model/a",
      "/model/fields/b",
      "/model/fields/cc",
      "/model/fields/d"
    ],
    "ParsedLogAtom": {
      "/model": {
        "aa": "a1",
        "fields": {
          "bb": "b1",
          "cc": [
            "c1"
          ],
          "dd": "d1"
        }
      },
      "/model/a": "a1",
      "/model/fields/b": "b1",
      "/model/fields/cc": "c1",
      "/model/fields/d": "d1"
    }
  },
  "LogData": {
    "RawLogData": [
      "{\n  \"aa\": \"a1\",\n  \"fields\": {\n    \"bb\": \"b1\",\n    \"cc\": [\n      \"c1\"\n    ],\n    \"dd\": \"d1\"\n  }\n}"
    ],
    "Timestamps": [
      1626085118.79
    ],
    "DetectionTimestamp": 1626085118.79,
    "LogLinesCount": 1,
    "AnnotatedMatchElement": "/model: {'aa': 'a1', 'fields': {'bb': 'b1', 'cc': ['c1'], 'dd': 'd1'}}\n  /model/a: a1\n  /model/fields/b: b1\n  /model/fields/cc: c1\n  /model/fields/d: d1"
  }
}

landauermax commented 3 years ago

Okay, I was not aware of that, and I think it is not obvious that ALLOW_ALL can only be used in that way. So either we extend ALLOW_ALL to also work for strings (and any other data, i.e., give it the same functionality as an AnyByteDataModelElement), or we make sure that unparsed atoms are generated when something that is not a list or an object occurs in an ALLOW_ALL field. What do you prefer?

ernstleierzopf commented 3 years ago

I have implemented the second option: only lists and objects are allowed. There is absolutely no reason to use ALLOW_ALL on simple strings, and we should let the user know. There is always a way to parse a string.
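
The behaviour described here can be pictured with a toy stand-in for the ALLOW_ALL check. It is an assumption-laden sketch rather than the actual implementation: the function name and the convention of returning None for "no match, so the line ends up unparsed" are made up for illustration.

```python
def parse_allow_all(path, value):
    """Accept objects and lists as-is; return None for anything else (treated as unparsed)."""
    if isinstance(value, (dict, list)):
        return {path: value}
    return None


# A list value is accepted under ALLOW_ALL.
print(parse_allow_all("/model/fields/cc", ["c1"]))  # {'/model/fields/cc': ['c1']}

# A plain string is rejected, so the user notices the misconfiguration.
print(parse_allow_all("/model/fields/bb", "b1"))    # None
```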

ait-aecid / logdata-anomaly-miner

ALLOW_ALL causes that some fields are skipped #775