JSON lines aggregate results

edoardottt commented 1 year ago

This PR closes #115.

@ocervell what do u think?

For comparison, this is the output for the same test shown in the issue:

{
  "url": "http://testphp.vulnweb.com/",
  "method": "GET",
  "status_code": 200,
  "words": 388,
  "lines": 110,
  "content_type": "text/html",
  "matches": {
    "infos": {
      "Email address": [
        "wvs@acunetix.com"
      ],
      "HTML comment": [
        "<!-- InstanceEndEditable -->",
        "<!-- here goes headers headers -->",
        "<!-- end masthead -->",
        "<!-- begin content -->",
        "<!--end content -->",
        "<!--end navbar -->",
        "<!-- InstanceEnd -->"
      ]
    }
  }
}
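A minimal Go sketch of the kind of aggregation shown above: flat per-match records are grouped by finding name into one object. The `Finding` type and the `Aggregate` function are hypothetical names for illustration and are not taken from the cariddi code base.

```go
package main

import (
	"encoding/json"
	"os"
)

// Finding is a hypothetical flat record: one entry per match,
// as it might look before aggregation.
type Finding struct {
	Name  string
	Match string
}

// Aggregate groups matches by finding name. Matches keep their
// first-seen order within each group and exact duplicates are dropped.
func Aggregate(findings []Finding) map[string][]string {
	grouped := make(map[string][]string)
	seen := make(map[Finding]bool)
	for _, f := range findings {
		if seen[f] {
			continue
		}
		seen[f] = true
		grouped[f.Name] = append(grouped[f.Name], f.Match)
	}
	return grouped
}

func main() {
	findings := []Finding{
		{"Email address", "wvs@acunetix.com"},
		{"HTML comment", "<!-- end masthead -->"},
		{"HTML comment", "<!-- begin content -->"},
		{"HTML comment", "<!-- end masthead -->"}, // duplicate, dropped
	}
	out := map[string]map[string]map[string][]string{
		"matches": {"infos": Aggregate(findings)},
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetEscapeHTML(false) // keep "<!--" readable instead of \u003c!--
	enc.SetIndent("", "  ")
	if err := enc.Encode(out); err != nil {
		panic(err)
	}
}
```

Dropping duplicates is a design choice of this sketch, not something the output above confirms either way.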

edoardottt commented 1 year ago

Thanks for the advice :)) appreciated!!

ocervell commented 11 months ago

Adding another two cents from my usage of cariddi recently:

I came across huge matches like:

[
  {
    "name":"PHP error",
    "match":"PHP error"
  },
  {
    "name":"MySQL error",
    "match":"warning_forbid_default_priv<MORE THAN 20000 LINES HERE>"
  }
]

which completely destroy my terminal 😄

So we might think about:

truncating the output a bit when a regex matches, and maybe adding a CLI flag / env variable to control the truncation character limit
improving the regexes so that they don't match too much but only up to a few lines before / after (maybe up to the next newline, though I'm not sure how that would work for e.g. Python tracebacks) -> I'm sure we can find a way to do better ;) Adding regex matching tests would probably help
adding which regex matched to the output: for instance, MySQL error is composed of multiple regexes, and it would help to know which one of them matched
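The first idea (truncation with a configurable limit) could look roughly like the Go sketch below. Everything here is an assumption for illustration: the `Truncate` and `LimitFromEnv` helpers, the `MATCH_MAX_CHARS` variable name, and the default of 200 do not exist in cariddi.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// truncatedMarker is appended to any match that was cut short.
const truncatedMarker = "<truncated_output>"

// Truncate shortens s to at most limit runes and appends a marker.
// A limit <= 0 disables truncation. Counting runes (not bytes) avoids
// cutting a multi-byte UTF-8 character in half.
func Truncate(s string, limit int) string {
	if limit <= 0 {
		return s
	}
	r := []rune(s)
	if len(r) <= limit {
		return s
	}
	return string(r[:limit]) + truncatedMarker
}

// LimitFromEnv reads the limit from an environment variable
// (hypothetical name) and falls back to def when unset or invalid.
func LimitFromEnv(name string, def int) int {
	v, err := strconv.Atoi(os.Getenv(name))
	if err != nil {
		return def
	}
	return v
}

func main() {
	limit := LimitFromEnv("MATCH_MAX_CHARS", 200)
	fmt.Println(Truncate("warning_forbid_default_priv ...", limit))
}
```

A CLI flag could feed the same `limit` value; the environment variable is used here only to keep the sketch short.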

We could end up with a JSON format like:

[
  {
     "name": "MySQL Error",
     "results": [
        {
           "type": "Regex",
           "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "regex": "(?i)Warning.*?mysqli?", "location": "line 42", "source": "body"}
        }
     ]
   }
]
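As a rough Go sketch of how this format could be produced while also reporting which regex matched: each finding carries a list of compiled patterns, and every match records the pattern source. The type names, the `Scan` helper, and the second pattern are made up for illustration; only the first regex comes from the example above. The `location` field is left out to keep the sketch small.

```go
package main

import (
	"encoding/json"
	"os"
	"regexp"
)

// Details, Result and Finding mirror the JSON shape proposed above.
type Details struct {
	Match  string `json:"match"`
	Regex  string `json:"regex"`
	Source string `json:"source"`
}

type Result struct {
	Type    string  `json:"type"`
	Details Details `json:"details"`
}

type Finding struct {
	Name    string   `json:"name"`
	Results []Result `json:"results"`
}

// mysqlErrorPatterns is an illustrative set, not cariddi's real one.
var mysqlErrorPatterns = []*regexp.Regexp{
	regexp.MustCompile(`(?i)Warning.*?mysqli?`),
	regexp.MustCompile(`(?i)You have an error in your SQL syntax`),
}

// Scan runs every pattern against body and records, for each match,
// the source text of the pattern that produced it.
func Scan(name string, patterns []*regexp.Regexp, body string) Finding {
	f := Finding{Name: name, Results: []Result{}}
	for _, re := range patterns {
		for _, m := range re.FindAllString(body, -1) {
			f.Results = append(f.Results, Result{
				Type:    "Regex",
				Details: Details{Match: m, Regex: re.String(), Source: "body"},
			})
		}
	}
	return f
}

func main() {
	body := "<b>Warning</b>: mysqli_connect(): access denied"
	enc := json.NewEncoder(os.Stdout)
	enc.SetEscapeHTML(false)
	enc.SetIndent("", "  ")
	findings := []Finding{Scan("MySQL Error", mysqlErrorPatterns, body)}
	if err := enc.Encode(findings); err != nil {
		panic(err)
	}
}
```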

Additionally, regexes have their limits. Ideally we want to go one step further and create some kind of pattern-recognition algorithm, or even use ML for this kind of task. It could be a good evolution for cariddi ;) The type key would be useful in that case to differentiate those matches from regex matches:

[
  {
    "type": "PatternFinder",
    "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "matcher": "error-finder", "version": "2.0.1"}
  },
  {
    "type": "ML",
    "details": {"model_name": "my-awesome-ml-model", "version": "0.0.1"}
  }
]

There is also room to improve the findings by filtering which ones are found important or not, for instance:

an HTML comment containing "TODO / DO THIS LATER / PASSWORD / etc..." is important
an HTML comment containing a software version is important
an HTML comment like "\" is not important
an email starting with licensing@<domain> or sales@<domain> is very common and not very sensitive
an error / exception with an actual traceback is very sensitive etc...

Those "rules" could first be hardcoded by us on a case-by-case basis and then learned by ML at some point, and a severity field could be set for each finding.
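A first hardcoded version of such rules might look like the Go sketch below. The three severity levels, the `Classify` function, the keyword and version patterns, and the "Python traceback" finding name are all assumptions made for illustration; the thread only describes findings as more or less important.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Severity is a hypothetical three-level scale.
type Severity string

const (
	Low    Severity = "low"
	Medium Severity = "medium"
	High   Severity = "high"
)

var (
	// Keywords that make an HTML comment worth a closer look.
	keywordRe = regexp.MustCompile(`(?i)\b(todo|do this later|password)\b`)
	// A loose version-number pattern such as 5.8 or v2.0.1.
	versionRe = regexp.MustCompile(`\bv?\d+\.\d+(\.\d+)?\b`)
)

// Classify assigns a severity from the finding name and matched text.
// Anything without a specific rule defaults to Medium.
func Classify(name, match string) Severity {
	switch name {
	case "HTML comment":
		if keywordRe.MatchString(match) || versionRe.MatchString(match) {
			return High
		}
		return Low
	case "Email address":
		lower := strings.ToLower(match)
		if strings.HasPrefix(lower, "licensing@") || strings.HasPrefix(lower, "sales@") {
			return Low
		}
	case "Python traceback":
		if strings.Contains(match, "Traceback (most recent call last)") {
			return High
		}
	}
	return Medium
}

func main() {
	fmt.Println(Classify("HTML comment", "<!-- TODO: remove test password -->"))
	fmt.Println(Classify("HTML comment", "<!-- end masthead -->"))
	fmt.Println(Classify("Email address", "sales@example.com"))
}
```

Keeping the rules in one function makes it easy to replace them later with a learned model that returns the same `Severity` values.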

There might be a need to create separate issues for some of those points, since they're not directly linked to the JSON lines aggregation. Feel free to copy-paste some of my comments there.

edoardottt commented 11 months ago

Hi @ocervell.

I've thought a lil bit before commenting on this. Imo the best thing to do is this:

Close issue #115 and close this PR too, since the current implementation is better than the proposed one.
Open a discussion with your last comment so that you and I (and the GH community) can leave thoughts there, since it's a better place to discuss features.

edoardottt / cariddi

JSON lines aggregate results #126