edoardottt / cariddi

Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
https://edoardoottavianelli.it
GNU General Public License v3.0
1.49k stars, 152 forks

JSON lines aggregate results #126

Closed edoardottt closed 11 months ago

edoardottt commented 1 year ago

This PR closes #115.

@ocervell what do u think?

For comparison, this is the output for the same test shown in the issue:

{
  "url": "http://testphp.vulnweb.com/",
  "method": "GET",
  "status_code": 200,
  "words": 388,
  "lines": 110,
  "content_type": "text/html",
  "matches": {
    "infos": {
      "Email address": [
        "wvs@acunetix.com"
      ],
      "HTML comment": [
        "<!-- InstanceEndEditable -->",
        "<!-- here goes headers headers -->",
        "<!-- end masthead -->",
        "<!-- begin content -->",
        "<!--end content -->",
        "<!--end navbar -->",
        "<!-- InstanceEnd -->"
      ]
    }
  }
}
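
For readers who want to see how flat per-match records could be folded into one aggregated JSON line like the one above, here is a minimal Go sketch. The `Finding`, `Matches`, and `Result` types and the `AggregateInfos` and `ToJSONLine` helpers are hypothetical names chosen for illustration; they are not taken from cariddi's code base, and only the JSON field names come from the sample.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Finding is a hypothetical flat record: one match of one named pattern.
type Finding struct {
	Name  string
	Match string
}

// Matches groups findings by category; only "infos" appears in the sample.
type Matches struct {
	Infos map[string][]string `json:"infos,omitempty"`
}

// Result mirrors the field names of the sample JSON line.
type Result struct {
	URL         string  `json:"url"`
	Method      string  `json:"method"`
	StatusCode  int     `json:"status_code"`
	Words       int     `json:"words"`
	Lines       int     `json:"lines"`
	ContentType string  `json:"content_type"`
	Matches     Matches `json:"matches"`
}

// AggregateInfos groups match strings under their pattern name,
// preserving the input order of matches within each name.
func AggregateInfos(findings []Finding) map[string][]string {
	out := make(map[string][]string)
	for _, f := range findings {
		out[f.Name] = append(out[f.Name], f.Match)
	}
	return out
}

// ToJSONLine encodes one result as a single newline-terminated JSON line.
// HTML escaping is disabled so "<!-- ... -->" is not written as \u003c escapes.
func ToJSONLine(r Result) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(false)
	if err := enc.Encode(r); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	findings := []Finding{
		{Name: "Email address", Match: "wvs@acunetix.com"},
		{Name: "HTML comment", Match: "<!-- begin content -->"},
		{Name: "HTML comment", Match: "<!--end content -->"},
	}
	line, err := ToJSONLine(Result{
		URL:         "http://testphp.vulnweb.com/",
		Method:      "GET",
		StatusCode:  200,
		Words:       388,
		Lines:       110,
		ContentType: "text/html",
		Matches:     Matches{Infos: AggregateInfos(findings)},
	})
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Print(line)
}
```

Keying the map by pattern name means each URL produces exactly one line, however many matches it has. `encoding/json` writes map keys in sorted order, so the output is stable across runs.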

edoardottt commented 1 year ago

Thanks for the advice :)) appreciated!!

ocervell commented 11 months ago

Adding another two cents from my usage of cariddi recently:

I came across huge matches like:

[
  {
    "name":"PHP error",
    "match":"PHP error"
  },
  {
    "name":"MySQL error",
    "match":"warning_forbid_default_priv<MORE THAN 20000 LINES HERE>"
  }
]

which completely destroys my terminal 😄

So we might think about either:

We could end up with a JSON format like:

[
  {
     "name": "MySQL Error",
     "results": [
        {
           "type": "Regex",
           "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "regex": "(?i)Warning.*?mysqli?", "location": "line 42", "source": "body"}
        }
     ]
   }
]
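
The truncation idea in the format above could be sketched in Go roughly as follows. This is only an illustration under stated assumptions: `TruncateMatch`, the `Details`, `TypedResult`, and `Finding` types, and the limit of 16 runes are all hypothetical, and the `<truncated_output>` marker is simply borrowed from the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// truncMarker is appended whenever a match has been cut short.
const truncMarker = "<truncated_output>"

// TruncateMatch keeps at most maxRunes runes of match and appends the marker
// when something was cut. A negative maxRunes disables truncation.
func TruncateMatch(match string, maxRunes int) string {
	runes := []rune(match) // work on runes so multi-byte characters are never split
	if maxRunes < 0 || len(runes) <= maxRunes {
		return match
	}
	return string(runes[:maxRunes]) + truncMarker
}

// Details, TypedResult and Finding mirror the proposed JSON layout.
type Details struct {
	Match    string `json:"match"`
	Regex    string `json:"regex"`
	Location string `json:"location"`
	Source   string `json:"source"`
}

type TypedResult struct {
	Type    string  `json:"type"`
	Details Details `json:"details"`
}

type Finding struct {
	Name    string        `json:"name"`
	Results []TypedResult `json:"results"`
}

func main() {
	f := Finding{
		Name: "MySQL Error",
		Results: []TypedResult{{
			Type: "Regex",
			Details: Details{
				Match:    TruncateMatch("Warning: mysqli error: need new cache refresh", 16),
				Regex:    "(?i)Warning.*?mysqli?",
				Location: "line 42",
				Source:   "body",
			},
		}},
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetEscapeHTML(false) // keep "<truncated_output>" readable in the output
	enc.SetIndent("", "  ")
	if err := enc.Encode([]Finding{f}); err != nil {
		fmt.Fprintln(os.Stderr, "encode error:", err)
	}
}
```

This sketch only cuts the tail of a match. The example above also elides text in the middle, which would need a slightly different helper.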

Additionally, regexes have their limits. Ideally we want to go one step further and create some kind of pattern-recognition algorithm, or even use ML for this kind of task. It could be a good evolution for cariddi ;) The type key would be useful in that case to differentiate those matches from regex matches:

[
  {
    "type": "PatternFinder",
    "details": {"match": "Warning: ...<truncated_output>mysqli error: need new cache refresh... <truncated_output>", "matcher": "error-finder", "version": "2.0.1"}
  },
  {
    "type": "ML",
    "details": {"model_name": "my-awesome-ml-model", "version": "0.0.1"}
  }
]
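
On the consuming side, a `type` key like this lets a reader decode `details` only once the type is known. The Go sketch below shows one possible way to do that with `json.RawMessage`. The struct and function names (`TypedResult`, `PatternDetails`, `MLDetails`, `Describe`, `DescribeAll`) are assumptions made for illustration and not part of cariddi.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TypedResult defers decoding of "details" until "type" has been inspected.
type TypedResult struct {
	Type    string          `json:"type"`
	Details json.RawMessage `json:"details"`
}

// PatternDetails matches the "PatternFinder" example.
type PatternDetails struct {
	Match   string `json:"match"`
	Matcher string `json:"matcher"`
	Version string `json:"version"`
}

// MLDetails matches the "ML" example.
type MLDetails struct {
	ModelName string `json:"model_name"`
	Version   string `json:"version"`
}

// Describe decodes the details according to the type and returns a short summary.
func Describe(r TypedResult) (string, error) {
	switch r.Type {
	case "PatternFinder":
		var d PatternDetails
		if err := json.Unmarshal(r.Details, &d); err != nil {
			return "", err
		}
		return fmt.Sprintf("pattern matcher %s (version %s)", d.Matcher, d.Version), nil
	case "ML":
		var d MLDetails
		if err := json.Unmarshal(r.Details, &d); err != nil {
			return "", err
		}
		return fmt.Sprintf("ML model %s (version %s)", d.ModelName, d.Version), nil
	default:
		return "", fmt.Errorf("unknown result type %q", r.Type)
	}
}

// DescribeAll parses a JSON array of typed results and summarizes each entry.
func DescribeAll(data []byte) ([]string, error) {
	var results []TypedResult
	if err := json.Unmarshal(data, &results); err != nil {
		return nil, err
	}
	out := make([]string, 0, len(results))
	for _, r := range results {
		s, err := Describe(r)
		if err != nil {
			return nil, err
		}
		out = append(out, s)
	}
	return out, nil
}

func main() {
	input := []byte(`[
	  {"type": "PatternFinder", "details": {"match": "Warning: ...", "matcher": "error-finder", "version": "2.0.1"}},
	  {"type": "ML", "details": {"model_name": "my-awesome-ml-model", "version": "0.0.1"}}
	]`)
	summaries, err := DescribeAll(input)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, s := range summaries {
		fmt.Println(s)
	}
}
```

In this sketch an unknown type is reported as an error. A consumer could instead skip entries it does not understand, so that new matcher types do not break older tooling.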

There is also room to improve the findings by filtering which ones are found important or not, for instance:

Those "rules" could first be hardcoded by us on a case-by-case basis and then learned by ML at some point, and a severity field could be set for each finding.
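
A hardcoded first version of such rules might look like the Go sketch below. The severity levels and the mapping from finding names to severities are invented purely for illustration. Only the finding names themselves are taken from the examples earlier in this thread, and `Severity`, `hardcodedRules`, and `SeverityFor` are hypothetical names.

```go
package main

import "fmt"

// Severity is a hypothetical severity label attached to each finding.
type Severity string

const (
	SeverityInfo Severity = "info"
	SeverityLow  Severity = "low"
	SeverityHigh Severity = "high"
)

// hardcodedRules maps a finding name to a severity.
// These assignments are illustrative assumptions, not agreed-upon values.
var hardcodedRules = map[string]Severity{
	"HTML comment":  SeverityInfo,
	"Email address": SeverityLow,
	"MySQL error":   SeverityHigh,
}

// SeverityFor returns the hardcoded severity for a finding name,
// falling back to "info" when no rule exists.
func SeverityFor(name string) Severity {
	if s, ok := hardcodedRules[name]; ok {
		return s
	}
	return SeverityInfo
}

func main() {
	for _, name := range []string{"MySQL error", "HTML comment", "Something else"} {
		fmt.Printf("%s -> %s\n", name, SeverityFor(name))
	}
}
```

A lookup table like this is easy to review case by case, and a learned model could later replace `SeverityFor` without changing the output format.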

There might be a need to create separate issues for some of those points, since they are not directly linked to the JSON lines aggregation. Feel free to copy-paste some of my comments there.

edoardottt commented 11 months ago

Hi @ocervell.

I've thought a lil bit before commenting on this. Imo the best thing to do is this: