blacklanternsecurity / bbot

A recursive internet scanner for hackers.
https://www.blacklanternsecurity.com/bbot/
GNU General Public License v3.0

BBOT 2.0 URL Excavation TODOs #1503

Open TheTechromancer opened 5 days ago

TheTechromancer commented 5 days ago

The following are TODOs for our URL excavation:

liquidsec commented 4 days ago

"Tests to make sure we're excavating query parameters"

This already exists; there are a number of tests with the prefix TestExcavateParameterExtraction that cover this.
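
For anyone unfamiliar with what those tests exercise, here is a minimal, hypothetical sketch (not bbot's actual test code) of the kind of assertion a query-parameter excavation test makes, using only the standard library; `extract_query_parameters` is an illustrative helper, not a real bbot function:

```python
# Illustration only, not bbot's test suite: shows what "excavating query
# parameters" from a URL means at its simplest.
from urllib.parse import urlparse, parse_qs


def extract_query_parameters(url: str) -> dict[str, list[str]]:
    # Hypothetical helper: pull parameter names and values out of a URL's query string.
    return parse_qs(urlparse(url).query)


def test_query_parameter_excavation():
    url = "https://example.com/search?q=bbot&page=2"
    params = extract_query_parameters(url)
    assert params == {"q": ["bbot"], "page": ["2"]}
```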

liquidsec commented 4 days ago

As for the first point, we've discussed this some offline, but I'll summarize a few points for consideration:

  1. There is very little overlap; really only one YARA rule crosses over between the two. This is because most parameters are extracted in a way that doesn't touch the actual URL at all, for example in forms, in jQuery calls, etc.
  2. Parameter extraction has a lot more complexity, and it also isn't on by default. This lets us skip that complexity when we aren't doing anything with WEB_PARAMETER.
  3. It is extremely likely we'd actually add overall complexity by trying to merge the functionality: (as-simple-as-possible URL extraction) + (as-simple-as-possible parameter extraction) < very complex combined extraction.
  4. The YARA rules are all compiled, so the overhead of adding one more rule is very small, even if it does something very similar in one or two cases. The compilation step minimizes this overhead (see the sketch after this list).
  5. Clear logical separation. Since URLs go to completely different event types than parameters, and have very different rules, separating their post-processing logic will make everything significantly more maintainable.
  6. Slowed URL processing. URLs are handled more frequently, and adding parameter logic there means every URL extraction is going to take longer.
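
To illustrate point 4, here is a sketch using the yara-python package. These are not bbot's actual excavate rules, just stand-ins: two rules (one for bare URLs, one for URLs carrying a query string) are compiled into a single ruleset, so the second rule adds one compile pass rather than a separate scan.

```python
# Sketch under assumptions: illustrative rules only, not bbot's excavate YARA rules.
import yara

RULE_SOURCE = r"""
rule url_extraction {
    strings:
        $url = /https?:\/\/[^\s"'<>]+/
    condition:
        $url
}

rule url_with_query_parameters {
    strings:
        $url_q = /https?:\/\/[^\s"'<>?]+\?[^\s"'<>]+=[^\s"'<>]*/
    condition:
        $url_q
}
"""

# Compilation happens once; both rules are evaluated in the same scan pass.
rules = yara.compile(source=RULE_SOURCE)

data = '<a href="https://example.com/page?id=5">link</a>'
for match in rules.match(data=data):
    print(match.rule)  # url_extraction, url_with_query_parameters
```

Because the compiled ruleset scans the input once for all rules, the marginal cost of the near-duplicate rule is a slightly larger compiled object, not a second pass over every HTTP response.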