blacklanternsecurity / bbot

A recursive internet scanner for hackers.
https://www.blacklanternsecurity.com/bbot/
GNU General Public License v3.0

[BBOT] Elastic Integration QOL Update #1232

Closed · CarsonHrusovsky closed this 6 months ago

CarsonHrusovsky commented 6 months ago

Hello!

I have nearly finished the integration for BBOT with Elastic. While applying the finishing touches, it has come to light that the field names we pull from the output.ndjson file are selectively capitalized.

Path: bbot/bbot/modules/output/json.py
Branch: siem-friendly-json

Would it be possible for this module to write field names to that file in lowercase? Additionally, is there an update on when this will reach production?

PR: https://github.com/elastic/integrations/pull/9427 (I will try and keep this issue post up to date).

The code can be seen here:

    async def handle_event(self, event):
        event_json = dict(event)
        if self.siem_friendly:
            event_json["data"] = {event.type: event_json.pop("data", "")}
        event_str = json.dumps(event_json)
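For illustration, the change we're hoping for could be as small as lowercasing the event type key when siem_friendly is set. This is just a sketch from my side, not tested against the codebase:

    async def handle_event(self, event):
        event_json = dict(event)
        if self.siem_friendly:
            # lowercasing event.type here would yield fields like data.vulnerability.url
            event_json["data"] = {event.type.lower(): event_json.pop("data", "")}
        event_str = json.dumps(event_json)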

TheTechromancer commented 6 months ago

siem-friendly-json is already in stable. Which field names are causing the trouble?

CarsonHrusovsky commented 6 months ago

There are quite a few fields. In its current rendition, any field under data causes an issue. For example, if we have a vulnerability finding URL, the JSON parser we use within Elastic will create the field "data.VULNERABILITY.url". This is because within the output.ndjson document, we see:

"data": {"VULNERABILITY": {"url": "someFinding.url"

It would work well for our integration if these field names were always lower case, like this:

"data": {"vulnerability": {"url":

If this is not a simple fix, I can make it work for now, but as development on BBOT continues, our integration will become obsolete and require more regular updates.

TheTechromancer commented 6 months ago

Event types in BBOT are always uppercase. There shouldn't be any inconsistency in their capitalization; if there is, it's a bug.

Is there a specific reason why you need them to be lowercase?

CarsonHrusovsky commented 6 months ago

I don't believe there is a bug, but I was afraid we might run into something like this. Elastic fields are always lowercase, as far as I can tell, and this was specifically requested in our pull request. What I will do is manually lowercase every field as part of the integration, but then we will run into the issue I mentioned above. Hopefully the community will help maintain this, and I will do my best. Thank you techro
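For reference, the kind of ingest pipeline processor I have in mind looks roughly like this. It's a sketch only, and the bbot.data field path is an assumption based on how our integration names things:

    - script:
        description: Lowercase the event type key under data (sketch, untested)
        if: ctx.bbot?.data != null
        lang: painless
        source: |
            // rebuild the map, lowercasing each top-level key
            Map lowered = new HashMap();
            for (def entry : ctx.bbot.data.entrySet()) {
              lowered[entry.getKey().toLowerCase()] = entry.getValue();
            }
            ctx.bbot.data = lowered;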

TheTechromancer commented 6 months ago

Elastic fields aren't required to be lowercase, but it's a common convention, and I'm happy to add it as part of the siem_friendly setting.

On a side note, can you help me understand the need for manually defining all the BBOT event types?

Does elastic need a schema before ingesting the data? Relying on static definitions like this will make it a lot of work to maintain and will introduce problems if new event types are created or existing ones are renamed/restructured.

CarsonHrusovsky commented 6 months ago

Defining the types in this case is important for querying data and what you can do with that data. Importantly if in the future we run into a field that isn't mapped, it isn't the end of the world as the JSON parser will take a guess at what data type it is supposed to be. (At least this is how the JSON parser works today, I will admit in a official integration I am unsure if this behavior would be different). Regardless, I will plan to map every field that we use today, and will update as we regularly use this tool. This will at least keep it up to date with some consistency, but we will be using a specific scan so this may leave other users in the dust.

If a field is renamed that may cause a lot of issues and this is a problem we were aware of going into this. I will have to think if there is a better solution to that problem, or if there is a scripting mechanism we can include to handle this issue. I am going to ping a familiar face and get his input as well.

@nicpenning

nicpenning commented 6 months ago

Elastic fields aren't required to be lowercase, but it's a common convention, and I'm happy to add it as part of the siem_friendly setting.

On a side note, can you help me understand the need for manually defining all the BBOT event types?

Does elastic need a schema before ingesting the data? Relying on static definitions like this will make it a lot of work to maintain and will introduce problems if new event types are created or existing ones are renamed/restructured.

Any field can be dynamically mapped, so it's not a requirement. However, if the fields are dynamically mapped to a data type that doesn't quite align, such as a string vs an IP address or a keyword vs a date, then the visualization and search layers may not be as fruitful.

The integration will be built to handle any new fields/data.

As far as upper vs lower, these fields can be mapped to lowercase versions on the Elastic side if that better conforms to the schema, but it shouldn't need to be done at the source if the types are supposed to be all caps by design. It's actually pretty cool that the event type is in the field, which makes it easy for us to distinguish event types between the results.

Anyways, 100% schema mapping isn't needed, but long term it can pay dividends for slicing and dicing the data over time. As new fields appear in data sets from new versions, the Elastic tooling for building integrations will find them easily.
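For example, an explicit mapping along these lines (field names here are purely illustrative, not the actual integration schema) is what makes IP- and date-aware queries behave properly:

    {
      "mappings": {
        "properties": {
          "bbot": {
            "properties": {
              "timestamp": { "type": "date" },
              "type":      { "type": "keyword" },
              "host_ip":   { "type": "ip" }
            }
          }
        }
      }
    }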

Hope this helps! 🙏🏻

CarsonHrusovsky commented 6 months ago

Marking as resolved. Thanks techro

colin-stubbs commented 2 months ago

@CarsonHrusovsky if you're not currently working on it, I'd like to add an HTTP JSON input option to the integration. Any overlap with what you're doing with it at the moment?

CarsonHrusovsky commented 2 months ago

Yes, this has been something of interest, as you would be able to track BBOT findings live. I haven't looked that far into it, but we should be able to use an HTTP JSON input in Elastic with the HTTP output in BBOT. (We had to make some changes to the JSON format in the past to get this to work with the original integration release; unsure if we will run into the same issue here again.) I am gearing up for the 1.0 release of the integration here soonish, and this is something I can try to add as part of that. Noteworthy that BBOT 2.0 is looming, and that will require breaking changes as well.

colin-stubbs commented 2 months ago

@CarsonHrusovsky - we can definitely do it. I already am, just via a custom http_endpoint listener in the Elastic Agent policy and the bbot http output module with basic auth (and obviously with siem_friendly==True).
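For anyone following along, the bbot side of that is roughly the following config. The URL and credentials are placeholders, and the option names are from memory, so double-check them against the bbot docs:

    output_modules:
      http:
        url: https://elastic-agent.example.internal:8080/
        username: bbot
        password: changeme
        siem_friendly: true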

No modifications were necessary with bbot-1.x, though yes, from the look of the new bbot-2.x events, some adjustments might be required.

I've written a bunch of Elastic Agent integrations and am willing and able to add this, assuming you haven't already started on it? Either way, happy to contribute; I've got a fair bit of experience writing ingest pipelines to manipulate data at ingest time.

CarsonHrusovsky commented 2 months ago

@colin-stubbs go for it - I will try to get myself in this space next week and see if I can finish the QoL updates I was looking into. It shouldn't impact anything on your end for the bbot http output - let me know if I can assist further! :)

colin-stubbs commented 2 months ago

@CarsonHrusovsky

I've got this almost done; mostly everything appears to be fine.

There's one issue, though, with a new field, "discovery_path", which bbot v2.0 includes in siem_friendly=True JSON as a list of lists.

This may or may not work in Elasticsearch itself, but in terms of building a valid Elastic Agent package for release, I don't believe it will work, as there's no way that I'm aware of to define the field format for this kind of list inside a list.

e.g. this is the original bbot JSON event snippet; it's a no-go as best I can tell.

                "discovery_path": [
                    [
                        "DNS_NAME:1e57014aa7b0715bca68e4f597204fc4e1e851fc",
                        "Scan devious_edna seeded with DNS_NAME: blacklanternsecurity.com"
                    ],
                    [
                        "ORG_STUB:85519fabe82bd286159b7bfcf5c72139a563135b",
                        "speculated ORG_STUB: blacklanternsecurity"
                    ]
                ],

This is what I'm now converting it to,

                "discovery_path": {
                    "0": [
                        "DNS_NAME:1e57014aa7b0715bca68e4f597204fc4e1e851fc",
                        "Scan devious_edna seeded with DNS_NAME: blacklanternsecurity.com"
                    ],
                    "1": [
                        "ORG_STUB:85519fabe82bd286159b7bfcf5c72139a563135b",
                        "speculated ORG_STUB: blacklanternsecurity"
                    ]
                },

Using the following painless (which was extremely painful yet again) script:

- script:
    description: Convert the outer array to an object
    if: ctx.bbot?.discovery_path != null
    lang: painless
    source: |
        Map _discovery_path = new HashMap();
        for (int i=0; i < ctx.bbot.discovery_path.length; i++) {
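           // String.format in painless takes its arguments as an explicit def[] array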
           _discovery_path[String.format('%d', new def[] {i})] = ctx.bbot.discovery_path[i];
        }
        ctx._discovery_path = _discovery_path;

- rename:
    field: _discovery_path
    target_field: bbot.discovery_path
    ignore_missing: true
    override: true

@TheTechromancer - FYI'ing you because of the array/list inside array/list issue... thoughts? Is the current format for discovery path set in stone? Could it be generated in a friendlier way instead?

TheTechromancer commented 2 months ago

The format is still open to change. Is it just that elastic's parsing doesn't support nested arrays? If so, the best solution might be to break those into two separate arrays.
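Something like this, reusing the data from your example (field names not final):

    "discovery_path": [
        "DNS_NAME:1e57014aa7b0715bca68e4f597204fc4e1e851fc",
        "ORG_STUB:85519fabe82bd286159b7bfcf5c72139a563135b"
    ],
    "discovery_context": [
        "Scan devious_edna seeded with DNS_NAME: blacklanternsecurity.com",
        "speculated ORG_STUB: blacklanternsecurity"
    ]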

colin-stubbs commented 2 months ago

Elasticsearch will accept a document with it if there's no index template that defines the field path as something else... but I suspect the behaviour may be unpredictable. E.g. below, it's showing the source object correctly, but it has condensed the inner lists into a single list under fields.

{
  "_index": "test",
  "_id": "4MDJXJEBUtq80pfClLV2",
  "_version": 1,
  "_score": 0,
  "_source": {
    "json": {
      "discovery_path": [
        [
          "a",
          "b"
        ],
        [
          "c",
          "d"
        ]
      ]
    }
  },
  "fields": {
    "json.discovery_path": [
      "a",
      "b",
      "c",
      "d"
    ],
    "json.discovery_path.keyword": [
      "a",
      "b",
      "c",
      "d"
    ]
  }
}

colin-stubbs commented 2 months ago

Ok... actually, I may have found a way to define it. Testing.

colin-stubbs commented 2 months ago

Yeah... it's dirty, but we can just skip the field type definition and "ignore" it for the purposes of the Elastic Agent package build.

- name: bbot.discovery_path
  ignore: true

I'll give that a whirl with a full index of real scan data to see if search functionality works as expected.

TheTechromancer commented 2 months ago

Okay, let me know how it works. It's definitely not too late to change that format if we need to.

nicpenning commented 2 months ago

I would be surprised if the nested field type doesn't work.

Have you tried that?

https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html

nicpenning commented 2 months ago

Also it appears arrays of arrays should just work: https://www.elastic.co/guide/en/elasticsearch/reference/current/array.html

colin-stubbs commented 2 months ago

Hi @nicpenning, no, nested does not work. Nested implies that the list will contain objects rather than another list, so testing bombs when it discovers that there are no objects to be found.
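i.e. nested wants a list of objects, something like this (key names illustrative):

    "discovery_path": [
        { "event": "DNS_NAME:1e57014aa7b0715bca68e4f597204fc4e1e851fc", "context": "Scan devious_edna seeded with DNS_NAME: blacklanternsecurity.com" },
        { "event": "ORG_STUB:85519fabe82bd286159b7bfcf5c72139a563135b", "context": "speculated ORG_STUB: blacklanternsecurity" }
    ]

whereas bbot emits plain inner lists, so the nested mapping rejects them.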

TheTechromancer commented 2 months ago

I made a PR that separates the problematic list into two simple lists of strings: https://github.com/blacklanternsecurity/bbot/pull/1670.

Feel free to test it out and if it works well we can merge it.

nicpenning commented 2 months ago

It would probably be wise to have a discussion post or separate GH issue on 2.0 integration and http endpoint.

I imagine the changes to the data will need to be addressed in the current and future integration updates so it can handle both gracefully.

colin-stubbs commented 2 months ago

There's one other issue, with the .data.SCAN type conflicting between versions.

I've created a bug report here to consolidate: https://github.com/blacklanternsecurity/bbot/issues/1672
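The short version, roughly speaking (values illustrative): one version emits data.SCAN as a plain string, e.g.

    "data": { "SCAN": "devious_edna" }

while the other emits it as an object, e.g.

    "data": { "SCAN": { "name": "devious_edna" } }

and a single Elasticsearch mapping can't reconcile the two.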

CarsonHrusovsky commented 3 weeks ago

Any updates here @colin-stubbs ?

colin-stubbs commented 1 week ago

@CarsonHrusovsky I did a shout-out about this being available on at least one GitHub issue and on Discord, but unfortunately forgot to post it here...

A fork including the updated Elastic integration is here: https://github.com/routedlogic/integrations/tree/bbot-v2

I haven't touched it in a few weeks and do need to check whether some of techromancer's changes have been merged. If they have, there's just a slight tweak to the field types needed, and then I'll be able to submit a PR back to the official elastic integrations repo.

I landed on ensuring compatibility between bbot-v1.x and bbot-v2.x, and I've since been using my new integration in a production capacity, so I'm fairly confident I've identified all of the field changes/etc.

TheTechromancer commented 1 week ago

@colin-stubbs fyi there are a few more minor changes to event fields coming in 2.1.0:

nicpenning commented 1 week ago

That's great!

@TheTechromancer, are there any docs listing these new event fields, and every event field with its description?

TheTechromancer commented 1 week ago

Hmm, I'm not sure, but I'll make sure there are before we merge into stable.