New Module: Unstructured

blacklanternsecurity / bbot

A recursive internet scanner for hackers.

https://www.blacklanternsecurity.com/bbot/

GNU General Public License v3.0

New Module: Unstructured #1440

Closed domwhewell-sage closed 3 weeks ago

domwhewell-sage commented 4 weeks ago

Add unstructured as a module to extract text contents from many different file types and raise them as RAW_DATA events.

I have added FILESYSTEM and RAW_DATA to omit_event_types, as a lot of data can be raised this way. These events can flood your output files and are not particularly interesting in and of themselves.

The unstructured module will consume FILESYSTEM events tagged with either file or folder. Events tagged as file will go straight to the extraction function, and the RAW_DATA will be raised. Events tagged as folder will be crawled, and interesting files (by extension) will be re-raised as files to be extracted. There is a list of ignore_folders, as crawling these folders can raise a lot of links that link back to the git_repo. (More can probably be added.)
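The folder handling described above could be sketched roughly as follows. This is a minimal illustration, not the module's actual code: the function name `find_interesting_files`, the extension set, and the `.git` entry in the ignore list are all assumptions made for the example.

```python
import os
from pathlib import Path

# Hypothetical values for illustration; the module's real lists may differ.
INTERESTING_EXTENSIONS = {".pdf", ".docx", ".txt", ".csv"}
IGNORE_FOLDERS = {".git"}


def find_interesting_files(root, extensions=INTERESTING_EXTENSIONS, ignore_folders=IGNORE_FOLDERS):
    """Yield files under `root` whose extension is in `extensions`, skipping ignored folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored folders in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d not in ignore_folders]
        for name in sorted(filenames):
            path = Path(dirpath) / name
            # Compare case-insensitively so "REPORT.PDF" matches ".pdf".
            if path.suffix.lower() in extensions:
                yield path
```

Each yielded path would then be re-raised as a FILESYSTEM event tagged file, so that it goes through the same extraction path as a directly downloaded file.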

I have also added RAW_DATA to excavate so it will extract useful tidbits from files.

domwhewell-sage commented 4 weeks ago

Just as a note, this causes RAW_DATA to be printed in debug logs like:

[DBUG] internal.excavate: Handling RAW_DATA("v7.3.0 (2019-05-31)

Improvements

Adds withCredentials prop to the TableAjax co...", module=unstructured, tags={'distance-1'})

I'm not sure if that's an issue.

TheTechromancer commented 4 weeks ago

Nice work here! @domwhewell-sage you're a machine. 🔥

To maximize CPU, can we isolate the unstructured call into its own staticmethod (including the import), and call it via self.scan.run_in_executor_mp?

As far as the logs, it's my fault for not wrapping that string in a repr(). Probably best to add it in event.__str__, to kill the newlines.
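A rough sketch of the repr() idea, assuming a stripped-down event class. The `Event` class below, its attributes, and the `max_len` truncation are hypothetical stand-ins and not BBOT's real implementation.

```python
class Event:
    """Hypothetical, stripped-down stand-in for a scan event (not BBOT's real class)."""

    def __init__(self, event_type, data, module, max_len=50):
        self.type = event_type
        self.data = data
        self.module = module
        self.max_len = max_len

    def __str__(self):
        data = str(self.data)
        # Truncate long payloads so the log line stays short.
        if len(data) > self.max_len:
            data = data[: self.max_len] + "..."
        # repr() escapes newlines as "\n", keeping the log entry on one line.
        return f"{self.type}({data!r}, module={self.module})"
```

With this, a multi-line payload such as the changelog text in the log above would be printed with literal `\n` escapes instead of breaking the debug message across several lines.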

TheTechromancer commented 4 weeks ago

On a side note, I'm beginning to notice some of unstructured's other features, like its ability to extract "visible" text from HTML. That's actually something I've been looking for for a while, as a prerequisite to extracting entities (humans/companies/locations) from webpages.

from unstructured.partition.html import partition_html

url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))

No need to add this right now but still pretty exciting; opens up some neat possibilities later on.

TheTechromancer commented 4 weeks ago

Can you also include this part inside the function?

with open(file_path, "rb") as file:
    return file.read().decode("utf-8", errors="ignore")

This will offload as much as possible since decoding can also be CPU-intensive.
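Putting the suggestions above together, a minimal sketch might look like the following. It uses the standard library's `ProcessPoolExecutor` as a stand-in for `self.scan.run_in_executor_mp`, whose exact signature is not shown in this thread. The helper name `extract_in_worker` and the broad `except` fallback are assumptions made for illustration.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def extract_text(file_path):
    """Return the text content of `file_path`. Runs entirely in the worker process."""
    try:
        # Imported inside the function so only worker processes pay the import cost.
        from unstructured.partition.auto import partition

        elements = partition(filename=str(file_path))
        return "\n\n".join(str(el) for el in elements)
    except Exception:
        # Assumed fallback (also taken if unstructured is not installed):
        # read the raw bytes and decode them leniently, still inside the worker.
        with open(file_path, "rb") as file:
            return file.read().decode("utf-8", errors="ignore")


async def extract_in_worker(file_path, pool):
    """Run extract_text in a separate process without blocking the event loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, extract_text, file_path)
```

A caller would create the pool once, for example `pool = ProcessPoolExecutor()`, and then `await extract_in_worker(path, pool)` for each file. Because `extract_text` is a plain module-level function with no reference to `self`, it can be pickled and sent to a worker process, which is the reason for making it static in the first place.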

domwhewell-sage commented 4 weeks ago

No worries, I've ended up moving the whole extract_text function to be static. I've also added a new test to excavate to check that a URL is extracted from the downloaded PDF.

TheTechromancer commented 3 weeks ago

Okay @domwhewell-sage, one last thing before we merge: what are your thoughts on changing RAW_DATA to RAW_TEXT? Am I right in thinking it would always be text, and not binary data?

domwhewell-sage commented 3 weeks ago

I have no objections to it being RAW_TEXT, as it will always be text, not binary data. There wasn't any specific reason it was named RAW_DATA tbh.