Closed domwhewell-sage closed 3 weeks ago
Just as a note, this causes RAW_DATA to be printed in debug logs like:
[DBUG] internal.excavate: Handling RAW_DATA("v7.3.0 (2019-05-31)
Improvements
Adds withCredentials prop to the TableAjax co...", module=unstructured, tags={'distance-1'})
I'm not sure if that's an issue.
Nice work here! @domwhewell-sage you're a machine. 🔥
To maximize CPU, can we isolate the unstructured call into its own staticmethod (including the import), and call it via self.scan.run_in_executor_mp?
As far as the logs, it's my fault for not wrapping that string in a repr(). Probably best to add it in event.__str__, to kill the newlines.
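For illustration, here is a rough sketch of what the repr() idea could look like. The Event class below is a hypothetical stand-in, not the project's real event class, and the max_len truncation is an assumption based on the "..." in the log line above.

```python
class Event:
    """Hypothetical stand-in for an event class (not the real implementation)."""

    def __init__(self, event_type, data, max_len=50):
        self.type = event_type
        self.data = data
        self.max_len = max_len  # assumed truncation length, for illustration only

    def __str__(self):
        data = str(self.data)
        if len(data) > self.max_len:
            data = data[: self.max_len] + "..."
        # repr() escapes newlines, so the event stays on a single log line
        return f"{self.type}({data!r})"


event = Event("RAW_DATA", "v7.3.0 (2019-05-31)\nImprovements\nAdds withCredentials prop")
print(event)
```

With this, a multi-line changelog is logged as one line with literal \n escapes instead of spilling across several log lines.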
On a side note, I'm beginning to notice some of unstructured's other features, like its ability to extract "visible" text from HTML. That's actually something I've been looking for for a while, as a prerequisite to extracting entities (humans/companies/locations) from webpages.
from unstructured.partition.html import partition_html
url = "https://www.cnn.com/2023/01/30/sport/empire-state-building-green-philadelphia-eagles-spt-intl/index.html"
elements = partition_html(url=url)
print("\n\n".join([str(el) for el in elements]))
No need to add this right now but still pretty exciting; opens up some neat possibilities later on.
Can you also include this part inside the function?
with open(file_path, "rb") as file:
    return file.read().decode("utf-8", errors="ignore")
This will offload as much as possible since decoding can also be CPU-intensive.
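A minimal sketch of how the two suggestions above might fit together, assuming a plain ProcessPoolExecutor as a stand-in for self.scan.run_in_executor_mp (whose exact signature isn't shown in this thread). The Extractor class and extract_in_process helper are hypothetical names, and the real unstructured call is replaced with a bare read-and-decode so the example stays self-contained.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


class Extractor:
    @staticmethod
    def extract_text(file_path):
        # Heavy imports (e.g. the unstructured library) would go here, inside
        # the function, so they are only loaded in the worker process.
        with open(file_path, "rb") as file:
            # Decoding happens in the worker too, since it can be CPU-intensive.
            return file.read().decode("utf-8", errors="ignore")


async def extract_in_process(file_path, pool):
    # Assumption: a generic process pool stands in for the scan's own helper.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(pool, Extractor.extract_text, file_path)
```

A caller would create a ProcessPoolExecutor once and await extract_in_process(path, pool) for each file. Because the function is static and takes only a path, it can be pickled and sent to a worker without dragging the module instance along.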
No worries, I've ended up moving the whole extract_text function to be static. I've also added a new test to excavate to check that a URL is extracted from the downloaded PDF.
Okay @domwhewell-sage, one last thing before we merge: what are your thoughts on changing RAW_DATA to RAW_TEXT? Am I right in thinking it would always be text, and not binary data?
I have no objections to it being RAW_TEXT, as it will always be text, not binary data. There wasn't any specific reason it was named RAW_DATA tbh.
Add unstructured as a module to extract text contents from many different file types and raise them as RAW_DATA events.
I have added FILESYSTEM and RAW_DATA to omit_event_types, as a lot of data can be raised this way; these events can flood your output files and are not particularly interesting by themselves.
The unstructured module will consume FILESYSTEM events tagged with either file or folder. Events tagged as file will go straight to the extraction function, and a RAW_DATA event will be raised. Events tagged as folder will be crawled, and interesting files (extensions) will be re-raised as files to be extracted. There is a list of ignore_folders, as crawling these folders can raise a lot of links that link back to the git_repo. (More can probably be added.)
I have also added RAW_DATA to excavate so it will extract useful tidbits from files.
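As a rough illustration of the folder handling described above, here is a sketch of a crawl that skips ignored folders and keeps only files with interesting extensions. The function name and parameters are hypothetical and only mirror the extensions and ignore_folders options mentioned in the description; this is not the module's actual code.

```python
import os


def find_interesting_files(root, extensions, ignore_folders):
    """Walk root and return files whose extension is in extensions."""
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    ignored = set(ignore_folders)
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored folders in place so os.walk never descends into them
        dirnames[:] = [d for d in dirnames if d not in ignored]
        for name in filenames:
            ext = os.path.splitext(name)[1].lower().lstrip(".")
            if ext in wanted:
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Each returned path would then be re-raised as a file-tagged event for extraction. Pruning dirnames in place keeps the crawl from ever entering an ignored folder, rather than filtering its contents afterwards.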