Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars 572 forks

Return image data from confluence #3281

Open ML-Abdula opened 5 days ago

ML-Abdula commented 5 days ago
from unstructured.ingest.connector.confluence import ConfluenceAccessConfig, SimpleConfluenceConfig
from unstructured.ingest.interfaces import PartitionConfig, ProcessorConfig, ReadConfig
from unstructured.ingest.runner import ConfluenceRunner

if __name__ == "__main__":
    runner = ConfluenceRunner(
        processor_config=ProcessorConfig(
            verbose=True,
            output_dir="confluence-ingest-output",
            num_processes=2,
        ),
        read_config=ReadConfig(),
        partition_config=PartitionConfig(
            strategy="hi_res",
            pdf_infer_table_structure=True,
            metadata_exclude=["filename", "file_directory", "metadata.data_source.date_processed"],
        ),
        connector_config=SimpleConfluenceConfig(
            access_config=ConfluenceAccessConfig(
                api_token="api-key",
            ),
            user_email="my-email",
            url="url",
        ),
    )
    runner.run()

This returns a list of JSON elements with hierarchy, but even with strategy="hi_res" and pdf_infer_table_structure=True I'm unable to access any image data. All I get is textual data, which I do need, but in my use case I'm also looking for the images from the same document.

ML-Abdula commented 5 days ago
2024-06-24 08:14:06,670 MainProcess DEBUG    updating download directory to: /root/.cache/unstructured/ingest/confluence/d78233987c
2024-06-24 08:14:06,674 MainProcess INFO     running pipeline: DocFactory -> Reader -> Partitioner -> Copier with config: {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "confluence-ingest-output2", "num_processes": 2, "raise_on_error": false}
2024-06-24 08:14:06,789 MainProcess INFO     Running doc factory to generate ingest docs. Source connector: {"processor_config": {"reprocess": false, "verbose": true, "work_dir": "/root/.cache/unstructured/ingest/pipeline", "output_dir": "confluence-ingest-output2", "num_processes": 2, "raise_on_error": false}, "read_config": {"download_dir": "/root/.cache/unstructured/ingest/confluence/d78233987c", "re_download": false, "preserve_downloads": false, "download_only": false, "max_docs": null}, "connector_config": {"user_email": "[email]", "access_config": {"api_token": "*******"}, "url": "*******", "max_num_of_spaces": 500, "max_num_of_docs_from_each_space": 100, "spaces": []}, "_confluence": null}
2024-06-24 08:14:21,820 MainProcess INFO     processing 155 docs via 2 processes
2024-06-24 08:14:21,879 MainProcess INFO     Calling Reader with 155 docs
2024-06-24 08:14:21,880 MainProcess INFO     Running source node to download data associated with ingest docs
2024-06-24 08:14:57,880 MainProcess INFO     Calling Partitioner with 155 docs
2024-06-24 08:14:57,882 MainProcess INFO     Running partition node to extract content from json files. Config: {"pdf_infer_table_structure": true, "strategy": "hi_res", "ocr_languages": null, "encoding": null, "additional_partition_args": {}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": ["filename", "file_directory", "metadata.data_source.date_processed"], "metadata_include": [], "partition_endpoint": "https://api.unstructured.io/general/v0/general", "partition_by_api": false, "api_key": "*******", "hi_res_model_name": null}, partition kwargs: {}]
2024-06-24 08:14:57,888 MainProcess INFO     Creating /root/.cache/unstructured/ingest/pipeline/partitioned
2024-06-24 08:15:00,732 MainProcess INFO     Calling Copier with 155 docs
2024-06-24 08:15:00,734 MainProcess INFO     Running copy node to move content to desired output location

ML-Abdula commented 4 days ago

@christinestraub @scanny anyone who can help me on this?

christinestraub commented 3 days ago

This returns a list of json with hierarchy but even with hi_res and pdf_infer_table_structure=True I'm unable to access any image data. All I get is textual data which is required but in my use case I'm also looking for images from same document

@ML-Abdula Do you mean you're unable to get any elements with category "Image" in the returned json? Can you please share the document you're trying to process?
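
One quick way to answer that question is to tally the "type" field across the elements in an output file. The sketch below assumes each output file is a JSON list of element dicts with a "type" key, which matches the fields_include list shown in the partition log above. count_element_types and count_types_in_file are hypothetical helpers written for illustration, not part of unstructured, and the sample data is made up.

```python
import json
from collections import Counter


def count_element_types(elements):
    """Tally elements by their "type" field; elements without one count as "Unknown"."""
    return Counter(el.get("type", "Unknown") for el in elements)


def count_types_in_file(path):
    """Load one output JSON file (assumed to be a list of element dicts) and tally it."""
    with open(path, encoding="utf-8") as f:
        return count_element_types(json.load(f))


# Illustrative data only, not real output from the Confluence connector.
sample = [
    {"type": "Title", "text": "Release notes"},
    {"type": "NarrativeText", "text": "Some paragraph."},
    {"type": "NarrativeText", "text": "Another paragraph."},
]
counts = count_element_types(sample)
print(counts)
print("Image elements:", counts.get("Image", 0))
```

Running count_types_in_file over each file in the output directory would show whether any "Image" elements are present at all.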

scanny commented 3 days ago

@ML-Abdula Confluence content is web pages, right? So Confluence "documents" would go to partition_html().

HTML does not embed images; rather, it contains <img src=...> "links" to images. partition_html() does not currently traverse those links to download images. Pretty sure the reason for that is the security risk inherent in downloading arbitrary image files.

So I think that explains why no Image elements are present in the output for the Confluence connector. You could suggest an enhancement. Perhaps there's a way to let you download the images yourself or perhaps to identify trusted zones or something. That should be in a separate issue though so it can be discussed independently.
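
As a rough illustration of the "download the images yourself" idea, the sketch below collects the image URLs referenced by a page's HTML using only the Python standard library. It is not something unstructured provides: ImgSrcCollector and find_image_urls are hypothetical names, the example URL is a placeholder, and how you obtain the page HTML from Confluence is left open. The actual download step is deliberately omitted, since which hosts to trust is exactly the open question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:  # skip <img> tags that have no usable src
            self.urls.append(urljoin(self.base_url, src))


def find_image_urls(html, base_url):
    """Return absolute URLs for all images referenced in the given HTML string."""
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    return collector.urls


# Placeholder HTML and base URL, for illustration only.
page_html = (
    '<p>Diagram:</p>'
    '<img src="/download/attachments/1/arch.png" alt="arch">'
    '<img alt="no source">'
)
urls = find_image_urls(page_html, "https://example.atlassian.net/wiki/")
print(urls)
```

Before fetching anything from the resulting list, it would make sense to filter it down to hosts you trust, for example your own Confluence domain.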