Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.65k stars 704 forks

feat/databricks volumes src #3171

Closed rbiseck3 closed 1 month ago

rbiseck3 commented 3 months ago

Description

This adds a source connector that lists and downloads files from Unity Catalog (UC) Volumes using the DBFS utilities in the Databricks SDK. Example:

import os
from pathlib import Path

from unstructured.ingest.v2.processes.connectors.databricks_volumes import (
    DatabricksVolumesAccessConfig,
    DatabricksVolumesConnectionConfig,
    DatabricksVolumesDownloader,
    DatabricksVolumesDownloaderConfig,
    DatabricksVolumesIndexer,
    DatabricksVolumesIndexerConfig,
)

connection_configs = DatabricksVolumesConnectionConfig(
    host=os.getenv("DATABRICKS_HOST"),
    access_config=DatabricksVolumesAccessConfig(
        token=os.getenv("DATABRICKS_TOKEN"),
    ),
)
indexer = DatabricksVolumesIndexer(
    connection_config=connection_configs,
    index_config=DatabricksVolumesIndexerConfig(
        remote_url="/Volumes/unstructured_solutions/unstructured_test_schema/unstructured-volume"
    ),
)
downloader = DatabricksVolumesDownloader(
    connection_config=connection_configs,
    download_config=DatabricksVolumesDownloaderConfig(
        download_dir=Path("/Users/romanisecke/Downloads/databricks-download")
    ),
)

# List the files in the volume, then download each one.
for f in indexer.run():
    downloader.run(file_data=f)
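
One practical note on the example above: `os.getenv` returns `None` when a variable is unset, so a missing `DATABRICKS_HOST` or `DATABRICKS_TOKEN` would only surface later, when the connector tries to connect. A minimal sketch of a fail-fast check follows. The helper name `require_env` is hypothetical and not part of this PR or of `unstructured`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`.

    Hypothetical helper (not part of unstructured): raises instead of
    silently returning None when the variable is unset or empty.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

With this in place, the connection config could use `host=require_env("DATABRICKS_HOST")` and `token=require_env("DATABRICKS_TOKEN")`, so a missing credential is reported before any indexing starts.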