Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

fix: remove root handlers when they exist #3128

Closed MthwRobinson closed 1 month ago

MthwRobinson commented 1 month ago

Summary

In some environments, such as Google Colab, the root logger has a pre-installed handler that does not mask sensitive values. As a result, secrets such as API keys appeared in the logs. This PR removes root handlers when they exist to ensure sensitive values are handled properly.
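
The mechanism can be illustrated with the standard `logging` module alone. The sketch below is not the code from this PR: `MaskingFormatter`, `remove_root_handlers`, `log_once`, and the logger name `demo.ingest` are hypothetical names chosen for illustration, and the masking regex is an assumption about what a masked config might look like. It simulates an environment that pre-installs a root handler, then shows that a record propagates to that handler unmasked unless the root handlers are removed first.

```python
import io
import logging
import re
from typing import Tuple


class MaskingFormatter(logging.Formatter):
    """Hypothetical formatter that hides the value of an "api_key" field."""

    def format(self, record: logging.LogRecord) -> str:
        text = super().format(record)
        return re.sub(r'("api_key":\s*")[^"]*(")', r"\1*******\2", text)


def remove_root_handlers() -> None:
    """Detach every handler currently attached to the root logger."""
    root = logging.getLogger()
    for handler in list(root.handlers):  # copy, since the list is mutated
        root.removeHandler(handler)


def log_once(strip_root: bool) -> Tuple[str, str]:
    """Log one message; return (root handler output, masked handler output)."""
    root_buf, masked_buf = io.StringIO(), io.StringIO()
    root = logging.getLogger()
    previous = list(root.handlers)
    remove_root_handlers()  # start the demo from a clean slate

    # Simulate an environment (such as a notebook) with a pre-installed root handler.
    root.addHandler(logging.StreamHandler(root_buf))

    logger = logging.getLogger("demo.ingest")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    masked_handler = logging.StreamHandler(masked_buf)
    masked_handler.setFormatter(MaskingFormatter("%(message)s"))
    logger.addHandler(masked_handler)

    if strip_root:
        remove_root_handlers()

    # The record goes to masked_handler and then propagates to any root handlers.
    logger.info('Config: {"api_key": "super secret"}')

    # Restore the logging state that existed before the demo.
    logger.removeHandler(masked_handler)
    remove_root_handlers()
    for handler in previous:
        root.addHandler(handler)
    return root_buf.getvalue(), masked_buf.getvalue()
```

Calling `log_once(strip_root=False)` yields the secret in the root handler's output and a masked value in the other, matching the two log lines described under Testing. With `strip_root=True`, the root output is empty and only the masked line remains.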

Testing

Run the following in a Colab notebook. You should see two log outputs, one with the API key masked and one with it exposed.

!pip install unstructured
import logging
import json

from unstructured.ingest.interfaces import (
    ChunkingConfig,
    EmbeddingConfig,
    PartitionConfig,
    ProcessorConfig,
    ReadConfig,
)

partition_config = PartitionConfig(
    partition_by_api=True,
    api_key="super secret",
)

from unstructured.ingest.logger import ingest_log_streaming_init
ingest_log_streaming_init(logging.INFO)

logger = logging.getLogger("unstructured.ingest")
logger.setLevel(logging.INFO)

logger.info(
    "Running partition node to extract content from json files. "
    f"Config: {partition_config.to_json()}"
)

Now replace the first cell with the following and rerun the Python code. Only the masked logging output should remain.

!git clone https://github.com/Unstructured-IO/unstructured.git && cd unstructured && git checkout fix/rm-log-dupes && pip install -e .