Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.25k stars 669 forks source link

feat/Decouple Partitioning User API and Implementation #1521

Closed newelh closed 6 days ago

newelh commented 11 months ago

Description

The user API is tightly bound to the internal implementation of the partitioners. As a result, changes in keyword arguments or function behavior immediately impact the user API. Users have direct access to the implementation functions, as indicated in the current documentation. This tight coupling restricts our flexibility to refactor or enhance the partitioners without introducing breaking changes.

The proposed changes are intended to increase the speed of development and eventually provide the ability to introduce new user facing APIs in the future.

Proposed Solution

Introduce an abstraction layer between the user API and the partitioner implementations. This layer would expose only the necessary functionalities and delegate to the appropriate partitioners. This decoupling allows for more flexibility in internal changes without affecting the user API. A phased approach would be needed, where this issue would be the first step of the following:

  1. Move the contents of the partition function for each partitioner into a private function _partition_{type}. For example, the email partitioner. While this seems trivial, it will enable us to further develop the partitioners without obstructing APIs that users may have in production. It will enable us to simplify and consolidate functionality in the partitioners so that we can eventually support a simpler interface for users in the future.
    
    @process_metadata()
    @add_metadata_with_filetype(FileType.EML)
    @add_chunking_strategy()
    def partition_email(
    filename: Optional[str] = None,
    file: Optional[Union[IO[bytes], SpooledTemporaryFile]] = None,
    text: Optional[str] = None,
    content_source: str = "text/html",
    encoding: Optional[str] = None,
    include_headers: bool = False,
    max_partition: Optional[int] = 1500,
    include_metadata: bool = True,
    metadata_filename: Optional[str] = None,
    metadata_last_modified: Optional[str] = None,
    process_attachments: bool = False,
    attachment_partitioner: Optional[Callable] = None,
    min_partition: Optional[int] = 0,
    chunking_strategy: Optional[str] = None,
    **kwargs,
    ) -> List[Element]:
    """ Wrap the _partition_email function to separate user facing API from internal API"""
    return _partition_email(
        filename=filename,
        file=file,
        text=text,
        content_source=content_source,
        encoding=encoding,
        include_headers=include_headers,
        max_partition=max_partition,
        include_metadata=include_metadata,
        metadata_filename=metadata_filename,
        metadata_last_modified=metadata_last_modified,
        process_attachments=process_attachments,
        attachment_partitioner=attachment_partitioner,
        min_partition=min_partition,
        chunking_strategy=chunking_strategy,
        **kwargs,
    )

Follow-up phases/issues after completing phase 1:
2. Extract file reading into it's own function that passes a stream to the private partition functions
3. Move metadata processing from a decorator to a function call before/after partitioning. This allows us to support preprocessing steps for handling user input (e.g. language) and post-processing steps for consolidating processed element data (e.g. hierarchy, chunking) in a decoupled interface that can be better unit tested. 
4. Move chunking from a decorator to a function call after partitioning
5. Write a new user facing api for accessing the simpler, decoupled components

All of this can be done incrementally, per partitioner. The end goal is to reduce the complexity of the keyword arguments in the user-facing API and reduce the coupling of downstream components for further feature development. In particular for document pre/post processing: metadata such as language and hierarchy, chunking, and document level metadata such as encoding, content source.

## Alternatives
Do Nothing: Leave it as it is and continue with the tightly coupled system, accepting the maintenance and extensibility costs.

## Additional Context
Unstructured already has an abstraction layer to some degree with the auto partitioner, the documentation does advocate though for using document-specific partitioners for increased speed, fewer dependencies and additional features. 
scanny commented 11 months ago

A couple very nice concrete benefits of this that I see:

  1. Ease unit-testing. Decorators complicate unit testing because the "inner" function cannot be isolated from the decorator function. So the only alternative is to test the entire "glued-together" composition. Having a facade API function allows the core partitioner to be tested in isolation from the decorated API function.
  2. Allow flexible composition of implementation from distinct units. An API function naturally tends to become "fat" with keyword options as functionality is added. This is good for the user (if designed carefully) providing them a lot of flexibility. However, not all options are necessarily used by all parts of the implemention. We can see this already where some kwargs are used by decorators only. A distinct API function allows the implementation to be flexibly composed and recomposed from a growing set of distinct "step" or "aspect" implementor functions, each taking only the arguments they require. This improves our ability to respond to newly added function without struggling with a rigid implementation.
  3. Provide code-tree shape flexibility. As partitioner function modules begin to approach 1000 lines, it becomes time to think about transitioning each into a subpackage (directory). The distinct and smallish API function can be placed in the init.py module and the supporting implementation distributed to submodules in that directory.

I think other benefits are going to occur to me, but that's a start :)

newelh commented 11 months ago

Related: #1520

newelh commented 11 months ago

Related PR #1527