Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/application/octet-stream not supported #3677

Open jeremydiba opened 1 month ago

jeremydiba commented 1 month ago

Describe the bug: When calling the API with a .tif file, I received the error "detail": "File type application/octet-stream is not supported."

To Reproduce: My code snippet (showing the parameters being passed):

    from unstructured_client.models import operations, shared


    # Hypothetical wrapper: the original snippet omitted the enclosing function.
    # `client` is assumed to be an already-configured UnstructuredClient.
    def partition_file(client, data, filename, **kwargs):
        # Fill in defaults for any partition parameters the caller did not pass.
        if "strategy" not in kwargs:
            kwargs["strategy"] = "auto"
        if "chunking_strategy" not in kwargs:
            kwargs["chunking_strategy"] = "by_title"
        if "combine_under_n_chars" not in kwargs and kwargs["chunking_strategy"] == "by_title":
            kwargs["combine_under_n_chars"] = 500
        if "coordinates" not in kwargs:
            kwargs["coordinates"] = True
        if "languages" not in kwargs:
            kwargs["languages"] = ["eng"]
        if "max_characters" not in kwargs:
            kwargs["max_characters"] = 4000
        if "unique_element_ids" not in kwargs:
            kwargs["unique_element_ids"] = False
        # PDF-only options; note this matches "pdf" anywhere in the filename.
        if "pdf" in filename and "split_pdf_page" not in kwargs:
            kwargs["split_pdf_page"] = True
        if "pdf" in filename and "split_pdf_concurrency_level" not in kwargs:
            kwargs["split_pdf_concurrency_level"] = 10
        if "include_orig_elements" not in kwargs:
            kwargs["include_orig_elements"] = True

        # Send the raw bytes and filename; no content type is set explicitly.
        req = operations.PartitionRequest(
            partition_parameters=shared.PartitionParameters(
                files=shared.Files(
                    content=data,
                    file_name=filename,
                ),
                **kwargs)
        )

        res = client.general.partition(request=req)
        return res
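
For triage, it may help to see which MIME type the filename alone implies. The sketch below uses only Python's standard `mimetypes` module. The helper name `guess_content_type` and the constant `FALLBACK_TYPE` are hypothetical and not part of the unstructured client. It does not reproduce how the API detects file types. It only shows that a `.tif` extension maps to `image/tiff` in Python's built-in table, while an unknown extension falls back to the generic type named in the error.

```python
import mimetypes

# Generic fallback type; this is the type named in the error message above.
FALLBACK_TYPE = "application/octet-stream"


def guess_content_type(filename: str) -> str:
    """Guess a MIME type from the filename extension (hypothetical helper)."""
    guessed, _encoding = mimetypes.guess_type(filename)
    # guess_type returns None for the type when the extension is unknown.
    return guessed or FALLBACK_TYPE


if __name__ == "__main__":
    for name in ("scan.tif", "report.pdf", "blob.zzzunknown"):
        print(name, "->", guess_content_type(name))
```

If the helper returns a specific type for the file but the API still reports `application/octet-stream`, the type is probably being lost or not detected somewhere between the client and the server. Whether the SDK or API accepts an explicit content type alongside the file is an assumption to verify against the client documentation.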

Expected behavior: The document is parsed.


Environment Info: Python 3.11 runtime
