Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
8.76k stars, 716 forks

bug/API Fails out of the box ingesting a PDF - "File type application/octet-stream is not supported" #3673

Open ReMuSoMeGA93 opened 2 weeks ago

ReMuSoMeGA93 commented 2 weeks ago

Describe the bug

    2024-09-27 16:32:35,949 MainProcess INFO running local pipeline: index (LocalIndexer) -> download (LocalDownloader) -> partition (hi_res) -> upload (LocalUploader) with configs: {"reprocess": false, "verbose": false, "tqdm": false, "work_dir": "/Users/v/.cache/unstructured/ingest/pipeline", "num_processes": 2, "max_connections": null, "raise_on_error": false, "disable_parallelism": false, "preserve_downloads": false, "download_only": false, "re_download": false, "uncompress": false, "iter_delete": false, "delete_cache": false, "otel_endpoint": null, "status": {}}
    2024-09-27 16:32:36,113 MainProcess INFO index finished in 0.000168s
    2024-09-27 16:32:36,124 MainProcess INFO calling DownloadStep with 2 docs
    2024-09-27 16:32:36,127 MainProcess INFO processing content async
    2024-09-27 16:32:36,128 MainProcess WARNING async code being run in dedicated thread pool to not conflict with existing event loop: <_UnixSelectorEventLoop running=True closed=False debug=False>
    2024-09-27 16:32:36,135 MainProcess INFO download finished in 0.004348s, attributes: file_id=d5e17f5c294b
    2024-09-27 16:32:36,139 MainProcess INFO download finished in 0.002644s, attributes: file_id=45dc3406774d
    2024-09-27 16:32:36,140 MainProcess INFO download step finished in 0.0167s
    2024-09-27 16:32:36,141 MainProcess INFO calling PartitionStep with 2 docs
    2024-09-27 16:32:36,142 MainProcess INFO processing content async
    2024-09-27 16:32:36,143 MainProcess WARNING async code being run in dedicated thread pool to not conflict with existing event loop: <_UnixSelectorEventLoop running=True closed=False debug=False>
    INFO: Preparing to split document for partition.
    INFO: Given file doesn't have '.pdf' extension, so splitting is not enabled.
    INFO: Partitioning without split.
    2024-09-27 16:32:36,158 MainProcess INFO partition finished in 0.007005s, attributes: file_id=45dc3406774d
    INFO: partition finished in 0.007005s, attributes: file_id=45dc3406774d
    ERROR: Failed to partition the document.
    2024-09-27 16:32:36,748 MainProcess INFO partition finished in 0.602655s, attributes: file_id=d5e17f5c294b
    INFO: partition finished in 0.602655s, attributes: file_id=d5e17f5c294b
    2024-09-27 16:32:36,750 MainProcess ERROR Exception raised while running partition
    Traceback (most recent call last):
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/pipeline/interfaces.py", line 171, in run_async
        return await self._run_async(fn=fn, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/pipeline/steps/partition.py", line 66, in _run_async
        partitioned_content = await fn(**fn_kwargs)
                              ^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/processes/partitioner.py", line 222, in run_async
        return await self.partition_via_api(filename, metadata=metadata, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/utils/dep_check.py", line 50, in wrapper_async
        return await func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/processes/partitioner.py", line 209, in partition_via_api
        resp = await self.call_api(client=client, request=partition_request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/processes/partitioner.py", line 161, in call_api
        return await loop.run_in_executor(None, client.general.partition, request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 291, in __await__
        yield self  # This tells Task to wait for completion.
        ^^^^^^^^^^
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
        future.result()
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 203, in result
        raise self._exception.with_traceback(self._exception_tb)
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_client/general.py", line 102, in partition
        raise errors.SDKError('API error occurred', http_res.status_code, http_res.text, http_res)
    unstructured_client.models.errors.sdkerror.SDKError: API error occurred: Status 400 {"detail": "File type application/octet-stream is not supported."}
    ERROR: Exception raised while running partition
    Traceback (most recent call last):
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/pipeline/interfaces.py", line 171, in run_async
        return await self._run_async(fn=fn, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/pipeline/steps/partition.py", line 66, in _run_async
        partitioned_content = await fn(**fn_kwargs)
                              ^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/processes/partitioner.py", line 222, in run_async
        return await self.partition_via_api(filename, metadata=metadata, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/utils/dep_check.py", line 50, in wrapper_async
        return await func(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/processes/partitioner.py", line 209, in partition_via_api
        resp = await self.call_api(client=client, request=partition_request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/processes/partitioner.py", line 161, in call_api
        return await loop.run_in_executor(None, client.general.partition, request)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 291, in __await__
        yield self  # This tells Task to wait for completion.
        ^^^^^^^^^^
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/tasks.py", line 385, in __wakeup
        future.result()
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/asyncio/futures.py", line 203, in result
        raise self._exception.with_traceback(self._exception_tb)
      File "/usr/local/Cellar/python@3.12/3.12.6/Frameworks/Python.framework/Versions/3.12/lib/python3.12/concurrent/futures/thread.py", line 58, in run
        result = self.fn(*self.args, **self.kwargs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      File "/Users/v/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_client/general.py", line 102, in partition
        raise errors.SDKError('API error occurred', http_res.status_code, http_res.text, http_res)
    unstructured_client.models.errors.sdkerror.SDKError: API error occurred: Status 400 {"detail":"File type application/octet-stream is not supported."}
    2024-09-27 16:32:36,761 MainProcess INFO partition step finished in 0.619789s
    INFO: partition step finished in 0.619789s
    2024-09-27 16:32:36,763 MainProcess INFO calling UploadStep with 1 docs
    INFO: calling UploadStep with 1 docs
    2024-09-27 16:32:36,765 MainProcess INFO processing content across processes
    INFO: processing content across processes
    2024-09-27 16:32:36,767 MainProcess INFO processing content serially
    INFO: processing content serially
    2024-09-27 16:32:36,769 MainProcess WARNING async code being run in dedicated thread pool to not conflict with existing event loop: <_UnixSelectorEventLoop running=True closed=False debug=False>
    WARNING: async code being run in dedicated thread pool to not conflict with existing event loop: <_UnixSelectorEventLoop running=True closed=False debug=False>
    2024-09-27 16:32:36,781 MainProcess INFO upload finished in 0.008855s, attributes: file_id=45dc3406774d
    INFO: upload finished in 0.008855s, attributes: file_id=45dc3406774d
    2024-09-27 16:32:36,783 MainProcess INFO upload finished in 0.01359s, attributes: file_id=45dc3406774d
    INFO: upload finished in 0.01359s, attributes: file_id=45dc3406774d
    2024-09-27 16:32:36,784 MainProcess INFO upload step finished in 0.021585s
    INFO: upload step finished in 0.021585s
    2024-09-27 16:32:36,786 MainProcess INFO ingest process finished in 0.836632s
    INFO: ingest process finished in 0.836632s
    2024-09-27 16:32:36,790 MainProcess ERROR 1 failed documents:
    ERROR: 1 failed documents:
    2024-09-27 16:32:36,791 MainProcess ERROR /Users/v/.cache/unstructured/ingest/pipeline/index/d5e17f5c294b.json: [partition] API error occurred: Status 400 {"detail": "File type application/octet-stream is not supported."}
    ERROR: /Users/v/.cache/unstructured/ingest/pipeline/index/d5e17f5c294b.json: [partition] API error occurred: Status 400 {"detail":"File type application/octet-stream is not supported."}

    PipelineError                             Traceback (most recent call last)
    Cell In[31], line 1
    ----> 1 pipe.run()

    File ~/Developer/medgraph/.env12/lib/python3.12/site-packages/unstructured_ingest/v2/pipeline/pipeline.py:143, in Pipeline.run(self)
        141 self.cleanup()
        142 if self.context.status:
    --> 143 raise PipelineError("Pipeline did not run successfully")

    PipelineError: Pipeline did not run successfully

To Reproduce

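    from unstructured_ingest.v2.interfaces import ProcessorConfig
    from unstructured_ingest.v2.pipeline.pipeline import Pipeline
    from unstructured_ingest.v2.processes.connectors.local import (
        LocalConnectionConfig,
        LocalDownloaderConfig,
        LocalIndexerConfig,
        LocalUploaderConfig,
    )
    from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

    # api_key and api_url are defined earlier in the notebook (not shown).
    pipe = Pipeline.from_configs(
        context=ProcessorConfig(),
        indexer_config=LocalIndexerConfig(input_path='/Users/v/Developer/medgraph/pdfs/'),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key=api_key,
            partition_endpoint=api_url,
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 15
            }
        ),
        uploader_config=LocalUploaderConfig(output_dir='/Users/v/Developer/medgraph/output')
    )

    # Run the pipeline as a separate call, as in the traceback above (`pipe.run()`).
    pipe.run()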

My folder contains one single PDF file.
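The log reports `calling DownloadStep with 2 docs`, although the folder is described as holding a single PDF. One possible explanation, not confirmed in this thread, is that a second, non-PDF file (for example a hidden one) sits in the input directory. A minimal diagnostic sketch, using only the Python standard library, lists every file under a directory together with a MIME type guessed from its name and whether it starts with the usual `%PDF-` header. The helper name `describe_files` is hypothetical, and the `application/octet-stream` fallback is just a label chosen here for "unknown type"; none of this is part of `unstructured_ingest`.

```python
import mimetypes
from pathlib import Path


def describe_files(input_dir):
    """Return (name, guessed MIME type, starts with PDF header) per file.

    Hypothetical helper for illustration; hidden files are included
    because pathlib's rglob("*") also matches dotfiles.
    """
    rows = []
    for path in sorted(Path(input_dir).rglob("*")):
        if not path.is_file():
            continue
        # Guess from the file name only, as many upload clients do.
        guessed, _ = mimetypes.guess_type(path.name)
        # PDF files normally begin with the bytes "%PDF-".
        with path.open("rb") as fh:
            has_pdf_header = fh.read(5) == b"%PDF-"
        rows.append((path.name, guessed or "application/octet-stream", has_pdf_header))
    return rows


if __name__ == "__main__":
    for name, mime, has_pdf_header in describe_files("/Users/v/Developer/medgraph/pdfs/"):
        print(f"{name!r}: guessed={mime}, pdf_header={has_pdf_header}")
```

Any entry that is not a real PDF (no `.pdf` name, no PDF header) would be a candidate for the document that the API rejected.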

Expected behavior

It should work; this code came directly from your documentation.

Environment Info

Python 3.12.6, Jupyter Notebook


shterjovad commented 1 week ago

I get the same error, but when I check the output_dir, the processed files are there.
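This matches the log in the original report, which ends with `1 failed documents` while `UploadStep` still ran for the other document. A rough sketch for finding which inputs produced no output follows. It assumes, without confirmation from this thread, that the local uploader writes one `<input file name>.json` per processed document; the helper name `missing_outputs` and the `suffix` parameter are hypothetical, so adjust the suffix to whatever naming the output directory actually shows.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir, suffix=".json"):
    """Return sorted input file names with no '<name><suffix>' file in output_dir.

    Hypothetical helper; the '<name>.json' naming is an assumption.
    """
    produced = {p.name for p in Path(output_dir).rglob("*") if p.is_file()}
    return sorted(
        p.name
        for p in Path(input_dir).rglob("*")
        if p.is_file() and p.name + suffix not in produced
    )
```

For example, `missing_outputs("/Users/v/Developer/medgraph/pdfs/", "/Users/v/Developer/medgraph/output")` would list the files that still need attention.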

ReMuSoMeGA93 commented 1 week ago

It's still failing for me for most PDFs. Definitely not a reliable solution at scale.

Luckily, I found that the Adobe PDF Extract API is MUCH better and actually works, in case anyone else is struggling with this.