Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/execution gets stuck #3756

Open jjovalle99 opened 3 weeks ago


Hi,

We tried to parse about 5,000 documents using the Unstructured Serverless API. The code doesn't raise any error, but the execution appears to be stuck: it has made no progress in about 12 hours. Please take a look at the last lines of the logs:

INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
2024-10-15 22:39:29,883 MainProcess INFO  partition finished in 3970.023798094s, attributes: file_id=e35763b61a2b
INFO: partition finished in 3970.023798094s, attributes: file_id=e35763b61a2b
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
2024-10-15 22:39:44,163 MainProcess INFO  partition finished in 3985.031034183s, attributes: file_id=15f63456b456
INFO: partition finished in 3985.031034183s, attributes: file_id=15f63456b456
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned set #6, elements added to the final result.
INFO: Successfully partitioned set #7, elements added to the final result.
INFO: Successfully partitioned set #8, elements added to the final result.
INFO: Successfully partitioned set #9, elements added to the final result.
INFO: Successfully partitioned set #10, elements added to the final result.
INFO: Successfully partitioned set #11, elements added to the final result.
2024-10-15 22:39:55,225 MainProcess INFO  partition finished in 3995.593149836s, attributes: file_id=258eaf3414a5
INFO: partition finished in 3995.593149836s, attributes: file_id=258eaf3414a5
INFO: HTTP Request: POST https://api.unstructuredapp.io/general/v0/general "HTTP/1.1 200 OK"
INFO: Successfully partitioned set #1, elements added to the final result.
INFO: Successfully partitioned set #2, elements added to the final result.
INFO: Successfully partitioned set #3, elements added to the final result.
INFO: Successfully partitioned set #4, elements added to the final result.
INFO: Successfully partitioned set #5, elements added to the final result.
INFO: Successfully partitioned set #6, elements added to the final result.
INFO: Successfully partitioned set #7, elements added to the final result.
INFO: Successfully partitioned set #8, elements added to the final result.
INFO: Successfully partitioned set #9, elements added to the final result.
INFO: Successfully partitioned set #10, elements added to the final result.
INFO: Successfully partitioned set #11, elements added to the final result.
INFO: Successfully partitioned set #12, elements added to the final result.
INFO: Successfully partitioned set #13, elements added to the final result.
INFO: Successfully partitioned set #14, elements added to the final result.
INFO: Successfully partitioned set #15, elements added to the final result.
2024-10-15 22:39:56,256 MainProcess INFO  partition finished in 3996.584588173s, attributes: file_id=b7f5de59f8b6
INFO: partition finished in 3996.584588173s, attributes: file_id=b7f5de59f8b6

As you can see, the last log entries are from last night. Any idea why this is happening? This is the code I am using:

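# Import paths assume the unstructured-ingest v2 package layout; adjust them for your installed version.
from pydantic import BaseModel
from unstructured_ingest.v2.interfaces import ProcessorConfig
from unstructured_ingest.v2.pipeline.pipeline import Pipeline
from unstructured_ingest.v2.processes.chunker import ChunkerConfig
from unstructured_ingest.v2.processes.connectors.fsspec.gcs import (
    GcsAccessConfig,
    GcsConnectionConfig,
    GcsDownloaderConfig,
    GcsIndexerConfig,
    GcsUploaderConfig,
)
from unstructured_ingest.v2.processes.filter import FiltererConfig
from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

from src.settings import Settings  # assumed location of the project's Settings class


class DocumentParser(BaseModel):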
    def create_pipeline(self, settings: Settings) -> Pipeline:
        connection_config: GcsConnectionConfig = GcsConnectionConfig(
            access_config=GcsAccessConfig(service_account_key=settings.gcp.SERVICE_ACCOUNT_FILE),
        )

        return Pipeline.from_configs(
            context=ProcessorConfig(),
            indexer_config=GcsIndexerConfig(remote_url=f"gs://{settings.gcp.INPUT_BUCKET}", recursive=True),
            downloader_config=GcsDownloaderConfig(),
            source_connection_config=connection_config,
            filterer_config=FiltererConfig(
                file_glob=[
                    "*.pdf",
                ],
            ),
            partitioner_config=PartitionerConfig(
                strategy="hi_res",
                partition_by_api=True,
                api_key=settings.unstructured.UNSTRUCTURED_API_KEY,
                partition_endpoint=settings.unstructured.UNSTRUCTURED_API_URL,
                additional_partition_args={
                    "split_pdf_page": True,
                    "split_pdf_allow_failed": True,
                    "split_pdf_concurrency_level": 15,
                    "extract_image_block_types": ["Image", "Table"],
                },
            ),
            chunker_config=ChunkerConfig(
                chunking_strategy="by_similarity",
                chunk_by_api=True,
                chunk_api_key=settings.unstructured.UNSTRUCTURED_API_KEY,
                chunking_endpoint=settings.unstructured.UNSTRUCTURED_API_URL,
                chunk_max_characters=1024,
            ),
            uploader_config=GcsUploaderConfig(
                remote_url=f"gs://{settings.gcp.OUTPUT_BUCKET}",
            ),
            destination_connection_config=connection_config,
        )

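    def run(self, settings: Settings) -> None:
        # create_pipeline requires settings, so accept them here and pass them through.
        pipeline: Pipeline = self.create_pipeline(settings=settings)
        pipeline.run()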

if __name__ == "__main__":
    from src.settings import settings

    document_parser: DocumentParser = DocumentParser()
    pipeline: Pipeline = document_parser.create_pipeline(settings=settings)
    pipeline.run()
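
Since the run produces no error, one way to see where it hangs is to have Python dump the stack traces of all threads at a fixed interval while the pipeline runs. Below is a minimal sketch using the standard library's faulthandler module. The helper name run_with_stall_dump and the 10-minute default interval are hypothetical choices for illustration, not part of unstructured or unstructured-ingest. It only reports on the process it is installed in, so it would not show what worker processes are doing if the pipeline spawns any.

```python
import faulthandler
import sys
from typing import Callable, Optional, TextIO, TypeVar

T = TypeVar("T")


def run_with_stall_dump(
    fn: Callable[[], T],
    interval_s: float = 600.0,
    file: Optional[TextIO] = None,
) -> T:
    """Run fn() and, while it runs, dump all thread tracebacks every interval_s seconds.

    Hypothetical helper: the name and defaults are illustrative assumptions.
    """
    out = file if file is not None else sys.stderr
    # A watchdog thread writes the traceback of every thread in this process
    # to `out` each time interval_s elapses, until it is cancelled.
    faulthandler.dump_traceback_later(interval_s, repeat=True, file=out)
    try:
        return fn()
    finally:
        # Always stop the watchdog, whether fn() returned or raised.
        faulthandler.cancel_dump_traceback_later()


# Possible usage with the pipeline above (left as a comment so the sketch runs on its own):
# run_with_stall_dump(pipeline.run, interval_s=600.0)
```

If the same frames show up in consecutive dumps, that is the call the main process is blocked on, which should help tell a stalled HTTP request apart from a stalled local step.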