Open shriharshan opened 2 months ago
Hi @shriharshan, I'd suggest 1/ lowering the concurrency level, say from 10 to 5:
    "split_pdf_concurrency_level": 5
assuming the requests overloaded the API VMs available to serve responses. There are other ways to address the issue, though:
2/ Increase the minimum number of API VMs running in the Auto Scaling group so that the needed concurrency can be handled immediately. Please be aware that since the product is charged by instance hour, this would result in a higher cost per hour.
3/ Increase the retry timeout parameters mentioned in the README (https://github.com/Unstructured-IO/unstructured-python-client?tab=readme-ov-file#retries), with something like:
    # Assumed import: `utils` refers to the client's utils module.
    from unstructured_client import utils

    retry_config = utils.RetryConfig(
        "backoff",
        utils.BackoffStrategy(
            initial_interval=3000,  # 3 seconds
            max_interval=1000 * 60 * 12,  # 12 minutes
            exponent=1.88,
            max_elapsed_time=1000 * 60 * 60,  # 60 minutes
        ),
        retry_connection_errors=True,
    )
This would allow more time for a fixed pool of VMs to handle the requests, or give autoscaling in the Auto Scaling group time to take effect.
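To get a feel for what those numbers mean, here is a rough sketch that prints the waits a capped exponential backoff would produce. It assumes the usual formula (wait = initial_interval * exponent^attempt, capped at max_interval, stopping once max_elapsed_time would be exceeded) and ignores any random jitter the client may add. `backoff_schedule` is a hypothetical helper written for illustration, not part of the client library.

```python
def backoff_schedule(initial_interval, max_interval, exponent, max_elapsed_time):
    """Return the waits (in ms) between retries for a capped exponential backoff.

    Illustrative only: assumes no jitter and that only waiting time counts
    toward max_elapsed_time.
    """
    if initial_interval <= 0 or max_interval <= 0 or exponent < 1:
        raise ValueError("intervals must be positive and exponent must be >= 1")

    waits = []
    elapsed = 0.0
    attempt = 0
    while True:
        # Grow the wait exponentially, but never beyond max_interval.
        wait = min(initial_interval * exponent ** attempt, max_interval)
        # Stop once another wait would exceed the total time budget.
        if elapsed + wait > max_elapsed_time:
            break
        waits.append(wait)
        elapsed += wait
        attempt += 1
    return waits


if __name__ == "__main__":
    # Same values as the retry_config suggested above (all in milliseconds).
    schedule = backoff_schedule(
        initial_interval=3000,
        max_interval=1000 * 60 * 12,
        exponent=1.88,
        max_elapsed_time=1000 * 60 * 60,
    )
    for i, wait in enumerate(schedule, start=1):
        print(f"retry {i}: wait {wait / 1000:.1f} s")
    print(f"total waiting time: {sum(schedule) / 60000:.1f} min")
```

The point is just that the early retries happen within seconds, while the later ones are spaced minutes apart, which is the window in which new instances could come up.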
I am using the Unstructured API_URL from AWS Marketplace, and when I try to extract data from the files, it gives me a 502 Bad Gateway error. I set up everything in AWS according to the documentation.
    import os

    from unstructured_ingest.v2.pipeline.pipeline import Pipeline
    from unstructured_ingest.v2.interfaces import ProcessorConfig
    from unstructured_ingest.v2.processes.connectors.local import (
        LocalIndexerConfig,
        LocalDownloaderConfig,
        LocalConnectionConfig,
        LocalUploaderConfig,
    )
    from unstructured_ingest.v2.processes.partitioner import PartitionerConfig

    Pipeline.from_configs(
        context=ProcessorConfig(tqdm=True),
        indexer_config=LocalIndexerConfig(
            input_path="/home/shriharshan/Dynamic_Rag-main/Data",
            recursive=True,
        ),
        downloader_config=LocalDownloaderConfig(),
        source_connection_config=LocalConnectionConfig(),
        partitioner_config=PartitionerConfig(
            partition_by_api=True,
            api_key="",
            partition_endpoint="API_URL",
            strategy="hi_res",
            additional_partition_args={
                "split_pdf_page": True,
                "split_pdf_allow_failed": True,
                "split_pdf_concurrency_level": 10,
            },
        ),
        uploader_config=LocalUploaderConfig(output_dir="Output"),
    ).run()
2024-09-23 17:32:34,183 MainProcess INFO created index with configs: {"input_path": "/home/shriharshan/Dynamic_Rag-main/Data", "recursive": true}, connection configs: {"access_config": "**"} 2024-09-23 17:32:34,186 MainProcess INFO Created download with configs: {"download_dir": null}, connection configs: {"access_config": "**"} 2024-09-23 17:32:34,187 MainProcess INFO created partition with configs: {"strategy": "hi_res", "ocr_languages": null, "encoding": null, "additional_partition_args": {"split_pdf_page": true, "split_pdf_allow_failed": true, "split_pdf_concurrency_level": 10}, "skip_infer_table_types": null, "fields_include": ["element_id", "text", "type", "metadata", "embeddings"], "flatten_metadata": false, "metadata_exclude": [], "metadata_include": [], "partition_endpoint": "API_URL", "partition_by_api": true, "api_key": "", "hi_res_model_name": null} 2024-09-23 17:32:34,188 MainProcess INFO Created upload with configs: {"output_dir": "Output"}, connection configs: {"access_config": "**"} 2024-09-23 17:32:34,189 MainProcess INFO running local pipeline: index (LocalIndexer) -> download (LocalDownloader) -> partition (hi_res) -> upload (LocalUploader) with configs: {"reprocess": false, "verbose": false, "tqdm": true, "work_dir": "/home/shriharshan/.cache/unstructured/ingest/pipeline", "num_processes": 2, "max_connections": null, "raise_on_error": false, "disable_parallelism": false, "preserve_downloads": false, "download_only": false, "re_download": false, "uncompress": false, "iter_delete": false, "delete_cache": false, "otel_endpoint": null, "status": {}} 2024-09-23 17:32:34,209 MainProcess INFO index finished in 0.000119656s 2024-09-23 17:32:34,213 MainProcess WARNING Couldn't detect date created: 'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,219 MainProcess WARNING Couldn't detect date created: 'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,222 MainProcess WARNING Couldn't detect date created: 
'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,224 MainProcess WARNING Couldn't detect date created: 'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,226 MainProcess WARNING Couldn't detect date created: 'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,229 MainProcess WARNING Couldn't detect date created: 'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,231 MainProcess WARNING Couldn't detect date created: 'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,234 MainProcess WARNING Couldn't detect date created: 'os.stat_result' object has no attribute 'st_birthtime' 2024-09-23 17:32:34,237 MainProcess INFO calling DownloadStep with 8 docs 2024-09-23 17:32:34,238 MainProcess INFO processing content async 2024-09-23 17:32:34,239 MainProcess WARNING async code being run in dedicated thread pool to not conflict with existing event loop: <_UnixSelectorEventLoop running=True closed=False debug=False> download: 0%| | 0/8 [00:00<?, ?it/s]2024-09-23 17:32:34,248 MainProcess INFO download finished in 0.002884459s, attributes: file_id=537ea6e7ba3d 2024-09-23 17:32:34,252 MainProcess INFO download finished in 0.003342605s, attributes: file_id=6ce504f445a3 2024-09-23 17:32:34,258 MainProcess INFO download finished in 0.002855875s, attributes: file_id=a2af29c24895 2024-09-23 17:32:34,261 MainProcess INFO download finished in 0.002308752s, attributes: file_id=28d75cffc4b9 2024-09-23 17:32:34,265 MainProcess INFO download finished in 0.002868329s, attributes: file_id=111eb8aa15ee 2024-09-23 17:32:34,271 MainProcess INFO download finished in 0.004132718s, attributes: file_id=b6456f1834a2 2024-09-23 17:32:34,276 MainProcess INFO download finished in 0.002465007s, attributes: file_id=f75fd99fc781 2024-09-23 17:32:34,279 MainProcess INFO download finished in 0.002334571s, attributes: file_id=e9c24f7bc343 download: 100%|██████████| 8/8 [00:00<00:00, 
221.89it/s] 2024-09-23 17:32:34,283 MainProcess INFO download step finished in 0.045320991s 2024-09-23 17:32:34,284 MainProcess INFO calling PartitionStep with 8 docs 2024-09-23 17:32:34,286 MainProcess INFO processing content async 2024-09-23 17:32:34,287 MainProcess WARNING async code being run in dedicated thread pool to not conflict with existing event loop: <_UnixSelectorEventLoop running=True closed=False debug=False> partition: 0%| | 0/8 [00:00<?, ?it/s]INFO: Preparing to split document for partition. INFO: Starting page number set to 1 INFO: Allow failed set to 1 INFO: Preparing to split document for partition. INFO: Concurrency level set to 10 INFO: Starting page number set to 1 INFO: Allow failed set to 1 INFO: Concurrency level set to 10 INFO: Splitting pages 1 to 21 (21 total) INFO: Splitting pages 1 to 3 (3 total) INFO: Determined optimal split size of 3 pages. INFO: Preparing to split document for partition. INFO: Determined optimal split size of 2 pages. INFO: Partitioning 7 files with 3 page(s) each. INFO: Starting page number set to 1 INFO: Partitioning 1 files with 2 page(s) each. INFO: Preparing to split document for partition. INFO: Allow failed set to 1 INFO: Partitioning 1 file with 1 page(s). INFO: Partitioning set #1 (pages 1-3). INFO: Starting page number set to 1 INFO: Concurrency level set to 10 INFO: Preparing to split document for partition. INFO: Partitioning set #1 (pages 1-2). INFO: Partitioning set #2 (pages 4-6). INFO: Allow failed set to 1 INFO: Splitting pages 1 to 19 (19 total) INFO: Starting page number set to 1 INFO: Partitioning set #2 (pages 3-3). INFO: Partitioning set #3 (pages 7-9). INFO: Preparing to split document for partition. INFO: Concurrency level set to 10 INFO: Determined optimal split size of 2 pages. INFO: Allow failed set to 1 INFO: Partitioning set #4 (pages 10-12). INFO: Preparing to split document for partition. INFO: Starting page number set to 1 INFO: Partitioning 9 files with 2 page(s) each. 
INFO: Splitting pages 1 to 19 (19 total) INFO: Concurrency level set to 10 INFO: Partitioning set #5 (pages 13-15). INFO: Allow failed set to 1 INFO: Starting page number set to 1 INFO: Partitioning 1 file with 1 page(s). INFO: Preparing to split document for partition. INFO: Determined optimal split size of 2 pages. INFO: Splitting pages 1 to 19 (19 total) INFO: Partitioning set #6 (pages 16-18). INFO: Concurrency level set to 10 INFO: Allow failed set to 1 INFO: Starting page number set to 1 INFO: Partitioning set #1 (pages 1-2). INFO: Partitioning 9 files with 2 page(s) each. INFO: Determined optimal split size of 2 pages. INFO: Partitioning set #7 (pages 19-21). INFO: Splitting pages 1 to 17 (17 total) INFO: Concurrency level set to 10 INFO: Allow failed set to 1 INFO: Partitioning set #2 (pages 3-4). INFO: Partitioning 1 file with 1 page(s). INFO: Partitioning 9 files with 2 page(s) each. INFO: Determined optimal split size of 2 pages. INFO: Concurrency level set to 10 INFO: Splitting pages 1 to 23 (23 total) INFO: Partitioning set #3 (pages 5-6). INFO: Partitioning 1 file with 1 page(s). INFO: Partitioning set #1 (pages 1-2). INFO: Partitioning 8 files with 2 page(s) each. INFO: Splitting pages 1 to 3 (3 total) INFO: Determined optimal split size of 3 pages. INFO: Partitioning set #4 (pages 7-8). INFO: Partitioning set #1 (pages 1-2). INFO: Partitioning set #2 (pages 3-4). INFO: Partitioning 1 file with 1 page(s). INFO: Determined optimal split size of 2 pages. INFO: Partitioning 7 files with 3 page(s) each. INFO: Partitioning set #5 (pages 9-10). INFO: Partitioning set #2 (pages 3-4). INFO: Partitioning set #3 (pages 5-6). INFO: Partitioning 1 files with 2 page(s) each. INFO: Partitioning set #1 (pages 1-2). INFO: Partitioning 1 file with 2 page(s). INFO: Partitioning set #6 (pages 11-12). INFO: Partitioning set #3 (pages 5-6). INFO: Partitioning set #4 (pages 7-8). INFO: Partitioning 1 file with 1 page(s). INFO: Partitioning set #2 (pages 3-4). 
INFO: Partitioning set #1 (pages 1-3). INFO: Partitioning set #7 (pages 13-14). INFO: Partitioning set #4 (pages 7-8). INFO: Partitioning set #5 (pages 9-10). INFO: Partitioning set #1 (pages 1-2). INFO: Partitioning set #3 (pages 5-6). INFO: Partitioning set #2 (pages 4-6). INFO: Partitioning set #8 (pages 15-16). INFO: Partitioning set #5 (pages 9-10). INFO: Partitioning set #6 (pages 11-12). INFO: Partitioning set #2 (pages 3-3). INFO: Partitioning set #4 (pages 7-8). INFO: Partitioning set #3 (pages 7-9). INFO: Partitioning set #9 (pages 17-18). INFO: Partitioning set #6 (pages 11-12). INFO: Partitioning set #7 (pages 13-14). INFO: Partitioning set #5 (pages 9-10). INFO: Partitioning set #4 (pages 10-12). INFO: Partitioning set #10 (pages 19-19). INFO: Partitioning set #7 (pages 13-14). INFO: Partitioning set #8 (pages 15-16). INFO: Partitioning set #6 (pages 11-12). INFO: Partitioning set #5 (pages 13-15). INFO: Partitioning set #8 (pages 15-16). INFO: Partitioning set #9 (pages 17-18). INFO: Partitioning set #7 (pages 13-14). INFO: Partitioning set #6 (pages 16-18). INFO: Partitioning set #9 (pages 17-18). INFO: Partitioning set #10 (pages 19-19). INFO: Partitioning set #8 (pages 15-16). INFO: Partitioning set #7 (pages 19-21). INFO: Partitioning set #10 (pages 19-19). INFO: Partitioning set #9 (pages 17-17). INFO: Partitioning set #8 (pages 22-23). INFO: HTTP Request: POST http://API_URL//general/v0/general "HTTP/1.1 502 Bad Gateway" ERROR: Request (page 3) failed with status code 502. Waiting to retry
Can someone help me with this one?