Zipstack / unstract

No-code LLM Platform to launch APIs and ETL Pipelines to structure unstructured documents
https://unstract.com
GNU Affero General Public License v3.0

fix: [ISSUE] "Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'" in API workflow execution #595

Open kun432 opened 3 months ago

kun432 commented 3 months ago

Describe the bug

I followed the "Getting Started" instructions, but API workflow execution fails every time with the same error:

Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'

To reproduce

Following "Getting Started" instructions, except:

The error occurs:

I always get the same error.

Expected behavior

Environment details

Additional context

I have 3 sample PDFs. Parsing all of them works perfectly in Prompt Studio. On the other hand, parsing any of them always fails in both the API and the Workflow.

Screenshots

API request from Postman SS_ 2024-08-19 22 50 41

and logs

59b5ece3a177-20240819

"Run Workflow"

9babcf679378-20240819

Prompt Studio works. SS_ 2024-08-19 22 12 27

Deepak-Kesavan commented 3 months ago

Hi @kun432 .

Thanks for trying out Unstract. The issue mentioned above was a regression noticed in v0.81.0 that broke the API deployment, and the fix went in with PR #592. Please try the latest version of Unstract (v0.81.1) and let us know whether all the issues you mentioned are resolved.

kun432 commented 3 months ago

@Deepak-Kesavan Thanks, but

I still get the same error in Workflow execution.

SS_ 2024-08-20 0 39 20

The API now returns a different error than before.

SS_ 2024-08-20 0 46 52

The logs then seem to have been truncated before execution finished.

SS_ 2024-08-20 0 48 25

My update steps:

$ docker compose -f docker/docker-compose.yaml down
$ docker rmi $(docker images | grep "unstract/backend" | awk '{print $3}')
$ ./run-platform.sh -u
(snip)
Fetching release tags.
Performing git checkout to v0.81.1.
Performing git pull on v0.81.1.
(snip)

I will remove all Unstract images other than the DB and try again.

$ docker rmi $(docker images | grep "unstract/" | awk '{print $3}')

kun432 commented 3 months ago

Still unresolved.

Deepak-Kesavan commented 3 months ago

Thanks for the update @kun432 . We will investigate this further and get back.

Please include the logs from unstract-worker if possible.
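
When going through those logs, the `Docker config:` lines that the worker prints are worth a look, because they show which host directory is bind-mounted at `/data` for each tool container. The following is a minimal sketch, not part of Unstract, that pulls the bind mounts out of such a line and lists what the source directory contains on the machine where it is run. It assumes the log format seen in this thread (a Python-style dict after `Docker config:`); the helper names `bind_mounts` and `describe_mount` and the sample path are hypothetical.

```python
import ast
import re
from pathlib import Path

# Matches the dict that follows "Docker config: " on a worker log line.
# Assumption: the dict is printed as a Python literal, as in the logs in this thread.
CONFIG_RE = re.compile(r"Docker config: (\{.*\})\s*$")


def bind_mounts(log_line):
    """Return (source, target) pairs for bind mounts found in one log line."""
    match = CONFIG_RE.search(log_line)
    if not match:
        return []
    config = ast.literal_eval(match.group(1))
    return [
        (mount["source"], mount["target"])
        for mount in config.get("mounts", [])
        if mount.get("type") == "bind"
    ]


def describe_mount(source):
    """List the entries of a bind-mount source directory, or report it missing."""
    path = Path(source)
    if not path.is_dir():
        return f"{source}: missing on this machine"
    names = sorted(entry.name for entry in path.iterdir())
    return f"{source}: {names}"


if __name__ == "__main__":
    # Hypothetical sample line; replace it with a real line from the worker log.
    sample = (
        "unstract-worker | INFO in docker: Docker config: "
        "{'name': 'example-tool', 'detach': True, "
        "'mounts': [{'type': 'bind', 'source': '/tmp/example-execution-dir', "
        "'target': '/data'}], 'labels': []}"
    )
    for source, target in bind_mounts(sample):
        print(f"{target} <- {describe_mount(source)}")
```

If the source directory is missing or does not contain the input file the tool expects under `/data`, that would point at the file staging or the mount path rather than at the tool itself. This is only a diagnostic aid under the assumptions above.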

kun432 commented 3 months ago

I removed everything, including all the container images, volumes, and networks, and tried again, but no luck.

Here's unstract-worker's log.

$ docker compose -f docker/docker-compose.yaml logs | grep unstract-worker
unstract-worker            | [2024-08-19 16:21:15 +0000] [9] [DEBUG] Current configuration:
unstract-worker              |   config: ./gunicorn.conf.py
unstract-worker              |   wsgi_app: None
unstract-worker              |   bind: ['0.0.0.0:5002']
unstract-worker              |   backlog: 2048
unstract-worker              |   workers: 2
unstract-worker              |   worker_class: gevent
unstract-worker              |   threads: 2
unstract-worker              |   worker_connections: 1000
unstract-worker              |   max_requests: 0
unstract-worker              |   max_requests_jitter: 0
unstract-worker              |   timeout: 900
unstract-worker              |   graceful_timeout: 30
unstract-worker              |   keepalive: 2
unstract-worker              |   limit_request_line: 4094
unstract-worker              |   limit_request_fields: 100
unstract-worker              |   limit_request_field_size: 8190
unstract-worker              |   reload: False
unstract-worker              |   reload_engine: auto
unstract-worker              |   reload_extra_files: []
unstract-worker              |   spew: False
unstract-worker              |   check_config: False
unstract-worker              |   print_config: False
unstract-worker              |   preload_app: False
unstract-worker              |   sendfile: None
unstract-worker              |   reuse_port: False
unstract-worker              |   chdir: /app
unstract-worker              |   daemon: False
unstract-worker              |   raw_env: []
unstract-worker              |   pidfile: None
unstract-worker              |   worker_tmp_dir: None
unstract-worker              |   user: 0
unstract-worker              |   group: 0
unstract-worker              |   umask: 0
unstract-worker              |   initgroups: False
unstract-worker              |   tmp_upload_dir: None
unstract-worker              |   secure_scheme_headers: {'X-FORWARDED-PROTOCOL': 'ssl', 'X-FORWARDED-PROTO': 'https', 'X-FORWARDED-SSL': 'on'}
unstract-worker              |   forwarded_allow_ips: ['127.0.0.1', '::1']
unstract-worker              |   accesslog: -
unstract-worker              |   disable_redirect_access_to_syslog: False
unstract-worker              |   access_log_format: %(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"
unstract-worker              |   errorlog: -
unstract-worker              |   loglevel: debug
unstract-worker              |   capture_output: False
unstract-worker              |   logger_class: gunicorn.glogging.Logger
unstract-worker              |   logconfig: None
unstract-worker              |   logconfig_dict: {}
unstract-worker              |   logconfig_json: None
unstract-worker              |   syslog_addr: udp://localhost:514
unstract-worker              |   syslog: False
unstract-worker              |   syslog_prefix: None
unstract-worker              |   syslog_facility: user
unstract-worker              |   enable_stdio_inheritance: False
unstract-worker              |   statsd_host: None
unstract-worker              |   dogstatsd_tags:
unstract-worker              |   statsd_prefix:
unstract-worker              |   proc_name: None
unstract-worker              |   default_proc_name: unstract.worker.main:app
unstract-worker              |   pythonpath: None
unstract-worker              |   paste: None
unstract-worker              |   on_starting: <function OnStarting.on_starting at 0x2aaaac8d14c0>
unstract-worker              |   on_reload: <function OnReload.on_reload at 0x2aaaac8d15e0>
unstract-worker              |   when_ready: <function WhenReady.when_ready at 0x2aaaac8d1700>
unstract-worker              |   pre_fork: <function Prefork.pre_fork at 0x2aaaac8d1820>
unstract-worker              |   post_fork: <function Postfork.post_fork at 0x2aaaac8d1940>
unstract-worker              |   post_worker_init: <function PostWorkerInit.post_worker_init at 0x2aaaac8d1a60>
unstract-worker              |   worker_int: <function WorkerInt.worker_int at 0x2aaaac8d1b80>
unstract-worker              |   worker_abort: <function WorkerAbort.worker_abort at 0x2aaaac8d1ca0>
unstract-worker              |   pre_exec: <function PreExec.pre_exec at 0x2aaaac8d1dc0>
unstract-worker              |   pre_request: <function PreRequest.pre_request at 0x2aaaac8d1ee0>
unstract-worker              |   post_request: <function PostRequest.post_request at 0x2aaaac8d1f70>
unstract-worker              |   child_exit: <function ChildExit.child_exit at 0x2aaaac8f60d0>
unstract-worker              |   worker_exit: <function WorkerExit.worker_exit at 0x2aaaac8f61f0>
unstract-worker              |   nworkers_changed: <function NumWorkersChanged.nworkers_changed at 0x2aaaac8f6310>
unstract-worker              |   on_exit: <function OnExit.on_exit at 0x2aaaac8f6430>
unstract-worker              |   ssl_context: <function NewSSLContext.ssl_context at 0x2aaaac8f6550>
unstract-worker              |   proxy_protocol: False
unstract-worker              |   proxy_allow_ips: ['127.0.0.1', '::1']
unstract-worker              |   keyfile: None
unstract-worker              |   certfile: None
unstract-worker              |   ssl_version: 2
unstract-worker              |   cert_reqs: 0
unstract-worker              |   ca_certs: None
unstract-worker              |   suppress_ragged_eofs: True
unstract-worker              |   do_handshake_on_connect: False
unstract-worker              |   ciphers: None
unstract-worker              |   raw_paste_global_conf: []
unstract-worker              |   permit_obsolete_folding: False
unstract-worker              |   strip_header_spaces: False
unstract-worker              |   permit_unconventional_http_method: False
unstract-worker              |   permit_unconventional_http_version: False
unstract-worker              |   casefold_http_method: False
unstract-worker              |   forwarder_headers: ['SCRIPT_NAME', 'PATH_INFO']
unstract-worker              |   header_map: drop
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [INFO] Starting gunicorn 23.0.0
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [DEBUG] Arbiter booted
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [INFO] Listening at: http://0.0.0.0:5002 (9)
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [INFO] Using worker: gevent
unstract-worker              | [2024-08-19 16:21:15 +0000] [12] [INFO] Booting worker with pid: 12
unstract-worker              | [2024-08-19 16:21:15 +0000] [14] [INFO] Booting worker with pid: 14
unstract-worker              | [2024-08-19 16:21:15 +0000] [9] [DEBUG] 2 workers
unstract-worker              | [2024-08-19 16:42:09 +0000] [14] [DEBUG] POST /v1/api/container/run
unstract-worker              | [2024-08-19 16:42:09,625] INFO in docker: Image 'unstract/tool-structure:0.0.37' not found in the local system.
unstract-worker              | [2024-08-19 16:42:09,626] INFO in docker: Pulling the container: unstract/tool-structure:0.0.37
unstract-worker              | [2024-08-19 16:42:16,740] INFO in docker: CONTAINER PULL STATUS: Downloading - 9317ce34db73 : [===================================>               ]  3.146MB/4.415MB
unstract-worker              | [2024-08-19 16:42:21,440] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [==>                                                ]  20.97MB/383.1MB
unstract-worker              | [2024-08-19 16:42:26,540] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [===>                                               ]  29.36MB/383.1MB
unstract-worker              | [2024-08-19 16:42:31,543] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [=======>                                           ]  55.57MB/383.1MB
unstract-worker              | [2024-08-19 16:42:36,640] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [=========>                                         ]   73.4MB/383.1MB
unstract-worker              | [2024-08-19 16:42:41,645] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [================================>                  ]  174.1MB/267.4MB
unstract-worker              | [2024-08-19 16:42:46,743] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [=====================================>             ]  198.2MB/267.4MB
unstract-worker              | [2024-08-19 16:42:51,749] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [=========================================>         ]  220.2MB/267.4MB
unstract-worker              | [2024-08-19 16:42:56,844] INFO in docker: CONTAINER PULL STATUS: Downloading - 4ca670d7c17b : [=============================================>     ]  244.3MB/267.4MB
unstract-worker              | [2024-08-19 16:43:01,853] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [========================>                          ]  185.6MB/383.1MB
unstract-worker              | [2024-08-19 16:43:11,745] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [====================================>              ]  277.9MB/383.1MB
unstract-worker              | [2024-08-19 16:43:21,844] INFO in docker: CONTAINER PULL STATUS: Downloading - 92bcef436858 : [================================================>  ]  372.2MB/383.1MB
unstract-worker              | [2024-08-19 16:43:31,414] INFO in docker: Finished pulling the container: unstract/tool-structure:0.0.37
unstract-worker              | [2024-08-19 16:43:31,417] INFO in docker: Docker config: {'name': 'unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/3e48657a-22c3-4fe5-81dc-e42789e3c465', 'target': '/data'}], 'labels': []}
unstract-worker              | [2024-08-19 16:43:32,118] INFO in worker: Running Docker container: unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3
unstract-worker              | [2024-08-19 16:43:59,247] ERROR in worker: Error while running docker container unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | ERROR:unstract.worker.main:Error while running docker container unstract-tool-structure-01f76e24-2112-4085-b474-f8a82e02c2a3: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | 192.168.112.14 - - [19/Aug/2024:16:43:59 +0000] "POST /v1/api/container/run HTTP/1.1" 200 132 "-" "python-requests/2.31.0"
unstract-worker              | [2024-08-19 16:43:59 +0000] [14] [DEBUG] Closing connection.
unstract-worker              | [2024-08-19 16:44:52 +0000] [12] [DEBUG] POST /v1/api/container/run
unstract-worker              | [2024-08-19 16:44:53,006] INFO in docker: Image 'unstract/tool-structure:0.0.37' found in the local system.
unstract-worker              | [2024-08-19 16:44:53,007] INFO in docker: Docker config: {'name': 'unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/914b1f13-ce9b-4716-8996-d8dd154023c6', 'target': '/data'}], 'labels': []}
unstract-worker              | [2024-08-19 16:44:53,261] INFO in worker: Running Docker container: unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13
unstract-worker              | [2024-08-19 16:45:33 +0000] [14] [DEBUG] POST /v1/api/container/run
unstract-worker              | [2024-08-19 16:45:33,992] INFO in docker: Image 'unstract/tool-structure:0.0.37' found in the local system.
unstract-worker              | INFO:unstract.worker.main:Image 'unstract/tool-structure:0.0.37' found in the local system.
unstract-worker              | [2024-08-19 16:45:33,993] INFO in docker: Docker config: {'name': 'unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/814d91a5-e474-436e-88ca-94305bcc9e4d', 'target': '/data'}], 'labels': []}
unstract-worker              | INFO:unstract.worker.main:Docker config: {'name': 'unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26', 'image': 'unstract/tool-structure:0.0.37', 'command': ['--command', 'RUN', '--settings', '{"challenge_llm": "openai-gpt-4o-mini", "enable_challenge": false, "tool_instance_id": "39756bdd-b4da-4b5a-aedc-0e1840b66865", "prompt_registry_id": "6f71175a-d0f7-49e6-8e1d-eb1cc6f4dd96", "summarize_as_source": false, "challenge_llm_adapter_id": "2d8dd4ad-4963-49f1-ae8b-84930c0c7f95", "single_pass_extraction_mode": false}', '--log-level', 'DEBUG'], 'detach': True, 'stream': True, 'auto_remove': False, 'environment': {'PLATFORM_SERVICE_HOST': 'http://unstract-platform-service', 'PLATFORM_SERVICE_PORT': '3001', 'PLATFORM_SERVICE_API_KEY': '9407564b-d996-4a6e-bf83-19c518aa5240', 'PROMPT_HOST': 'http://unstract-prompt-service', 'PROMPT_PORT': '3003', 'X2TEXT_HOST': 'http://unstract-x2text-service', 'X2TEXT_PORT': '3004', 'ADAPTER_LLMW_POLL_INTERVAL': '30', 'ADAPTER_LLMW_MAX_POLLS': '1000', 'TOOL_DATA_DIR': '/data'}, 'stderr': True, 'stdout': True, 'network': 'unstract-network', 'mounts': [{'type': 'bind', 'source': '/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/814d91a5-e474-436e-88ca-94305bcc9e4d', 'target': '/data'}], 'labels': []}
unstract-worker              | [2024-08-19 16:45:34,230] INFO in worker: Running Docker container: unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26
unstract-worker              | INFO:unstract.worker.main:Running Docker container: unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26
unstract-worker              | [2024-08-19 16:45:44,519] ERROR in worker: Error while running docker container unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | ERROR:unstract.worker.main:Error while running docker container unstract-tool-structure-743c1a1e-09e3-4a16-829a-3774d949af13: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | 192.168.112.14 - - [19/Aug/2024:16:45:44 +0000] "POST /v1/api/container/run HTTP/1.1" 200 132 "-" "python-requests/2.31.0"
unstract-worker              | [2024-08-19 16:45:44 +0000] [12] [DEBUG] Ignoring EPIPE
unstract-worker              | [2024-08-19 16:46:04,721] ERROR in worker: Error while running docker container unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | ERROR:unstract.worker.main:Error while running docker container unstract-tool-structure-d0e9b87c-3d2e-4451-b5a7-99519bec3e26: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Traceback (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 216, in run_container
unstract-worker              |     self.stream_logs(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 47, in stream_logs
unstract-worker              |     self.process_log_message(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 85, in process_log_message
unstract-worker              |     raise ToolRunException(log_dict.get("log"))
unstract-worker              | unstract.worker.exception.ToolRunException: Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'
unstract-worker              | Stack (most recent call last):
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gevent/baseserver.py", line 34, in _handle_and_close_when_done
unstract-worker              |     return handle(*args_tuple)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 123, in handle
unstract-worker              |     super().handle(listener, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 54, in handle
unstract-worker              |     self.handle_request(listener_name, req, client, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/ggevent.py", line 127, in handle_request
unstract-worker              |     super().handle_request(listener_name, req, sock, addr)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 107, in handle_request
unstract-worker              |     respiter = self.wsgi(environ, resp.start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1498, in __call__
unstract-worker              |     return self.wsgi_app(environ, start_response)
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 1473, in wsgi_app
unstract-worker              |     response = self.full_dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 880, in full_dispatch_request
unstract-worker              |     rv = self.dispatch_request()
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/flask/app.py", line 865, in dispatch_request
unstract-worker              |     return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/main.py", line 36, in run_container
unstract-worker              |     result = worker.run_container(
unstract-worker              |   File "/app/.venv/lib/python3.9/site-packages/unstract/worker/worker.py", line 224, in run_container
unstract-worker              |     self.logger.error(
unstract-worker              | 192.168.112.14 - - [19/Aug/2024:16:46:05 +0000] "POST /v1/api/container/run HTTP/1.1" 200 132 "-" "python-requests/2.31.0"
unstract-worker              | [2024-08-19 16:46:05 +0000] [14] [DEBUG] Closing connection.
kun432 commented 3 months ago

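also some things I found:

  • Using Llama Parse as the text extractor, the "Getting Started" instructions with "credit card statements" didn't work even when I used the demo files. In that case, parsing failed a lot.
  • After changing the text extractor from Llama Parse to LLMWhisperer, the "Getting Started" instructions with "credit card statements" work perfectly.
  • Using LLMWhisperer as the text extractor with my Japanese invoice files seems to work fine, although the worker was sometimes killed.

Hope these are of some help.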

Deepak-Kesavan commented 3 months ago

Error fetching data and indexing: [Errno 2] No such file or directory: '/data/INFILE.pdf'

Regarding the above error: we use a file named INFILE, without an extension, but something is looking for INFILE with a .pdf extension. This looks like the issue you are facing when running the workflow or calling the API. I will see if I can replicate it on my machine and either provide a proper solution or raise a PR if this needs a fix.

Deepak-Kesavan commented 3 months ago

@kun432 I am unable to replicate the issue. I even tried renaming the file to Japanese text, and it ran successfully. Have you tried using a different file other than the one you are currently using?

Additionally, by setting REMOVE_CONTAINER_ON_EXIT=False in the worker's .env file, you can prevent the tool container from being removed, which might provide additional logs.
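
A minimal sketch of that step, assuming the worker's environment file sits at `worker/.env` in your checkout (that path is an assumption, adjust it to your setup). `set_env_flag` is a hypothetical helper, not part of Unstract; it only adds or updates a `KEY=value` line in a file.

```shell
# Hypothetical helper: set KEY=value in an env file.
# Replaces the existing entry if the key is already present, otherwise appends it.
set_env_flag() {
  file="$1"; key="$2"; value="$3"
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    # Rewrite through a temp file so this works with both GNU and BSD sed.
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" && mv "${file}.tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Assumed path; uncomment and adjust for your checkout.
# set_env_flag worker/.env REMOVE_CONTAINER_ON_EXIT False
```

After restarting the worker and reproducing the error, the retained tool container should show up in `docker ps -a --filter "name=unstract-tool-structure"` (the name prefix is taken from the worker log above), and `docker logs <container-name>` prints its output.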

ritwik-g commented 3 months ago

@kun432 in addition to what @Deepak-Kesavan suggested, can you also check whether there are any files in the folder below?

/Users/kun432/repository/unstract/docker/workflow_data/execution/mock_org/aa624005-4bcb-4fad-a0e3-1ff3dfe26cf8/3e48657a-22c3-4fe5-81dc-e42789e3c465

Please share the ls output on this folder.
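
Something like the following would capture it. This is only a sketch: `check_infile` is a hypothetical helper, and `EXEC_DIR` is a placeholder for the execution folder. It reports whether a file named `INFILE` or `INFILE.pdf` exists there, since the error message refers to the latter.

```shell
# Hypothetical helper: report which of the two candidate input names exist in a directory.
check_infile() {
  dir="$1"
  for name in INFILE INFILE.pdf; do
    if [ -e "$dir/$name" ]; then
      echo "$name: present"
    else
      echo "$name: missing"
    fi
  done
}

# Usage (set EXEC_DIR to the execution folder for your run):
# ls -la "$EXEC_DIR"
# check_infile "$EXEC_DIR"
```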

kun432 commented 3 months ago

@Deepak-Kesavan

I even tried renaming the file to Japanese text, and it ran successfully.

Does this mean you used Llama Parse as the text extractor?

Because, as I said before,

Using LLMWhisperer as the text extractor with my Japanese invoice files seems to work fine, although the worker was sometimes killed.

So I don't think this problem comes from my Japanese invoice PDF files.

Deepak-Kesavan commented 3 months ago

@Deepak-Kesavan

I even tried renaming the file to Japanese text, and it ran successfully.

Does this mean you used Llama Parse as the text extractor?

Because, as I said before,

Using LLMWhisperer as the text extractor with my Japanese invoice files seems to work fine, although the worker was sometimes killed.

So I don't think this problem comes from my Japanese invoice PDF files.

@kun432 I initially thought the issue might be due to the name of the PDF, but it seems that was not the case.

Could you please provide the information mentioned in the comments above by @ritwik-g and me so we can debug this further?

ritwik-g commented 3 months ago

also some things I found:

  • Using Llama Parse as the text extractor, the "Getting Started" instructions with "credit card statements" didn't work even when I used the demo files. In that case, parsing failed a lot.
  • After changing the text extractor from Llama Parse to LLMWhisperer, the "Getting Started" instructions with "credit card statements" work perfectly.
  • Using LLMWhisperer as the text extractor with my Japanese invoice files seems to work fine, although the worker was sometimes killed.

Hope these are of some help.

@kun432 I missed this message earlier. If your use case works fine with LLMWhisperer, then the issue might be that Llama Parse fails to parse Japanese text. Can you confirm whether the issue happens mainly with Llama Parse? If so, you might try using Llama Parse directly once to see whether the extraction works.

kun432 commented 3 months ago

@Deepak-Kesavan @ritwik-g

After removing the cloned repo, containers, images, and volumes and re-cloning, I set everything up from scratch and tested with Llama Parse as the text extractor. The same error happened again (it currently seems 100% reproducible).

I summarized my whole setup procedure and logs here: https://gist.github.com/kun432/a8d7238c9c1fd738aed5f7d7771ba4a5

kun432 commented 3 months ago

I added the result of using LLMWhisperer to the Gist above (see the last comment).

I missed this message earlier. If your use case works fine with LLMWhisperer, then the issue might be that Llama Parse fails to parse Japanese text. Can you confirm whether the issue happens mainly with Llama Parse? If so, you might try using Llama Parse directly once to see whether the extraction works.

Llama Parse CAN handle Japanese text.

SS_ 2024-08-23 17 20 09

result SS_ 2024-08-23 17 19 52

With Llama Parse, as shown in the screenshot attached to my first comment, extraction seems to work in Prompt Studio (this means the document was parsed using Llama Parse, right?).

So I guess something is wrong in workflow execution, and it shows up only when using Llama Parse.

ritwik-g commented 3 months ago

@kun432 yes, this might be a Llama Parse-specific problem. Thanks for the detailed reproduction steps. Let us take a look into this.

ritwik-g commented 3 months ago

@kun432 looks like this is an issue already reported by our QA. This is a high-priority bug, but we are working on some other critical items. We will pick this up as soon as possible.

For the time being, if you are able to use LLMWhisperer, please use it instead.