adithya-s-k / omniparse

Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
https://docs.cognitivelab.in
GNU General Public License v3.0

Unexpected error: Connection errored out. #12

Closed simoma02 closed 3 weeks ago

simoma02 commented 3 weeks ago

Installation

Installed the server through Docker and launched it without GPU:

docker pull savatar101/omniparse:0.1
docker run -p 8000:8000 savatar101/omniparse:0.1
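For comparison, a GPU-enabled launch would look roughly like the following. This is a sketch assuming the NVIDIA Container Toolkit is installed on the host; the image tag is the same one used above:

```shell
# With the NVIDIA Container Toolkit installed, --gpus all exposes the host GPUs
# to the container (otherwise the "NVIDIA Driver was not detected" warning appears)
docker run --gpus all -p 8000:8000 savatar101/omniparse:0.1
```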

The server is up and running, and the web interface is accessible at http://localhost:8000/

Test

Interface

When I upload my PDF I'm met with the following non-blocking error in the devtools console:

index.js:2250 Uncaught (in promise) 
TypeError: Cannot set properties of undefined (setting 'abort_controller')
    at stream (index.js:2250:10)
    at UploadProgress.svelte:41:18
    at run (svelte.js:41:9)
    at Array.map (<anonymous>)
    at svelte.js:3182:48
    at flush (svelte.js:2141:5)

I select Semantic Chunking as the chunking strategy.

I click on Parse Document. After 10 seconds I'm met with an error on the interface, the server stops, and I get the following error in the devtools console:

index.js:1304 GET http://localhost:8000/queue/data?session_hash=7hn1yq6z7re net::ERR_INCOMPLETE_CHUNKED_ENCODING 200 (OK)
index.js:2036 Unexpected error Connection errored out. 

API

Python code:

import os
import requests

def parse_pdf(pdf_path):
    url = "http://localhost:8000/parse_document/pdf"
    # Use a context manager so the file handle is closed after the request
    with open(pdf_path, 'rb') as f:
        files = {'file': f}
        response = requests.post(url, files=files)

    if response.status_code == 200:
        return response.json()
    else:
        print(f"Failed to parse PDF. Status code: {response.status_code}")
        print(response.text)
        return None

def parse_pdfs_in_folder(folder_path):
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(folder_path, filename)
            print(f"Parsing {pdf_path}")
            result = parse_pdf(pdf_path)
            if result:
                print(f"Result for {pdf_path}:")
                print(result)
            print("\n")

if __name__ == "__main__":
    # Replace with the path to your folder containing PDFs
    folder_path = ""
    parse_pdfs_in_folder(folder_path)

I tried through the API and got the same result: the server stops running with the error: Unexpected error Connection errored out.
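To rule out the Python client, the same request can be reproduced with curl against the endpoint used in the script above (the file name here is a placeholder for any local PDF):

```shell
# -v shows whether the connection is dropped mid-response,
# matching the ERR_INCOMPLETE_CHUNKED_ENCODING seen in the browser
curl -v -F "file=@test.pdf" http://localhost:8000/parse_document/pdf
```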

Complete docker output

==========
== CUDA ==
==========

CUDA Version 11.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

/usr/local/lib/python3.10/dist-packages/pydantic/_internal/_fields.py:161: UserWarning: Field "model_list" has conflict with protected namespace "model_".

You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.
  warnings.warn(

       .88888.                      oo  888888ba
      d8'   `8b                         88    `8b
      88     88 88d8b.d8b. 88d888b. dP a88aaaa8P' .d8888b. 88d888b. .d8888b. .d8888b.
      88     88 88'`88'`88 88'  `88 88  88        88'  `88 88'  `88 Y8ooooo. 88ooood8
      Y8.   .8P 88  88  88 88    88 88  88        88.  .88 88             88 88.  ...
       `8888P'  dP  dP  dP dP    dP dP  dP        `88888P8 dP       `88888P' `88888P'

[LOG] ✅ Loading OCR Model
Loaded detection model vikp/surya_det2 on device cpu with dtype torch.float32
Loaded detection model vikp/surya_layout2 on device cpu with dtype torch.float32
Loaded reading order model vikp/surya_order on device cpu with dtype torch.float32
Loaded recognition model vikp/surya_rec on device cpu with dtype torch.float32
Loaded texify model to cpu with torch.float32 dtype
[LOG] ✅ Loading Vision Model
A new version of the following files was downloaded from https://huggingface.co/microsoft/Florence-2-base:
- configuration_florence2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Florence-2-base:
- modeling_florence2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Florence-2-base:
- processing_florence2.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
[LOG] ✅ Loading Audio Model
100%|███████████████████████████████████████| 461M/461M [00:34<00:00, 14.0MiB/s]
[LOG] ✅ Loading Web Crawler
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO:     172.17.0.1:46688 - "GET / HTTP/1.1" 200 OK
INFO:     172.17.0.1:46688 - "GET /theme.css?v=98e6f074f8ea1c8f23afa849d4ab06e7683210321bede27c710902965bd150a2 HTTP/1.1" 200 OK
INFO:     172.17.0.1:57796 - "POST /upload?upload_id=y5kiwbrzjr HTTP/1.1" 200 OK
INFO:     172.17.0.1:43742 - "POST /queue/join HTTP/1.1" 200 OK
INFO:     172.17.0.1:43742 - "GET /queue/data?session_hash=a6alxc3yhpk HTTP/1.1" 200 OK
Detecting bboxes:   0%|          | 0/1 [00:00<?, ?it/s]
filippotoso commented 3 weeks ago

Same issue here, please advise.

simoma02 commented 3 weeks ago

I was able to make it run on an Azure VM, with and without GPU.

adithya-s-k commented 3 weeks ago

Before writing client code, I would suggest using /docs and the UI to see if everything is working; after confirming that, it's better to test with custom client code.

I am currently working on a client library that will integrate with all the popular AI frameworks like LangChain and LlamaIndex.

Update coming soon👍🏼

simoma02 commented 2 weeks ago

Not sure how this responds to the issue. I did test with the UI as well. The problem is not with the API; the Docker container stops running on my local machine (Windows 10, no GPU). On an Azure VM running Linux (Ubuntu 20.04), the Docker container works.
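Since the container exits mid-parse on Windows but not on the Azure VM, one plausible suspect is the memory limit of Docker Desktop's VM. A quick diagnostic sketch (the container ID is a placeholder for the exited container's actual ID):

```shell
# Find the exited container and its exit code
docker ps -a

# OOMKilled=true would mean the container was killed for exceeding its memory limit
docker inspect <container-id> --format '{{.State.OOMKilled}} {{.State.ExitCode}}'

# Last log lines before the stop
docker logs <container-id>
```

If the container was OOM-killed, raising the memory available to Docker Desktop (or running with a higher `--memory` limit) may help, since the OCR and vision models are loaded entirely on CPU here.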