JSv4 / OpenContracts

Mass document analytics platform based on LlamaIndex, Pgvector, React and Django.
https://JSv4.github.io/OpenContracts/
Apache License 2.0

[BUG] - Query of a document fails on what appears to be vector store step #158

Closed jessestevens5b closed 1 month ago

jessestevens5b commented 1 month ago

Querying a document fails, apparently at the vector store step:

run_query_on_create - instance: CorpusQuery object (4)
django_1           | Created... kick off
django_1           | Obj created: CorpusQuery object (4)
celeryworker_1     | [2024-07-15 05:26:33,248: INFO/SpawnProcess-1] Task opencontractserver.tasks.query_tasks.run_query[b5a57fc1-0d91-4a93-8089-733f0c92905b] received
celeryworker_1     | [2024-07-15 05:26:33,272: WARNING/ForkPoolWorker-1] run_query_on_create - instance: CorpusQuery object (4)
celeryworker_1     | [2024-07-15 05:26:33,272: INFO/ForkPoolWorker-1] Load pretrained SentenceTransformer: multi-qa-MiniLM-L6-cos-v1
django_1           | 172.19.0.1 - - [15/Jul/2024 05:26:33] "POST /graphql/ HTTP/1.1" 200 -
django_1           | 172.19.0.1 - - [15/Jul/2024 05:26:33] "POST /graphql/ HTTP/1.1" 200 -
celeryworker_1     | [2024-07-15 05:26:37,267: INFO/ForkPoolWorker-1] 2 prompts are loaded, with the keys: ['query', 'text']
celeryworker_1     | [2024-07-15 05:26:37,278: WARNING/ForkPoolWorker-1] Setting up vector store...
celeryworker_1     | [2024-07-15 05:26:37,286: WARNING/ForkPoolWorker-1] Vector store: stores_text=True is_embedding_query=True corpus_id='2' document_id=None must_have_text=None flat_metadata=False
celeryworker_1     | [2024-07-15 05:26:37,288: WARNING/ForkPoolWorker-1] Index: <llama_index.core.indices.vector_store.base.VectorStoreIndex object at 0x7f4f3a17b880>
celeryworker_1     | [2024-07-15 05:26:37,302: WARNING/ForkPoolWorker-1] Query engine: <llama_index.core.query_engine.citation_query_engine.CitationQueryEngine object at 0x7f4f3a17bac0>
Batches:   0%|          | 0/1 [00:00<?, ?it/s]
Batches: 100%|##########| 1/1 [00:02<00:00,  2.10s/it]
celeryworker_1     | [2024-07-15 05:26:39,455: WARNING/ForkPoolWorker-1] Query failed: list index out of range
celeryworker_1     | [2024-07-15 05:26:39,465: WARNING/ForkPoolWorker-1] run_query_on_create - instance: CorpusQuery object (4)
celeryworker_1     | [2024-07-15 05:26:39,470: INFO/ForkPoolWorker-1] Task opencontractserver.tasks.query_tasks.run_query[b5a57fc1-0d91-4a93-8089-733f0c92905b] succeeded in 6.219355003999226s: None
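
Note that the task is reported as "succeeded" even though the query failed. That pattern is what you would see if the task catches the exception, logs it, and returns normally. The following is a minimal sketch of that behaviour, not the actual OpenContracts code: `first_source_text` and `run_query` are hypothetical names, and an empty result list is only one assumed way to get this particular message.

```python
def first_source_text(results):
    # Hypothetical helper: indexing an empty list raises
    # IndexError("list index out of range").
    return results[0]["text"]


def run_query(results):
    """Stand-in for a task body that swallows its own errors."""
    try:
        return first_source_text(results)
    except Exception as exc:
        # The error is only logged, so the caller (e.g. a task runner)
        # sees a normal return value of None and reports success.
        print(f"Query failed: {exc}")
        return None


if __name__ == "__main__":
    run_query([])  # logs the failure, returns None
```

If this is roughly what happens, the "succeeded" line in the worker log says nothing about whether the query produced an answer.
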
JSv4 commented 1 month ago

Thanks for the report @jessestevens5b. Funny timing. I think I literally just fixed this with #155. Can you pull the latest main branch, rebuild and test again?

jessestevens5b commented 1 month ago

After updating to the latest main, I now get a different error:

frontend           | 172.18.0.1 - - [15/Jul/2024:07:38:53 +0000] "GET /static/media/default_doc_icon.0704b14dcc4a378609bf.jpg HTTP/1.1" 200 20599 "http://localhost:3000/documents/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0" "-"
celeryworker_1     | [2024-07-15 07:38:53,251: INFO/SpawnProcess-1] Task opencontractserver.tasks.doc_tasks.nlm_ingest_pdf[fb6119e4-76e4-433c-bba8-6a8e4d7022db] received
celeryworker_1     | [2024-07-15 07:38:53,252: INFO/ForkPoolWorker-1] Task opencontractserver.tasks.doc_tasks.extract_thumbnail[5de90682-ced1-4c7c-a4fc-dc9d9b67b24d] succeeded in 0.34711787199989885s: None
celeryworker_1     | [2024-07-15 07:38:53,254: INFO/ForkPoolWorker-1] nlm_ingest_pdf() - split doc 1 for user 2
nlm-ingestor       | 2024-07-15 07:42:28,584 [Thread-1 (pr] [WARNI]  Tika server returned status: 500
nlm-ingestor       | /usr/local/lib/python3.11/site-packages/numpy/core/fromnumeric.py:3464: RuntimeWarning: Mean of empty slice.
nlm-ingestor       |   return _methods._mean(a, axis=axis, dtype=dtype,
nlm-ingestor       | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:192: RuntimeWarning: invalid value encountered in scalar divide
nlm-ingestor       |   ret = ret.dtype.type(ret / rcount)
nlm-ingestor       | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:269: RuntimeWarning: Degrees of freedom <= 0 for slice
nlm-ingestor       |   ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
nlm-ingestor       | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:226: RuntimeWarning: invalid value encountered in divide
nlm-ingestor       |   arrmean = um.true_divide(arrmean, div, out=arrmean,
nlm-ingestor       | /usr/local/lib/python3.11/site-packages/numpy/core/_methods.py:261: RuntimeWarning: invalid value encountered in scalar divide
nlm-ingestor       |   ret = ret.dtype.type(ret / rcount)
nlm-ingestor       | 172.18.0.6 - - [15/Jul/2024 07:42:28] "POST /api/parseDocument?calculate_opencontracts_data=yes&applyOcr=no HTTP/1.1" 200 -
JSv4 commented 1 month ago

That appears to be an issue in the parser, nlm-ingestor (note the Tika 500 in its log), so the root cause looks to be upstream. Can you share the PDF with me so I can try to debug? In the meantime, I'd recommend opening an issue in nlm-ingestor as well.
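
To confirm the parser is at fault, the PDF can be posted straight to nlm-ingestor, bypassing OpenContracts. Here is a rough sketch using only the Python standard library. The path and query string are copied from the nlm-ingestor log line above. The host and port, the `file` form field name, and the helper names are assumptions, so adjust them to your compose setup.

```python
import uuid
import urllib.request

# Assumption: where the ingestor is reachable from your machine.
INGESTOR_URL = "http://localhost:5001"
# Path and query string as they appear in the nlm-ingestor log.
PARSE_PATH = "/api/parseDocument?calculate_opencontracts_data=yes&applyOcr=no"


def build_parse_request(pdf_bytes, filename="test.pdf", base_url=INGESTOR_URL):
    """Build (but do not send) a multipart POST carrying the PDF."""
    boundary = uuid.uuid4().hex
    # Assumption: the upload field is named "file".
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        base_url + PARSE_PATH,
        data=head + pdf_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def parse_pdf(path, base_url=INGESTOR_URL, timeout=600):
    """Send the PDF to the parser and return (status, response body)."""
    with open(path, "rb") as fh:
        req = build_parse_request(fh.read(), base_url=base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Calling `parse_pdf("problem.pdf")` while watching the nlm-ingestor container log should show whether the Tika 500 and the NumPy warnings appear for that document alone.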

JSv4 commented 1 month ago

Hey @jessestevens5b, I'm closing this for now since it's an upstream issue and, hopefully, limited to a specific document. Are you seeing this on other documents?