deepset-ai / haystack-integrations

🚀 A list of Haystack Integrations, maintained by the community or deepset.

Add jaguar-document-store.md #117

Open fserv opened 7 months ago

fserv commented 7 months ago

This PR adds a new document store for the Haystack framework. The Jaguar document store is a distributed database that scales horizontally with ease (instant horizontal scaling through its ZeroMove mechanism). It can store documents, vectors, and blob data. It can also detect anomalous documents and search for similar documents with time-decay modulation. The software supports multi-tenant multi-member, single-tenant multi-member, and single-tenant single-member cloud operation models.

fserv commented 7 months ago

@bilgeyucel Thanks so much for the suggested changes, which are really helpful. We updated jaguar-document-store.md and added JaguarEmbeddingRetriever to retriever.py, in the /src/jaguar_haystack/document_stores directory of https://github.com/fserv/haystack-integrations. The other suggested changes are also included in the new commit. Could you please review again? Thanks.

fserv commented 7 months ago

@bilgeyucel Thanks for all the pointers and suggestions! All the needed changes have been made and pushed. The default "all-mpnet-base-v2" embedding model has a dimension of 768, which may have caused the js['data'] error. Also, please make sure the Jaguar server and its HTTP gateway server are up and running after the "docker pull jaguardb/jaguardb_with_http; docker run -d -p 8888:8888 -p 8080:8080 --name jaguardb_with_http jaguardb/jaguardb_with_http" commands (these may require sudo on your system).
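
A dimension mismatch of this kind can be caught before anything is written to the store. The following is only a minimal sketch in plain Python: the function name find_dimension_mismatches is hypothetical and not part of jaguar-haystack, and the embeddings are represented as plain lists of floats rather than Haystack Document objects.

```python
def find_dimension_mismatches(embeddings, expected_dim):
    """Return (index, actual_length) for each embedding whose length
    differs from the dimension the store was configured with."""
    return [
        (i, len(vec))
        for i, vec in enumerate(embeddings)
        if len(vec) != expected_dim
    ]


# Made-up data: a store configured for 1536 dimensions receives
# one 1536-dimensional vector and one 768-dimensional vector.
embeddings = [[0.0] * 1536, [0.0] * 768]
mismatches = find_dimension_mismatches(embeddings, expected_dim=1536)
print(mismatches)  # [(1, 768)]
```

An empty result means every vector matches the configured vector_dimension; anything else points at the documents that would need re-embedding or a differently configured store.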

bilgeyucel commented 7 months ago

Hi @fserv, I get the same errors. Can you make sure that the new version of the package is published on pip?

Here's the code snippet I used to test:

from jaguar_haystack.jaguar import JaguarDocumentStore

url = "http://127.0.0.1:8080/fwww/"
pod = "vdb"
store = "haystack_test_store"
vector_index = "v"
vector_type = "cosine_fraction_float"
vector_dimension = 1536 # dim of "text-embedding-ada-002" by OpenAI
document_store = JaguarDocumentStore(
    pod,
    store,
    vector_index,
    vector_type,
    vector_dimension,
    url,
)
print(document_store.filter_documents({})) # Should return [] -> works ✅
print(document_store.count_documents()) # Should return 0 -> fails, throws jd = json.loads(js[0]) error

from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.dataclasses import Document

embedder = OpenAIDocumentEmbedder(api_key=OPENAI_API_KEY)
result = embedder.run(documents=[Document(content="Return of King Lear")])
document_store.write_documents(documents=result["documents"]) # should write the documents
print(document_store.count_documents()) # Should return 1

fserv commented 7 months ago

Hi @bilgeyucel, sorry about that. The package is now updated. You can run "pip install -U jaguar-haystack" to get the latest version. Thanks!

bilgeyucel commented 6 months ago

Hi @fserv, I can't seem to write_documents() into JaguarDocumentStore even with the new version. Here's the error:

Traceback (most recent call last):
  File "/Users/bilgeyucel/Documents/side-projects/jaguar-haystack/test.py", line 28, in <module>
    document_store.write_documents(documents=result["documents"]) # should write the documents
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/jaguar_haystack/jaguar.py", line 124, in write_documents
    zid = self.add_text(text, embedding, metadata, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/jaguar_haystack/jaguar.py", line 260, in add_text
    textcol = js["data"]
              ~~^^^^^^^^
KeyError: 'data'

Maybe there's something wrong with how I run the Docker image. These are the logs that I get in the container. I'm using a MacBook Air M2.

2024-01-30 21:17:03 Starting jaguardb in docker container
2024-01-30 21:17:44 Starting fwww/http server in docker container
2024-01-30 21:17:44 Restart netmap_server ...
2024-01-30 21:17:44 netmap_server is not running
2024-01-30 21:17:46 Restart pyai_server ...
2024-01-30 21:17:46 pyai_server is not running
2024-01-30 21:17:46 netmap_server is not running
2024-01-30 21:17:48 Restart lighttpd and fwww ...
2024-01-30 21:17:48 Stopping fwww ...
2024-01-30 21:17:48 pkill is /usr/bin/pkill
2024-01-30 21:17:48 pyai_server is not running
2024-01-30 21:17:49 Name: sentence-transformers
2024-01-30 21:17:49 Version: 2.2.2
2024-01-30 21:17:49 Summary: Multilingual text embeddings
2024-01-30 21:17:49 Home-page: https://github.com/UKPLab/sentence-transformers
2024-01-30 21:17:49 Author: Nils Reimers
2024-01-30 21:17:49 Author-email: info@nils-reimers.de
2024-01-30 21:17:49 License: Apache License 2.0
2024-01-30 21:17:49 Location: /usr/local/lib/python3.10/dist-packages
2024-01-30 21:17:49 Requires: huggingface-hub, nltk, numpy, scikit-learn, scipy, sentencepiece, torch, torchvision, tqdm, transformers
2024-01-30 21:17:49 Required-by: 
2024-01-30 21:17:49 Found sentence-transformers pip package, OK
2024-01-30 21:17:50 /home/jaguar/fwww/conf_dir/lighttpd.conf is found, OK
2024-01-30 21:17:50 /home/jaguar/fwww/bin_dir/lighttpd -f /home/jaguar/fwww/conf_dir/lighttpd.conf

fserv commented 6 months ago

Hi @bilgeyucel, most likely one of the pyai_server, netmap_server, or fwww server processes is not running properly. You can try:

  1. docker exec -it jaguardb_with_http /bin/bash
  2. ps aux|grep netmap
  3. ps aux|grep pyai
  4. ps aux|grep lighttp
  5. ps aux|grep fwww

If any server process is not up, you can do this:

cd /home/jaguar/fwww/bin_dir
./start_all_servers.sh

Then check again with the "ps aux|grep ..." commands above. If a process still does not start, there might be package issues or similar problems inside the container.
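
The same check can be scripted against captured "ps aux" output. This is a rough sketch only: the function name missing_servers is hypothetical, the search patterns are simply the grep patterns listed above, and the sample process listing is made up for illustration.

```python
def missing_servers(ps_output, patterns=("netmap", "pyai", "lighttp", "fwww")):
    """Return the patterns that match no process line in `ps aux` output.

    Matching is by substring, like `ps aux | grep <pattern>`, so a pattern
    also counts as present when it appears in another process's path.
    """
    # Ignore any grep processes so they do not count as matches.
    lines = [ln for ln in ps_output.splitlines() if "grep" not in ln]
    return [p for p in patterns if not any(p in ln for ln in lines)]


# Made-up listing: lighttpd and netmap_server are up, pyai_server is not.
sample = """USER PID COMMAND
jaguar 101 /home/jaguar/fwww/bin_dir/lighttpd -f /home/jaguar/fwww/conf_dir/lighttpd.conf
jaguar 102 netmap_server
"""
print(missing_servers(sample))  # ['pyai']
```

Note that "fwww" is reported as present here only because the lighttpd command line contains that path, which is the same behavior the grep commands would show.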

fserv commented 6 months ago

Hi @bilgeyucel, we did some debugging and found that our documentation was missing the document_store.login("demouser") and document_store.create() steps. Sorry for this error. The server startup messages are for reporting purposes only and can be ignored. We checked the Docker container on a Mac system and saw that all server processes started fine. The login() and create() steps have been added in the latest commits. Please add the login() and create() steps to your script too. Thanks!
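
The required call order can be sketched as follows. This is an illustration under stated assumptions, not the package's documented usage: the helper prepare_store and the RecordingStore stand-in are hypothetical, and login("demouser") and create() are called exactly as they are named in this thread. Any further arguments the real JaguarDocumentStore methods accept are not shown here.

```python
class RecordingStore:
    """Hypothetical stand-in that records calls.
    Replace it with a real JaguarDocumentStore instance."""

    def __init__(self):
        self.calls = []

    def login(self, user):
        self.calls.append(("login", user))

    def create(self):
        self.calls.append(("create",))


def prepare_store(document_store, user="demouser"):
    """Run login() and then create() before any read or write call."""
    document_store.login(user)
    document_store.create()
    return document_store


store = prepare_store(RecordingStore())
print(store.calls)  # [('login', 'demouser'), ('create',)]
```

In the test script earlier in this thread, these two calls would go right after the JaguarDocumentStore is constructed and before filter_documents(), count_documents(), or write_documents().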