deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

TypeError when inserting DocX converted Documents into PGVector #8251

Closed jlonge4 closed 3 weeks ago

jlonge4 commented 3 weeks ago

Describe the bug When attempting to insert DOCX-converted Document objects into PGVector, or when calling json.dumps() on the output of doc.to_dict(), a TypeError is raised indicating that datetime objects are not JSON serializable.

Error message

TypeError: Object of type datetime is not JSON serializable

Expected behavior The document objects should be successfully inserted into PGVector, and json.dumps() should be able to serialize the output of docs.to_dict() without errors.

Additional context This issue occurs when working with DOCX metadata, which includes datetime objects. The current implementation of the Document class's to_dict() method does not handle the serialization of datetime objects, leading to this error when attempting to convert the document to JSON or insert it into PGVector.
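The failure can be demonstrated without Haystack at all: the standard-library JSON encoder rejects any dict containing a datetime. A minimal sketch (illustrative metadata values, not actual converter output):

```python
import json
from datetime import datetime, timezone

# Metadata shaped like what a DOCX converter might produce (illustrative values).
meta = {
    "author": "Example User",
    "modified": datetime(2024, 6, 9, 21, 27, tzinfo=timezone.utc),
}

try:
    json.dumps(meta)
except TypeError as e:
    print(e)  # Object of type datetime is not JSON serializable

# A blunt user-side workaround: stringify unknown types via default=str.
print(json.dumps(meta, default=str))
```

The `default=str` trick sidesteps the error for ad-hoc debugging, but it does not help when the serialization happens inside psycopg during write_documents, which is why the fix belongs upstream of the document store.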

To Reproduce

  1. Convert a DOCX file to a Document object using the DOCXToDocument converter.
  2. Attempt to insert the resulting Document object into the PGVector document store, or call json.dumps() on the output of doc.to_dict().


anakin87 commented 3 weeks ago

Reproducible example

from haystack_integrations.document_stores.pgvector import PgvectorDocumentStore
from haystack.components.converters import DOCXToDocument

document_store = PgvectorDocumentStore(
    table_name="haystack_docs",
    embedding_dimension=768,
    vector_function="cosine_similarity",
    recreate_table=True,
    search_strategy="hnsw",
)

converter = DOCXToDocument()
results = converter.run(sources=["sample_docx.docx"])
documents = results["documents"]

document_store.write_documents(documents=documents)

Error:

Traceback (most recent call last):
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/try.py", line 23, in <module>
    document_store.write_documents(documents=documents)
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/src/haystack_integrations/document_stores/pgvector/document_store.py", line 446, in write_documents
    self.cursor.executemany(sql_insert, db_documents, returning=True)
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/.hatch/pgvector-haystack/lib/python3.10/site-packages/psycopg/cursor.py", line 758, in executemany
    self._conn.wait(
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/.hatch/pgvector-haystack/lib/python3.10/site-packages/psycopg/connection.py", line 969, in wait
    return waiting.wait(gen, self.pgconn.socket, timeout=timeout)
  File "psycopg_binary/_psycopg/waiting.pyx", line 190, in psycopg_binary._psycopg.wait_c
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/.hatch/pgvector-haystack/lib/python3.10/site-packages/psycopg/cursor.py", line 246, in _executemany_gen_pipeline
    pgq = self._convert_query(query, params)
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/.hatch/pgvector-haystack/lib/python3.10/site-packages/psycopg/cursor.py", line 483, in _convert_query
    pgq.convert(query, params)
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/.hatch/pgvector-haystack/lib/python3.10/site-packages/psycopg/_queries.py", line 94, in convert
    self.dump(vars)
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/.hatch/pgvector-haystack/lib/python3.10/site-packages/psycopg/_queries.py", line 105, in dump
    self.params = self._tx.dump_sequence(params, self._want_formats)
  File "psycopg_binary/_psycopg/transform.pyx", line 353, in psycopg_binary._psycopg.Transformer.dump_sequence
  File "psycopg_binary/_psycopg/transform.pyx", line 404, in psycopg_binary._psycopg.Transformer.dump_sequence
  File "/home/anakin87/apps/haystack-core-integrations/integrations/pgvector/.hatch/pgvector-haystack/lib/python3.10/site-packages/psycopg/types/json.py", line 151, in dump
    data = dumps(obj)
  File "/usr/lib/python3.10/json/__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "/usr/lib/python3.10/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable

anakin87 commented 3 weeks ago

We discussed this with @silvanocerza. The simplest solution seems to be to prevent the DOCXToDocument converter from inserting datetime objects into meta: it should convert them internally to strings.

jlonge4 commented 3 weeks ago

@anakin87 that's much more straightforward for sure. Will update accordingly.

anakin87 commented 3 weeks ago

@jlonge4 maybe tomorrow we will discuss this more in depth. I'll let you know...

jlonge4 commented 3 weeks ago

@anakin87 sounds good to me 😎

anakin87 commented 3 weeks ago

After investigating this in more depth, we realized that the issue is that we are creating non-JSON serializable metadata.

We should not do that https://github.com/deepset-ai/haystack/blob/aca8f09f7d5a9318c172b1b6e31fda64d85678d8/haystack/dataclasses/document.py#L64

To reproduce

from haystack.components.converters import DOCXToDocument
import json

converter = DOCXToDocument()
results = converter.run(sources=["./test/test_files/docx/sample_docx_1.docx"])
doc = results["documents"][0]

doc_dict = doc.to_dict(flatten=False)

print(doc_dict["meta"])

# {..., 'docx': {'author': 'Microsoft Office User', ..., 'modified': datetime.datetime(2024, 6, 9, 21, 27, tzinfo=datetime.timezone.utc), ...}}

print(json.dumps(doc_dict))
# TypeError: Object of type datetime is not JSON serializable

Solution: internally convert these metadata dates into strings.

Additionally, we know that some Document Stores (Chroma and Pinecone) do not support nested metadata, so it might make sense to store this docx meta information at the top level (not nested within the docx key).
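For stores without nested-metadata support, one option is to flatten the docx sub-dict into prefixed top-level keys. A sketch of that idea (the `flatten_meta` helper and the underscore-prefix scheme are assumptions, not an agreed design):

```python
def flatten_meta(meta: dict, sep: str = "_") -> dict:
    """Flatten one level of nesting into prefixed top-level keys (illustrative)."""
    flat = {}
    for key, value in meta.items():
        if isinstance(value, dict):
            # Prefix inner keys with the outer key, e.g. docx -> docx_author.
            for inner_key, inner_value in value.items():
                flat[f"{key}{sep}{inner_key}"] = inner_value
        else:
            flat[key] = value
    return flat


meta = {
    "file_path": "sample_docx.docx",
    "docx": {"author": "Microsoft Office User", "modified": "2024-06-09T21:27:00+00:00"},
}
print(flatten_meta(meta))
# {'file_path': 'sample_docx.docx', 'docx_author': 'Microsoft Office User', 'docx_modified': '2024-06-09T21:27:00+00:00'}
```

The prefix preserves provenance of each field while keeping the meta dict a single flat level, which is what stores like Chroma and Pinecone expect.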

anakin87 commented 3 weeks ago

For the time being, we decided to only convert dates into strings.

jlonge4 commented 3 weeks ago

@anakin87 awesome, that solves it from my end. I'll close the PR out. Thanks for this!