UnitVectorY-Labs / firestoreproto2json

Helper library to convert Firestore Protocol Buffer to JSON Object
Apache License 2.0

Add support for Vector field type #21

Open JaredHatfield opened 2 months ago

JaredHatfield commented 2 months ago

Firestore has introduced a new "Vector" field type, which is Pre-GA at this time. firestoreproto2json does not yet support this field type, so it will not be handled when generating the map object until support is added.

https://cloud.google.com/firestore/docs/concepts/data-types

JaredHatfield commented 1 month ago

There is not a lot of documentation for the new Vector capability that was added to Firestore, which is still pre-GA.

The Python SDK shows that the underlying Vector object is being written as a Map to the Firestore document. https://github.com/googleapis/python-firestore/blob/9521deddc5a4b16956f37136f84928ac99688022/google/cloud/firestore_v1/vector.py#L22C9-L24C82
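To make that Map shape concrete, here is a minimal sketch of the structure the linked SDK lines appear to produce. The helper name `vector_to_map` is hypothetical and is not part of the Python SDK; the `__type__` and `value` keys are the ones observed in the experiment in this comment.

```python
def vector_to_map(values):
    """Build the plain map that a Vector appears to be stored as.

    Hypothetical helper for illustration only, not an SDK function.
    """
    return {
        "__type__": "__vector__",               # marker identifying the map as a vector
        "value": [float(v) for v in values],   # components stored as doubles
    }


if __name__ == "__main__":
    print(vector_to_map([1, 2, 3]))
```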

I performed some experiments with the Python SDK to understand what is actually happening.

from firebase_admin import firestore, initialize_app
from google.cloud.firestore_v1.vector import Vector

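# With no arguments, initialize_app() uses Application Default Credentials.
app = initialize_app()
db = firestore.client(app=app)

def add_document():
    # Write one document with a plain string field and a Vector field.
    doc_ref = db.collection("embeddings").document("sample")
    doc_ref.set({
        "metadata": "foo",
        "embedding_field": Vector([1.0, 2.0, 3.0])
    })

if __name__ == "__main__":
    add_document()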

This code snippet is based on the documentation provided by Google for this new capability: https://cloud.google.com/firestore/docs/vector-search

The document then shows up in the Firestore console, where the field is displayed as the new Vector field type:

[Screenshot: Firestore console, 2024-05-12 9:07:29 AM]

Using a Cloud Function triggered by Firestore changes, the protocol buffer for this write can then be captured (shown here base64-encoded):

CtwBCkxwcm9qZWN0cy9maXJlc3RvcmVwcm90bzJqc29uL2RhdGFiYXNlcy8oZGVmYXVsdCkvZG9jdW1lbnRzL2VtYmVkZGluZ3Mvc2FtcGxlEhIKCG1ldGFkYXRhEgaKAQNmb28SXgoPZW1iZWRkaW5nX2ZpZWxkEksySQoZCghfX3R5cGVfXxINigEKX192ZWN0b3JfXwosCgV2YWx1ZRIjSiEKCRkAAAAAAADwPwoJGQAAAAAAAABACgkZAAAAAAAACEAaCwiw3P6xBhCYwLIRIgsIsNz+sQYQmMCyEQ==

The existing code in firestoreproto2json then serializes this into JSON successfully without any additional changes. Only the value portion of the JSON is shown below:

{
  "metadata": "foo",
  "embedding_field": {
    "__type__": "__vector__",
    "value": [
      1.0,
      2.0,
      3.0
    ]
  }
}

This makes sense given the previously referenced Python code snippet, which indicates that the vector is simply represented as a Map object with a __type__ attribute of __vector__ and a value array containing the vector values.
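As a cross-check on the captured payload, the following is a minimal sketch of a hand-rolled decoder that uses only the Python standard library. It is not how firestoreproto2json works. The field numbers are assumed from the public google.firestore.v1 and google.events.cloud.firestore.v1 proto definitions, only the value types present in this payload (string, double, map, array) are handled, and every function name is hypothetical.

```python
import base64
import struct

PAYLOAD = (
    "CtwBCkxwcm9qZWN0cy9maXJlc3RvcmVwcm90bzJqc29uL2RhdGFiYXNlcy8oZGVmYXVsdCkv"
    "ZG9jdW1lbnRzL2VtYmVkZGluZ3Mvc2FtcGxlEhIKCG1ldGFkYXRhEgaKAQNmb28SXgoPZW1i"
    "ZWRkaW5nX2ZpZWxkEksySQoZCghfX3R5cGVfXxINigEKX192ZWN0b3JfXwosCgV2YWx1ZRIj"
    "SiEKCRkAAAAAAADwPwoJGQAAAAAAAABACgkZAAAAAAAACEAaCwiw3P6xBhCYwLIRIgsIsNz+"
    "sQYQmMCyEQ=="
)


def read_varint(buf, pos):
    """Read a base-128 varint starting at pos; return (value, new_pos)."""
    result, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def iter_fields(buf):
    """Yield (field_number, raw_value) for each field of one message."""
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        number, wire_type = tag >> 3, tag & 7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit (used for doubles)
            value, pos = buf[pos:pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (strings, nested messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos:pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        yield number, value


def decode_map_entries(buf, entry_field):
    """Decode a map<string, Value> stored in the given field number."""
    out = {}
    for number, entry in iter_fields(buf):
        if number != entry_field:
            continue  # skip name, create_time, update_time, etc.
        key, value = None, None
        for n, raw in iter_fields(entry):
            if n == 1:    # map entry key
                key = raw.decode("utf-8")
            elif n == 2:  # map entry value
                value = decode_value(raw)
        out[key] = value
    return out


def decode_value(buf):
    """Decode a Firestore Value; field numbers are assumptions."""
    for number, raw in iter_fields(buf):
        if number == 17:  # string_value
            return raw.decode("utf-8")
        if number == 3:   # double_value
            return struct.unpack("<d", raw)[0]
        if number == 6:   # map_value (MapValue.fields is field 1)
            return decode_map_entries(raw, 1)
        if number == 9:   # array_value (ArrayValue.values is field 1)
            return [decode_value(v) for n, v in iter_fields(raw) if n == 1]
    raise ValueError("value type not handled by this sketch")


def decode_event(payload_b64):
    """Return the fields of the event's new document (event field 1)."""
    event = base64.b64decode(payload_b64)
    for number, document in iter_fields(event):
        if number == 1:  # assumed: DocumentEventData.value
            return decode_map_entries(document, 2)  # Document.fields
    return {}


if __name__ == "__main__":
    print(decode_event(PAYLOAD))
```

Walking the bytes this way shows which keys are actually present in the embedding_field map, independent of any JSON serializer.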


This means a decision is needed on how firestoreproto2json should turn this representation into JSON. The current implementation technically "works" in that it turns these values into valid JSON. However, as with other ambiguous translations between the protocol buffer and an ideal JSON representation, it makes sense to add an extension point so that the output can be customized. One such alternative representation would be just an array of numbers, without the wrapping object that carries __type__.
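One possible shape for such an extension point is sketched below in Python, to match the experiment above. This is not the library's API: `is_vector_map`, `transform_vectors`, and the `vector_hook` parameter are hypothetical names, and the detection rule (a map with exactly the keys `__type__` and `value`) is an assumption based on the payload captured here.

```python
from typing import Any, Callable, Optional


def is_vector_map(value: Any) -> bool:
    """Return True if value looks like the map form of a Firestore Vector."""
    return (
        isinstance(value, dict)
        and set(value) == {"__type__", "value"}
        and value["__type__"] == "__vector__"
        and isinstance(value["value"], list)
    )


def transform_vectors(
    value: Any, vector_hook: Optional[Callable[[list], Any]] = None
) -> Any:
    """Recursively rewrite vector maps using vector_hook.

    With no hook, the default representation (the wrapping map) is kept.
    """
    if is_vector_map(value):
        return vector_hook(value["value"]) if vector_hook else value
    if isinstance(value, dict):
        return {k: transform_vectors(v, vector_hook) for k, v in value.items()}
    if isinstance(value, list):
        return [transform_vectors(v, vector_hook) for v in value]
    return value


if __name__ == "__main__":
    doc = {
        "metadata": "foo",
        "embedding_field": {"__type__": "__vector__", "value": [1.0, 2.0, 3.0]},
    }
    # Passing list as the hook unwraps each vector into a bare array of numbers.
    print(transform_vectors(doc, vector_hook=list))
```

Keeping the wrapping map as the default would preserve the current output for existing users, while a caller who wants the bare array opts in by supplying a hook.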