dgarnitz / vectorflow

VectorFlow is a high volume vector embedding pipeline that ingests raw data, transforms it into vectors and writes it to a vector DB of your choice.
https://www.getvectorflow.com/
Apache License 2.0

Feature deeplake #50

Closed EteimZ closed 1 year ago

EteimZ commented 1 year ago

What

This PR integrates Deep Lake with VectorFlow. The current integration only supports adding vectors to Deep Lake storage; subsequent PRs will integrate the other storage options.

Why

Closes #17.

Usage

To work with this implementation you will need an Activeloop (Deep Lake) account, and you will need to create an API token.

Set vector_db_type to DEEPLAKE and place your API token in the X-VectorDB-Key header. The index_name should be in the format hub://<activeloop_username>/<dataset_name>. The dataset doesn't need to exist; Deep Lake will create it for you, but you can also use an existing dataset.
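As a quick sanity check, the expected index_name shape can be validated before sending a request. This is a hypothetical helper for illustration, not part of VectorFlow:

```python
import re

# Hypothetical helper: validate the Deep Lake index_name format
# hub://<activeloop_username>/<dataset_name>
HUB_PATH = re.compile(r"^hub://([^/]+)/([^/]+)$")

def parse_index_name(index_name: str) -> tuple[str, str]:
    """Return (username, dataset_name), or raise ValueError on a bad path."""
    match = HUB_PATH.match(index_name)
    if match is None:
        raise ValueError(f"expected hub://<user>/<dataset>, got {index_name!r}")
    return match.group(1), match.group(2)
```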

Here's a sample request:

curl -X POST http://localhost:8000/embed \
  -H 'Content-Type: multipart/form-data' \
  -H "Authorization: my-internal-api-key" \
  -H "X-EmbeddingAPI-Key: your-embedding-key" \
  -H "X-VectorDB-Key: your-deeplake-api-key" \
  -F 'EmbeddingsMetadata={"embeddings_type": "OPEN_AI", "chunk_size": 256, "chunk_overlap": 128}' \
  -F 'SourceData=@./src/api/tests/fixtures/test_text.txt' \
  -F 'VectorDBMetadata={"vector_db_type": "DEEPLAKE", "index_name": "hub://eteimz/vectorflow", "environment": "us-west1-gcp-free"}'
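The two metadata form fields are JSON strings. A minimal sketch of assembling them in Python, with the field names and values taken from the sample request above (the values are illustrative):

```python
import json

# EmbeddingsMetadata: controls how the source text is chunked and embedded
embeddings_metadata = {
    "embeddings_type": "OPEN_AI",
    "chunk_size": 256,
    "chunk_overlap": 128,
}

# VectorDBMetadata: tells VectorFlow where to write the vectors.
# For Deep Lake, index_name uses the hub://<user>/<dataset> format.
vector_db_metadata = {
    "vector_db_type": "DEEPLAKE",
    "index_name": "hub://eteimz/vectorflow",
    "environment": "us-west1-gcp-free",
}

# The /embed endpoint expects these as JSON-encoded multipart form fields,
# alongside the SourceData file upload.
form_fields = {
    "EmbeddingsMetadata": json.dumps(embeddings_metadata),
    "VectorDBMetadata": json.dumps(vector_db_metadata),
}
```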

Verification

To verify that this PR works, I created a file named test_vectorflow.py with the following content:

import deeplake
import os

# ACTIVELOOP_TOKEN holds my Activeloop API token
token = os.environ['ACTIVELOOP_TOKEN']

# Try to load the dataset that VectorFlow writes to
ds = deeplake.load("hub://eteimz/my_vectorflow_dataset", token=token)

The environment variable contains my Activeloop token. Running the script above yields this:

[screenshot: deeplake_integration_1]

This shows that the specified dataset doesn't exist yet.

Then I used VectorFlow to create that dataset:

[screenshot: deeplake_integration_2]

Finally, I ran the script a second time:

[screenshot: deeplake_integration_3]

dgarnitz commented 1 year ago

@EteimZ is there a command or API call you can run in a separate script to verify that the data made it into Deep Lake? If so, can you screenshot that and add it as verification evidence? Thanks