chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Feature Request]: Custom persistent storage #757

Open Gandhi-Sagar opened 1 year ago

Gandhi-Sagar commented 1 year ago

Describe the problem

When running chromadb in a Docker environment, the collection storage is lost once the Docker container is brought down. I want to persist the collections (docs, embeddings, metadata, ids) to, say, Azure Blob Storage, AWS S3, or somewhere else.

Describe the proposed solution

When configuring the client, accept parameters that specify where to persist the data:

chroma_client = chromadb.Client(Settings(chroma_api_impl="rest",
                                        chroma_server_host="localhost",
                                        chroma_server_http_port="8000",
                                        AzureBlobStorageParams = {
                                            "connection_string": "DefaultEndpointsProtocol=https;AccountName=<account-name>;AccountKey=<account-key>;EndpointSuffix=core.windows.net",
                                            "container_name": "my-container"
                                        }
                                    ))

and, under the hood, integrate the relevant library (azure_cli in this case).

Alternatives considered

No response

Importance

I cannot use Chroma without it

Additional Information

No response

HammadB commented 10 months ago

Hi, could you elaborate on your use case? This isn't usually how databases support storage, so I'm curious what you are trying to achieve.

rlleshi commented 6 months ago

@HammadB So, for example, I've currently deployed a chromadb container via an Azure Container Instance.

If the instance is restarted, all the data is lost. I want to enable persistence to, say, Azure Blob Storage, so that the data survives a restart of the container instance.

CristianPFM commented 6 months ago

I need the same. Were you able to solve it?

rlleshi commented 6 months ago

Yes, I ended up mounting shared storage into the container and then setting the mount path as the persist_directory for Chroma.

CristianPFM commented 6 months ago

@rlleshi Can you help me with a mini-tutorial, please?

rlleshi commented 6 months ago

Sure. I use terraform for deployment, so here are the blocks:

Deploy ChromaDB as an Azure Container Instance:

resource "azurerm_container_group" "my_containers" {
  name                = "${local.name_prefix}Containers"
  location            = azurerm_resource_group.my_resource_group.location
  resource_group_name = azurerm_resource_group.my_resource_group.name
  ip_address_type     = "Public"
  dns_name_label      = "some-dns-name-label" # lowercase letters, digits, and hyphens only
  os_type             = "Linux"

  container {
    name   = "chromadb"
    image  = "chromadb/chroma"
    cpu    = "4"
    memory = "8"

    ports {
      port     = 8000
      protocol = "TCP"
    }

    volume { # mount shared storage account as a volume
      name                 = "${local.chroma_fileshare_name}-volume"
      mount_path           = local.chroma_mount_path
      read_only            = false
      share_name           = azurerm_storage_share.file_share.name
      storage_account_name = azurerm_storage_account.my_sa.name
      storage_account_key  = azurerm_storage_account.my_sa.primary_access_key
    }

    environment_variables = {
      persist_directory = local.chroma_mount_path # if you want to specify a different mount path
    }
  }
}

Storage Account:

resource "azurerm_storage_account" "my_sa" {
  name                     = "${local.name_prefix}sa"
  resource_group_name      = azurerm_resource_group.my_resource_group.name
  location                 = azurerm_resource_group.my_resource_group.location
  account_tier             = local.account_tier
  account_replication_type = local.account_replication_type
}

File Share:

resource "azurerm_storage_share" "file_share" {
  name                 = local.chroma_fileshare_name
  storage_account_name = azurerm_storage_account.my_sa.name
  quota                = local.storage_share_quota
}

All the values referenced as local.* can be defined in a locals.tf file.
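
To check that the mounted file share really keeps the data, a small probe along these lines could be run before and after restarting the container. This is only a sketch: the function names, the collection name "persistence_probe", and the record ID are made up for illustration, and `client` is assumed to be an already-constructed Chroma client that talks to the deployed server. The embedding is passed explicitly so no embedding model has to be downloaded.

```python
def write_probe(client, name="persistence_probe"):
    """Store one record with an explicit embedding and return the collection size."""
    collection = client.get_or_create_collection(name)
    collection.upsert(
        ids=["probe-1"],
        embeddings=[[0.0, 1.0, 0.0]],  # dummy vector, only used to mark the record
        documents=["written before restart"],
    )
    return collection.count()


def probe_survived(client, name="persistence_probe"):
    """Return True if the record written by write_probe is still stored."""
    collection = client.get_or_create_collection(name)
    return "probe-1" in collection.get(ids=["probe-1"])["ids"]
```

Call write_probe(client) once, restart the container instance, then call probe_survived(client) from a fresh client. If it returns False, the server is most likely not writing to the mounted path.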

lnkirkham-datasparq commented 3 months ago

@rlleshi Are you able to provide a bit more detail on how you define your chromadb client once you've deployed the container via Terraform?
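
This was not answered in the thread, but a client for a deployment like the one above could be sketched as follows. It assumes a recent chromadb release that provides `chromadb.HttpClient`, and that the container group is reachable at the default Azure Container Instances hostname of the form `<dns_name_label>.<region>.azurecontainer.io`. The helper names `aci_fqdn` and `connect` and the example label and region are hypothetical. Persistence is handled on the server side by the mounted volume, so the client itself needs no storage settings.

```python
def aci_fqdn(dns_name_label: str, region: str) -> str:
    """Build the public hostname of a container group that has a DNS name label."""
    return f"{dns_name_label}.{region}.azurecontainer.io"


def connect(dns_name_label: str, region: str, port: int = 8000):
    """Return a Chroma HTTP client pointed at the container instance."""
    # Imported lazily so aci_fqdn can be used without chromadb installed.
    import chromadb

    return chromadb.HttpClient(host=aci_fqdn(dns_name_label, region), port=port)
```

For example, `client = connect("some-dns-name-label", "westeurope")` followed by `client.get_or_create_collection("docs")`. Port 8000 matches the port exposed in the Terraform container block.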