chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Feature Request]: function vectorizer and FileLoader #1606

Open thorwhalen opened 6 months ago

thorwhalen commented 6 months ago

Describe the problem

I find that most of the time, I already have the data I want to vectorize stored somewhere -- therefore copying it over to chromadb is not only wasteful, but also exposes my system to a bunch of sync-maintenance nightmares.

Looking for a solution, I found this thing called data loaders that, if specified, will be applied to the uris to get documents (contents). The documentation doesn't make it obvious how I should make a data loader, but I found the DataLoader in the code, and (a single!) example of one (ImageLoader) in some "data_loaders.py" module.

Apparently, it should be a callable that takes uris and returns their contents (which I guess will be interpreted as documents).
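
To make that reading concrete, here is a minimal sketch of such a callable. It only illustrates the "uris in, contents out" shape described above; `TextFileLoader` is a hypothetical name, and the exact interface chromadb expects from a data loader is an assumption here, not something taken from its code.

```python
from pathlib import Path
from typing import List, Sequence


class TextFileLoader:
    """Hypothetical loader: treats each uri as a local path and returns its text."""

    def __init__(self, encoding: str = "utf-8"):
        self.encoding = encoding

    def __call__(self, uris: Sequence[str]) -> List[str]:
        # One content string per input uri, in the same order as the input.
        return [Path(uri).read_text(encoding=self.encoding) for uri in uris]
```

An instance would be called as `TextFileLoader()(["a.txt", "b.txt"])`, returning a list with the contents of the two files.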

First: a bit more documentation on this would be useful.

Second: having a few more ready-to-use (and possibly parametrizable) data loaders would be useful.

Last: though I understand that using a data_loader that applies to an iterable of uris makes it easier to create optimized data loaders, I find that in practice, I'll have a function that takes a single uri and returns a document, and I want to use that function as a data_loader.

Describe the proposed solution

As for my last point: I created a vectorize function that turns a "single item" data loader into its vectorized version.

from functools import partial

def vectorize(func, iterable=None):
    """Like builtin map, but returns a list, 
    and if iterable is None, returns a partial function that can directly be applied 
    to an iterable.

    Example:

    >>> vectorize(lambda x: x**2, [1,2,3])
    [1, 4, 9]
    >>> vectorized_square = vectorize(lambda x: x**2)
    >>> vectorized_square([1,2,3])
    [1, 4, 9]
    """
    if iterable is None:
        return partial(vectorize, func)
    return list(map(func, iterable))

In fact, this vectorize function could be useful beyond data loaders only. We're dealing with a vector database here, and one might find more than one occasion to want to get a vectorized version of some function (for example, to get ids from uris, etc.).
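
As an illustration of the "ids from uris" case, here is a small sketch. `uri_to_id` is a hypothetical helper (a truncated SHA-1 of the uri, chosen only for the example), and `vectorize` is repeated so the snippet runs on its own.

```python
import hashlib
from functools import partial


def vectorize(func, iterable=None):
    """Same helper as above, repeated so this snippet is self-contained."""
    if iterable is None:
        return partial(vectorize, func)
    return list(map(func, iterable))


def uri_to_id(uri: str) -> str:
    """Hypothetical single-item function: derive a stable id from one uri."""
    return hashlib.sha1(uri.encode("utf-8")).hexdigest()[:12]


# The vectorized counterpart maps a whole list of uris to a list of ids.
uris_to_ids = vectorize(uri_to_id)
ids = uris_to_ids(["docs/a.txt", "docs/b.txt"])
```

The same uri always yields the same id, which is the property one would want if the ids are used to avoid inserting the same source twice.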

Regarding the second point, I've created a FileLoader that will load local (text) files by default, but has a few parameters to easily make many other types of loaders (relative paths, without extension, from remote url, from pdf file, from s3, from a DB...).
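
The FileLoader code itself is not included in this issue, so the following is only a sketch of what such a parametrizable loader might look like. The parameter names (`prefix`, `suffix`, `reader`) are assumptions made for illustration, not the actual implementation.

```python
from typing import Callable, Iterable, List


def read_text(path: str) -> str:
    """Default reader: return the content of a local text file."""
    with open(path, encoding="utf-8") as f:
        return f.read()


class FileLoader:
    """Hypothetical loader: builds a full key from each uri, then reads it."""

    def __init__(
        self,
        prefix: str = "",
        suffix: str = "",
        reader: Callable[[str], str] = read_text,
    ):
        self.prefix = prefix  # e.g. a root directory, for relative-path uris
        self.suffix = suffix  # e.g. ".txt", for uris given without extension
        self.reader = reader  # swap in a pdf, url, s3, or DB reader here

    def __call__(self, uris: Iterable[str]) -> List[str]:
        # One content string per uri, in input order.
        return [self.reader(f"{self.prefix}{uri}{self.suffix}") for uri in uris]
```

With this shape, `FileLoader(prefix="/data/docs/", suffix=".txt")(["intro"])` would read `/data/docs/intro.txt`, and passing a different `reader` changes where the content comes from without touching the rest.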

Alternatives considered

We could also have the inner mechanism of chromadb accept "single item" data loaders, and dynamically transform them to their vectorized counterpart. This could be seen as a departure from the "explicit over implicit" principle though, so it should only be done if there's no ambiguity possible.
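
One way to keep that unambiguous would be an explicit marker on single-item loaders. The sketch below is purely hypothetical: `single_item`, `as_data_loader`, and the `is_single_item` attribute are invented names, and nothing like this is known to exist in chromadb.

```python
def single_item(func):
    """Hypothetical decorator: flag func as taking ONE uri, not a list of uris."""
    func.is_single_item = True
    return func


def as_data_loader(loader):
    """Return a uris -> contents callable, wrapping only flagged loaders."""
    if getattr(loader, "is_single_item", False):
        return lambda uris: [loader(uri) for uri in uris]
    # Unflagged loaders are assumed to be vectorized already.
    return loader


@single_item
def load_one(uri):
    return f"content of {uri}"


loader = as_data_loader(load_one)
contents = loader(["a", "b"])
```

Because the conversion only happens when the flag is present, an ordinary vectorized loader passes through untouched and no guessing about the callable's signature is needed.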

Importance

would make my life easier

Additional Information

No response

thorwhalen commented 6 months ago

Perhaps even more useful (can harness parallelism)?

from functools import partial
from concurrent.futures import ThreadPoolExecutor

def vectorize(func, iterable=None, *, max_workers: int = 1):
    """Like builtin map, but returns a list,
    and if iterable is None, returns a partial function that can directly be applied
    to an iterable.
    This is useful, for instance, for making a data loader from any single-uri loader.

    Example:
    >>> vectorize(lambda x: x**2, [1,2,3])
    [1, 4, 9]
    >>> vectorized_square = vectorize(lambda x: x**2)
    >>> vectorized_square([1,2,3])
    [1, 4, 9]
    """
    if iterable is None:
        return partial(vectorize, func, max_workers=max_workers)
    if max_workers == 1:
        return list(map(func, iterable))
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            return list(executor.map(func, iterable))
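
The parallel branch relies on `executor.map` returning results in input order, even when some calls finish later than others. A small check of that behavior, with `slow_load` as a hypothetical stand-in for an I/O-bound single-uri loader:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_load(uri):
    """Hypothetical loader: the first uri is deliberately the slowest."""
    time.sleep(0.05 if uri == "a" else 0.0)
    return uri.upper()


with ThreadPoolExecutor(max_workers=3) as executor:
    # Results line up with the input uris, not with completion order.
    results = list(executor.map(slow_load, ["a", "b", "c"]))
```

Threads mostly help when the single-item loader waits on disk or network; for CPU-bound functions the gain is likely to be small.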

atroyn commented 3 months ago

Assigning myself here to track the several ideas in this issue.