Open thorwhalen opened 6 months ago
Perhaps even more useful (can harness parallelism)?
from functools import partial
from concurrent.futures import ThreadPoolExecutor

def vectorize(func, iterable=None, *, max_workers: int = 1):
    """Like builtin map, but returns a list,
    and if iterable is None, returns a partial function that can directly be applied
    to an iterable.
    This is useful, for instance, for making a data loader from any single-uri loader.
    Example:
    >>> vectorize(lambda x: x**2, [1,2,3])
    [1, 4, 9]
    >>> vectorized_square = vectorize(lambda x: x**2)
    >>> vectorized_square([1,2,3])
    [1, 4, 9]
    """
    if iterable is None:
        return partial(vectorize, func, max_workers=max_workers)
    if max_workers == 1:
        return list(map(func, iterable))
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as executor:
            return list(executor.map(func, iterable))
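A usage sketch of the data-loader case mentioned in the docstring. Here load_text is a hypothetical single-uri loader (it only reads a local text file), and the temporary files stand in for real uris; neither comes from chromadb. vectorize is repeated so the snippet runs on its own.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from pathlib import Path


def vectorize(func, iterable=None, *, max_workers: int = 1):
    """Same function as above, repeated so this snippet is self-contained."""
    if iterable is None:
        return partial(vectorize, func, max_workers=max_workers)
    if max_workers == 1:
        return list(map(func, iterable))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(func, iterable))


def load_text(uri: str) -> str:
    """Hypothetical single-uri loader: treats the uri as a local file path."""
    return Path(uri).read_text()


# A loader over many uris that reads up to 4 files concurrently.
text_loader = vectorize(load_text, max_workers=4)

with tempfile.TemporaryDirectory() as tmp:
    uris = []
    for name, content in [("a.txt", "alpha"), ("b.txt", "beta")]:
        path = Path(tmp) / name
        path.write_text(content)
        uris.append(str(path))
    # executor.map yields results in input order, so documents line up with uris.
    documents = text_loader(uris)

print(documents)
```

Threads mostly help when the single-uri loader is I/O bound (files, HTTP, S3); for CPU-bound loaders the default max_workers=1 path is likely just as fast.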
Assigning myself here to track the several ideas in this issue.
Describe the problem
I find that most of the time, I already have the data I want to vectorize stored somewhere, so copying it over to chromadb is not only wasteful but also exposes my system to a bunch of sync-maintenance nightmares.
Looking for a solution, I found this thing called data loaders that, if specified, will be applied to the uris to get documents (contents). The documentation doesn't make it obvious how I should make a data loader, but I found the DataLoader in the code, and (a single!) example of one (ImageLoader) in some "data_loaders.py" module. Apparently, it should be a callable that takes uris and returns their contents (which I guess will be interpreted as documents).
Firstly: I would say that a bit more documentation on this would be useful.
Second: Having a few more ready-to-use (and possibly parametrizable) data loaders would be useful.
Last: though I understand that using a data_loader that applies to an iterable of uris makes it easier to create optimized data loaders, I find that in practice, I'll have a function that takes a single uri and returns a document, and I want to use that function as a data_loader.
Describe the proposed solution
As for my last point: I created a vectorize function that will create a vectorized version of a "single item" data loader. In fact, this vectorize function could be useful beyond data loaders. We're dealing with a vector database here, and one might find more than one occasion to want a vectorized version of some function (for example, to get ids from uris, etc.).
Regarding the second point, I've created a FileLoader that will load local (text) files by default, but has a few parameters to easily make many other types of loaders (relative paths, without extension, from remote url, from pdf file, from s3, from a DB...).
Alternatives considered
We could also have the inner-mechanism of chromadb accept "single item" data loaders, and dynamically transform them to their vectorized counterpart. This could be seen as a departure from the "explicit over implicit" principle though, so should only be done if there's no ambiguity possible.
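To illustrate the ambiguity concern: a callable cannot reliably be inspected to tell whether it expects one uri or an iterable of them (a str is itself iterable), so one option is to make single-item loaders opt in explicitly. The sketch below assumes a hypothetical single_item marker and ensure_vectorized helper; neither exists in chromadb, and the two example loaders are made up for the demonstration.

```python
from typing import Callable, Iterable, List


def single_item(func: Callable) -> Callable:
    """Hypothetical marker: tag a loader as taking one uri at a time."""
    func._single_item = True
    return func


def ensure_vectorized(loader: Callable) -> Callable:
    """Return a loader that accepts an iterable of uris.

    Tagged single-item loaders are wrapped; untagged loaders are assumed
    to be vectorized already and are returned unchanged.
    """
    if getattr(loader, "_single_item", False):
        def vectorized(uris: Iterable[str]) -> List:
            return [loader(uri) for uri in uris]
        return vectorized
    return loader


@single_item
def shout(uri: str) -> str:
    """Made-up single-item loader: the 'document' is the upper-cased uri."""
    return uri.upper()


def batch_lengths(uris: Iterable[str]) -> List[int]:
    """Made-up loader that is already vectorized."""
    return [len(uri) for uri in uris]


print(ensure_vectorized(shout)(["a", "b"]))
print(ensure_vectorized(batch_lengths)(["a", "bb"]))
```

With an explicit marker the transformation stays opt-in, which keeps it closer to "explicit over implicit" than guessing from the function's signature would.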
Importance
would make my life easier
Additional Information
No response