Open pmeier opened 2 months ago
I would love to hear from @nenb @blakerosenthal @dillonroach how this is done in the existing deployment.
We simplified things a little bit by passing a file that contains a list of S3 prefixes (s3://my-bucket/foo/bar/spam.pdf, ...
) as an additional argument to our CLI tool.
We also didn't use the Document
abstraction at all. Our files were stored on S3. And then our vectors were stored in an external vector database. Nothing was stored on the server that hosted the ragna
web API. We had generic handlers for different file formats (PDF, DOC, etc), and we used these extract the data based on the suffix for each filepath.
Can we assume that by calling ragna corpus ingest you want to ingest files from local disk?
This sounds like a job for fsspec
. The abstraction is a common interface for dealing with a filesystem, and then we don't need to worry if it's local, remote, whatever - it just needs to be supported by fsspec
.
Should we enforce that every Document class has a from_path classmethod in order for us to create an arbitrary Document subclass if we have nothing more than a path?
Playing Devil's Advocate here, why do we need to involve a Document
instance here for each file. Like our use-case, I think many users might not want to store their files on the server hosting the ragna
web API. In this case, the Document
abstraction is not needed?
- We simplified things a little bit by passing a file that contains a list of S3 prefixes (
s3://my-bucket/foo/bar/spam.pdf, ...
) as an additional argument to our CLI tool.
If instead of making the positional argument a root directory only, we allow passing an arbitrary amount of directories and files, we can easily get this behavior:
$ cat files.txt
s3://my-bucket/foo/bar/spam.pdf
s3://my-bucket/foo/bar/ham.pdf
s3://my-bucket/foo/bar/eggs.pdf
$ alias ragna-ingest="python -c 'import sys; print(sys.argv[1:])'"
$ ragna-ingest $(cat files.txt)
['s3://my-bucket/foo/bar/spam.pdf', 's3://my-bucket/foo/bar/ham.pdf', 's3://my-bucket/foo/bar/eggs.pdf']
This is standard CLI behavior and enables users to also glob files with the shell, e.g. ragna ingest foo/*.pdf
. Let's do it!
- We also didn't use the
Document
abstraction at all. Our files were stored on S3. And then our vectors were stored in an external vector database. Nothing was stored on the server that hosted theragna
web API. We had generic handlers for different file formats (PDF, DOC, etc), and we used these extract the data based on the suffix for each filepath.Playing Devil's Advocate here, why do we need to involve a
Document
instance here for each file. Like our use-case, I think many users might not want to store their files on the server hosting theragna
web API. In this case, theDocument
abstraction is not needed?
Maybe there is a misunderstanding what the Document
class actually is? A Document
object is mostly just a dataclass carrying its ID, name, and metadata. It also offers a .read()
method to get the file content (bytes). However, there is no need for a Document
to be located on the local file system. That is the job of LocalDocument
.
For an example, have a look at S3Document
from ragna-aws
. Apart from the fixed attributes, it only stores the "bucket"
as metadata and is able to read the file content from S3 directly.
To answer your question about why we need it here:
SourceStorage.store
method that is used in the document upload workflow. We should re-use it for the corpus CLI as well. And it requires list[Document]
as input.This sounds like a job for
fsspec
. The abstraction is a common interface for dealing with a filesystem, and then we don't need to worry if it's local, remote, whatever - it just needs to be supported byfsspec
.
Yeah, that is a good idea. However, I prefer using upath
. It relates to fsspec
as pathlib.Path
relates to os.path
and thus offers a better UX. And it is maintained by @andrewfulton9 :clap:
I wonder if we can merge Document
and LocalDocument
with this in the future. But I need to think that through first.
(After a short offline chat with @pmeier)
I had a vague concern around latency and storage for large corpuses (eg instantiating 100K Document
instances might not be a great thing to do). But after discussion, this is probably fine, as we will only be doing this for the store
method, and we can do this in batches. Additionally, it's probably useful to have metadata for each document stored in ragna
database, and even for large corpuses, this should not be a problem for the database.
I wonder if we can merge Document and LocalDocument with this in the future. But I need to think that through first.
I opened #251 at one point in case it's of interest here.
To be able to release the corpus API, we need a way for users to CRUD the corpora on a a given source storage. To make our lives a little easier, we are not targeting a UI for this yet, but should start with a CLI instead. Since the likely consumers of this feature are power users or admins, this is ok.
We can add a
ragna corpus
subcommand to the CLI. This in turn could have more subcommands:ragna corpus list
: List all available corporaragna corpus ingest
: Ingest some documents into a given corpus (more on this later)ragna corpus delete
: Delete a given corpusragna corpus metadata
: List all available metadata in a given corpusEach command needs the source storage the action should be applied to. We have a few options here that we potentially can implement all:
--source-storage
flag that accepts an import string similar to what we do in our config file, e.g.--source-storage ragna.source_storages.Chroma
--config
and only accept a--source-storage
listed there. Also allow passing the source storage by its display name similar to the API, since we know the options.--config
parameter and--source-storage
is not passed, offer the user an interactive list of available source storages to select fromragna corpus ingest
is the trickiest of them. IMO a reasonable default behavior would beLocalDocument.from_path
on each path for which we have an availableDocumentHandler
From there on it is just calling
SourceStorage.store
and injecting them into Ragna's database.The tricky part comes when opening this up to other behavior than just the default:
ragna corpus ingest
you want to ingest files from local disk?Document
class has afrom_path
classmethod in order for us to create an arbitraryDocument
subclass if we have nothing more than a path?I would love to hear from @nenb @blakerosenthal @dillonroach how this is done in the existing deployment.