corpus CLI - Githubissues

To be able to release the corpus API, we need a way for users to CRUD the corpora on a a given source storage. To make our lives a little easier, we are not targeting a UI for this yet, but should start with a CLI instead. Since the likely consumers of this feature are power users or admins, this is ok.

We can add a ragna corpus subcommand to the CLI. This in turn could have more subcommands:

ragna corpus list: List all available corpora
ragna corpus ingest: Ingest some documents into a given corpus (more on this later)
ragna corpus delete: Delete a given corpus
ragna corpus metadata: List all available metadata in a given corpus

Each command needs the source storage the action should be applied to. We have a few options here that we potentially can implement all:

Add a --source-storage flag that accepts an import string similar to what we do in our config file, e.g. --source-storage ragna.source_storages.Chroma
Allow passing a --config and only accept a --source-storage listed there. Also allow passing the source storage by its display name similar to the API, since we know the options.
If we have a --config parameter and --source-storage is not passed, offer the user an interactive list of available source storages to select from

ragna corpus ingest is the trickiest of them. IMO a reasonable default behavior would be

We recurse through a positionally supplied root directory
We call LocalDocument.from_path on each path for which we have an available DocumentHandler

From there on it is just calling SourceStorage.store and injecting them into Ragna's database.

The tricky part comes when opening this up to other behavior than just the default:

Can we assume that by calling ragna corpus ingest you want to ingest files from local disk?
Should we enforce that every Document class has a from_path classmethod in order for us to create an arbitrary Document subclass if we have nothing more than a path?

I would love to hear from @nenb @blakerosenthal @dillonroach how this is done in the existing deployment.

I would love to hear from @nenb @blakerosenthal @dillonroach how this is done in the existing deployment.

We simplified things a little bit by passing a file that contains a list of S3 prefixes (s3://my-bucket/foo/bar/spam.pdf, ...) as an additional argument to our CLI tool.
We also didn't use the Document abstraction at all. Our files were stored on S3. And then our vectors were stored in an external vector database. Nothing was stored on the server that hosted the ragna web API. We had generic handlers for different file formats (PDF, DOC, etc), and we used these extract the data based on the suffix for each filepath.

Can we assume that by calling ragna corpus ingest you want to ingest files from local disk?

This sounds like a job for fsspec. The abstraction is a common interface for dealing with a filesystem, and then we don't need to worry if it's local, remote, whatever - it just needs to be supported by fsspec.

Should we enforce that every Document class has a from_path classmethod in order for us to create an arbitrary Document subclass if we have nothing more than a path?

Playing Devil's Advocate here, why do we need to involve a Document instance here for each file. Like our use-case, I think many users might not want to store their files on the server hosting the ragna web API. In this case, the Document abstraction is not needed?

We simplified things a little bit by passing a file that contains a list of S3 prefixes (s3://my-bucket/foo/bar/spam.pdf, ...) as an additional argument to our CLI tool.

If instead of making the positional argument a root directory only, we allow passing an arbitrary amount of directories and files, we can easily get this behavior:

$ cat files.txt
s3://my-bucket/foo/bar/spam.pdf
s3://my-bucket/foo/bar/ham.pdf
s3://my-bucket/foo/bar/eggs.pdf
$ alias ragna-ingest="python -c 'import sys; print(sys.argv[1:])'"
$ ragna-ingest $(cat files.txt)
['s3://my-bucket/foo/bar/spam.pdf', 's3://my-bucket/foo/bar/ham.pdf', 's3://my-bucket/foo/bar/eggs.pdf']

This is standard CLI behavior and enables users to also glob files with the shell, e.g. ragna ingest foo/*.pdf. Let's do it!

We also didn't use the Document abstraction at all. Our files were stored on S3. And then our vectors were stored in an external vector database. Nothing was stored on the server that hosted the ragna web API. We had generic handlers for different file formats (PDF, DOC, etc), and we used these extract the data based on the suffix for each filepath.

Playing Devil's Advocate here, why do we need to involve a Document instance here for each file. Like our use-case, I think many users might not want to store their files on the server hosting the ragna web API. In this case, the Document abstraction is not needed?

Maybe there is a misunderstanding what the Document class actually is? A Document object is mostly just a dataclass carrying its ID, name, and metadata. It also offers a .read() method to get the file content (bytes). However, there is no need for a Document to be located on the local file system. That is the job of LocalDocument.

For an example, have a look at S3Document from ragna-aws. Apart from the fixed attributes, it only stores the "bucket" as metadata and is able to read the file content from S3 directly.

To answer your question about why we need it here:

We already have the SourceStorage.store method that is used in the document upload workflow. We should re-use it for the corpus CLI as well. And it requires list[Document] as input.
We have to register each document in the Ragna database. This is required to be able to view the sources for a given messages, e.g. #466.

This sounds like a job for fsspec. The abstraction is a common interface for dealing with a filesystem, and then we don't need to worry if it's local, remote, whatever - it just needs to be supported by fsspec.

Yeah, that is a good idea. However, I prefer using upath. It relates to fsspec as pathlib.Path relates to os.path and thus offers a better UX. And it is maintained by @andrewfulton9 :clap:

I wonder if we can merge Document and LocalDocument with this in the future. But I need to think that through first.

(After a short offline chat with @pmeier)

I had a vague concern around latency and storage for large corpuses (eg instantiating 100K Document instances might not be a great thing to do). But after discussion, this is probably fine, as we will only be doing this for the store method, and we can do this in batches. Additionally, it's probably useful to have metadata for each document stored in ragna database, and even for large corpuses, this should not be a problem for the database.

I wonder if we can merge Document and LocalDocument with this in the future. But I need to think that through first.

I opened #251 at one point in case it's of interest here.

Quansight / ragna

corpus CLI #488