Quansight / ragna

RAG orchestration framework ⛵️
https://ragna.chat
BSD 3-Clause "New" or "Revised" License

Default metadata #477

Open pierrotsmnrd opened 1 month ago

pierrotsmnrd commented 1 month ago

Feature description

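This issue lists the default metadata that would be nice to have:

  • original file path
  • filename
  • complete file extension
  • filesize
  • creation date
  • last modification date
  • date of upload
  • username having uploaded the file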

Not all of this metadata may be available when a file is uploaded, so we need to figure out which fields are feasible.

Value and/or benefit

No response

Anything else?

No response

pmeier commented 1 month ago

Before I go over the individual proposals, one thing upfront: although we use ragna.core.LocalDocument by default, the user is free to use any subclass of ragna.core.Document:

https://github.com/Quansight/ragna/blob/7071cf4fdaae03b89c837f6034dbac217dd81d72/ragna/deploy/_config.py#L146

https://github.com/Quansight/ragna/blob/7071cf4fdaae03b89c837f6034dbac217dd81d72/ragna/core/_document.py#L27

https://github.com/Quansight/ragna/blob/7071cf4fdaae03b89c837f6034dbac217dd81d72/ragna/core/_document.py#L82

The only metadata attached to a plain Document is the ID and the name of the document:

https://github.com/Quansight/ragna/blob/7071cf4fdaae03b89c837f6034dbac217dd81d72/ragna/core/_document.py#L38-L39

Subclasses can add more metadata to this, e.g. LocalDocument adds the path:

https://github.com/Quansight/ragna/blob/7071cf4fdaae03b89c837f6034dbac217dd81d72/ragna/core/_document.py#L112-L121

All this is to say: we need to differentiate between metadata that we can add to all Documents and metadata that only applies to LocalDocuments.
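
To make the distinction concrete, here is a minimal, self-contained sketch. `SketchDocument` and `SketchLocalDocument` are hypothetical stand-ins, not ragna's actual classes; they only mirror the split described above, where a plain document carries an ID and a name and the local subclass adds the path to its metadata.

```python
import uuid
from pathlib import Path
from typing import Any, Optional


class SketchDocument:
    """Hypothetical stand-in for a generic document: only an ID and a name."""

    def __init__(self, name: str, metadata: Optional[dict[str, Any]] = None) -> None:
        self.id = uuid.uuid4()
        self.name = name
        self.metadata = dict(metadata or {})


class SketchLocalDocument(SketchDocument):
    """Hypothetical stand-in for a document backed by a local file."""

    def __init__(self, path: Path) -> None:
        # Only a local document knows about a file system path,
        # so path-derived metadata belongs here, not on the base class.
        super().__init__(name=path.name, metadata={"path": str(path)})


generic = SketchDocument(name="report.pdf")
local = SketchLocalDocument(Path("/tmp/report.pdf"))
```

In this sketch, anything derived from the file system (path, size, timestamps) can only be attached in the local subclass, while the generic document has nothing beyond its ID and name to offer.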


  • original file path

Not applicable to Document and already available for LocalDocument under metadata["path"]

  • filename

Available on all documents in Python with document.name and for MetadataFilter canonically with "document_path"

  • complete file extension
  • filesize

Maybe applicable to all documents, but certainly to LocalDocument. If we want to add it as metadata for all documents, I would like to have a compelling use case. I currently can't think of one.
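
For a file on disk, both values can be read with the standard library. A minimal sketch, assuming the document is backed by a local path; the helper name `local_file_metadata` and the keys `extension` and `size` are illustrative and not part of ragna:

```python
import tempfile
from pathlib import Path


def local_file_metadata(path: Path) -> dict:
    """Collect the complete extension and the size of a local file."""
    return {
        # Joining all suffixes keeps compound extensions such as ".tar.gz".
        "extension": "".join(path.suffixes),
        # Size in bytes as reported by the file system.
        "size": path.stat().st_size,
    }


with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "archive.tar.gz"
    example.write_bytes(b"x" * 2048)
    example_metadata = local_file_metadata(example)
```

Note that `Path.suffixes` treats every dot-separated part as a suffix, so a name like `notes.v2.pdf` would yield `.v2.pdf` as its "complete" extension.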

  • creation date
  • last modification date
  • date of upload

What format would the metadata be for these?

  • username having uploaded the file

The username is stored in the Ragna DB, but not available for filtering.

https://github.com/Quansight/ragna/blob/7071cf4fdaae03b89c837f6034dbac217dd81d72/ragna/deploy/_api/orm.py#L55-L59

What would be the use case here?

pierrotsmnrd commented 1 month ago

pmeier commented 1 month ago

  • filesize use case: it might be useful for keeping, for example, "all the PDFs big enough to have images"

Let's start with adding that to LocalDocument. There we can be sure that the information is available. I'll send a PR.

  • creation / last modification / upload dates: I'd recommend the format "%Y-%m-%d %H:%M:%S", so that we can, for example, filter on all files uploaded after a given day

I understand the intention, but how would that be implemented? We can't do numeric comparisons like > on strings?
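
One way to look at this question: strings in the fixed-width, zero-padded format "%Y-%m-%d %H:%M:%S" sort lexicographically in chronological order, so ordering comparisons on them work in plain Python. Whether a given source storage supports such comparisons on string metadata is a separate matter, and storing a numeric POSIX timestamp would sidestep it. A minimal sketch of both options, with illustrative variable names:

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"

earlier = datetime(2024, 3, 9, 8, 30, 0, tzinfo=timezone.utc)
later = datetime(2024, 11, 21, 17, 5, 0, tzinfo=timezone.utc)

# Option 1: formatted strings. Zero padding makes lexicographic order
# match chronological order.
earlier_str = earlier.strftime(FORMAT)
later_str = later.strftime(FORMAT)
strings_ordered = earlier_str < later_str

# Option 2: POSIX timestamps. These are plain floats, so any numeric
# comparison operator applies.
earlier_ts = earlier.timestamp()
later_ts = later.timestamp()
numbers_ordered = earlier_ts < later_ts
```

The string form stays human-readable, while the numeric form only needs the numeric comparison operators that a filter backend is more likely to provide.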

  • username: the use case would be to filter only on documents uploaded by yourself, or by, let's say, the legal department, etc.

If this is required, IMO the user should just have their own corpus or use tags for the department.

To look at it from the other side: what if an admin uploads documents for the whole organization? Is the username useful information in that case?

I'd leave this out for now until a concrete use case arises.