Open pierrotsmnrd opened 1 month ago
Before I go over the individual proposals, one thing upfront: although we use `ragna.core.LocalDocument` by default, the user is free to use any subclass of `ragna.core.Document`.
The only metadata attached to a plain `Document` is the ID and the name of the document.
Subclasses can add more metadata to this, e.g. `LocalDocument` adds the path.
All this is to say: we need to differentiate between metadata that we can add to all `Document`s and metadata that only applies to `LocalDocument`s.
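To make the distinction concrete, here is a minimal sketch. `SketchDocument` and `SketchLocalDocument` are hypothetical stand-ins written for illustration; they are not Ragna's classes and do not reproduce their signatures. The sketch only shows the idea that every document carries an ID and a name, while a file-backed subclass can add a `"path"` entry to its metadata.

```python
import uuid
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a generic document (not Ragna's class)."""

    name: str
    id: uuid.UUID = field(default_factory=uuid.uuid4)
    metadata: dict = field(default_factory=dict)


@dataclass
class SketchLocalDocument(SketchDocument):
    """Hypothetical stand-in for a document backed by a local file."""

    @classmethod
    def from_path(cls, path):
        path = Path(path)
        # Only the file-backed subclass knows about a path.
        return cls(name=path.name, metadata={"path": str(path)})


plain = SketchDocument(name="report.pdf")
local = SketchLocalDocument.from_path("/tmp/report.pdf")

print(plain.metadata)  # no path available on the generic document
print(local.metadata)  # contains the "path" key
```

Any metadata proposed for all documents has to be derivable without the `"path"` entry, since only the subclass guarantees it.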
- original file path
Not applicable to `Document` and already available for `LocalDocument` under `metadata["path"]`
- filename
Available on all documents in Python with `document.name` and for `MetadataFilter` canonically with `"document_path"`
- complete file extension
- filesize
Maybe applicable to all documents, but certainly to `LocalDocument`. If we want to add it as metadata for all documents, I would like to have a compelling use case. I currently can't think of one.
- creation date
- last modification date
- date of upload
What format would the metadata be for these?
- username having uploaded the file
The username is stored in the Ragna DB, but not available for filtering.
What would be the use case here?
- filesize use case: it might be useful in order to keep, for example, "all the PDFs big enough to have images"
Let's start with adding that to `LocalDocument`. There we can be sure that the information is available. I'll send a PR.
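As a rough illustration of why the information is readily available for local files, the sketch below reads the size and last modification time from the filesystem using only the standard library. `local_file_metadata` and its dictionary keys are hypothetical names chosen for this example, not part of Ragna. Creation time is deliberately left out: `st_birthtime` exists only on some platforms, and on Unix `st_ctime` is the time of the last metadata change rather than creation.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def local_file_metadata(path):
    """Collect metadata that the filesystem reports for a local file."""
    path = Path(path)
    stat = path.stat()
    return {
        "path": str(path),
        "size": stat.st_size,  # size in bytes
        # Last modification time as a timezone-aware UTC datetime.
        "modified": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc),
    }


# Usage: write a small temporary file and inspect it.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "xyz.foo.bar"
    sample.write_bytes(b"hello")
    sample_meta = local_file_metadata(sample)
    print(sample_meta["size"], sample_meta["modified"])
```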
- creation / last modification / upload dates: I'd recommend the format `"%Y-%m-%d %H:%M:%S"`, so we can filter on all files uploaded after a given day, for example
I understand the intention, but how would that be implemented? We can't do numeric comparisons like `>` on strings?
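For reference, a small sketch of how the two representations behave in plain Python. Because `"%Y-%m-%d %H:%M:%S"` produces fixed-width, zero-padded fields ordered from most to least significant, comparing the strings lexicographically gives the same order as comparing the datetimes. Whether a particular metadata filter or storage backend supports ordering comparisons on strings is a separate question that this sketch does not answer; storing a numeric timestamp is the alternative shown at the end. The sample dates are made up for illustration.

```python
from datetime import datetime, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"

# Made-up upload times, for illustration only.
uploaded = [
    datetime(2024, 3, 9, 8, 30, 0, tzinfo=timezone.utc),
    datetime(2024, 11, 2, 17, 5, 0, tzinfo=timezone.utc),
    datetime(2023, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
]
as_strings = [d.strftime(FORMAT) for d in uploaded]

# Lexicographic order of the strings matches chronological order.
strings_sorted = sorted(as_strings)
dates_sorted = [d.strftime(FORMAT) for d in sorted(uploaded)]

# "Uploaded after a given day" expressed as a string comparison.
cutoff = "2024-01-01 00:00:00"
after_cutoff = [s for s in as_strings if s > cutoff]

# Numeric alternative: seconds since the epoch, comparable with > directly.
cutoff_ts = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
after_cutoff_ts = [d for d in uploaded if d.timestamp() > cutoff_ts]
```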
- username: the use case would be to filter only on documents uploaded by yourself, or by, let's say, the legal department, etc.
If this is required, IMO the user should just have their own corpus or use tags for the department.
To look at it from the other side: what if we have an admin upload documents for the organization? Is the username useful information in this case?
I'd leave this out for now until a concrete use case arises.
Feature description
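This issue lists the default metadata that would be nice to have:
- original file path
- filename
- complete file extension (e.g. `xyz.foo.bar` would have its complete file extension set to `foo.bar`, rather than just `bar`)
- filesize
- creation date
- last modification date
- date of upload
- username having uploaded the file

Not all metadata might be available when uploading a file; we need to figure out which ones are possible.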
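One possible way to compute the "complete file extension" described above is to join every suffix that `pathlib` reports. This is a sketch, not an agreed implementation, and `complete_extension` is a hypothetical helper name. One caveat: a dot inside the stem also counts as a suffix boundary, so a name such as `archive.v1.2.tar.gz` would yield `v1.2.tar.gz`.

```python
from pathlib import PurePath


def complete_extension(filename):
    """Return every suffix of the file name joined together, without the leading dot."""
    joined = "".join(PurePath(filename).suffixes)
    # Drop the single leading dot; an empty string stays empty.
    return joined[1:]


print(complete_extension("xyz.foo.bar"))  # foo.bar
print(complete_extension("report.pdf"))   # pdf
```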
Value and/or benefit
No response
Anything else?
No response