eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources, with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Audit logging #2409

Open madduck opened 1 year ago

madduck commented 1 year ago

It would be really useful if Docspell logged every mutation of an item, i.e. change of metadata, deletion, undeletion, reprocessing, etc. For each audit log entry, it should save the new value, the previous value (where applicable), who initiated the change, any metadata about the connection (such as HTTP connection information), and, obviously, a timestamp.

v6ak commented 11 months ago

I can work on it if @eikek wants this feature. As it affects the code in many places, I suggest doing this:

  1. I implement audit logging for some specific cases (e.g. a single field, an adjustment by an addon, an adjustment by a user). It should require as few code modifications as possible outside the core audit-related code.
  2. We discuss it and adjust it.
  3. I implement it for all the other cases.
eikek commented 11 months ago

Hm, I think this is quite a big topic and lots of questions come to mind. Here are a few:

So in summary: given that it is potentially a significant change touching many places, and there is not much gain for the main audience, my feeling is to refrain from doing it for now.

madduck commented 11 months ago

Thanks @eikek for your views. I almost want to address your last question first, but I think it's better if I outline the design I had in mind when raising this issue, because I think that also answers many (if not all) of your other questions and shows that, ultimately, we're not changing that much, while the benefit is real, for everyone. Even families make mistakes, and having a log available can be very useful in such a situation.

My view of an audit logging system for Docspell is rooted in a table in the database. It's append-only. Every action (linked to an item, for now!) taken in the system yields a new row in this table. A removed tag is recorded, linked to an item, with a timestamp and possibly other information, such as the origin of the request and the authenticated user, if there is one. Same for any other change of metadata, state, or anything really.
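To make the idea concrete, here is a minimal sketch of such an append-only audit record in Scala. All names (`AuditEvent`, `AuditLog`, the field names) are hypothetical and not part of Docspell's actual code; a real implementation would write these rows to a database table instead of an in-memory buffer.

```scala
import java.time.Instant
import scala.collection.mutable.ListBuffer

// One row in the hypothetical audit table: who did what to which item, and when.
final case class AuditEvent(
  itemId: String,            // the item the mutation refers to
  action: String,            // e.g. "tag-removed", "name-changed"
  oldValue: Option[String],  // previous value, where applicable
  newValue: Option[String],  // new value, where applicable
  actor: Option[String],     // authenticated user, if there is one
  origin: Option[String],    // origin of the request, e.g. HTTP connection info
  timestamp: Instant
)

// Append-only log: rows can be added and read, never updated or deleted.
final class AuditLog {
  private val events = ListBuffer.empty[AuditEvent]

  def append(e: AuditEvent): Unit = events += e

  def forItem(itemId: String): List[AuditEvent] =
    events.filter(_.itemId == itemId).toList
}
```

The key design point is that the public surface offers only `append` and read access, which is what makes the log trustworthy as an audit trail.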

We're not looking at users, tag objects, or contacts for now. Yes, contacts would be interesting too, maybe next, but items are by far the most important. Let's only look at items.

So essentially, anywhere in the code where a database call is made relating to an item ID, we probably need to insert a call to an internal API that ultimately writes the database row. It might be a lot of places, but conceptually it's all the same: provide whatever state you have right now and describe (even in plain text) to the logging system what just happened. That's it… for phase 1.
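The shape of such an internal API might look like the sketch below. Everything here is illustrative: `Audit.record`, `Audit.Entry`, and the example mutation site `setItemName` are made-up names, not Docspell code, and the in-memory buffer stands in for the database insert.

```scala
import java.time.Instant
import scala.collection.mutable.ListBuffer

// Hypothetical internal audit API: one call per item-related mutation.
object Audit {
  final case class Entry(
    itemId: String,
    description: String,       // plain-text account of what just happened
    oldValue: Option[String],
    newValue: Option[String],
    actor: Option[String],
    at: Instant
  )

  private val rows = ListBuffer.empty[Entry]

  // In the real system this would INSERT a row into the audit table.
  def record(e: Entry): Unit = rows += e

  def entries: List[Entry] = rows.toList
}

// An example mutation site: wherever an item is changed, one audit call
// is added with whatever state is at hand at that point.
object ItemOps {
  def setItemName(itemId: String, oldName: String, newName: String, user: String): Unit = {
    // ... the actual database update of the item would happen here ...
    Audit.record(Audit.Entry(
      itemId,
      s"item name changed from '$oldName' to '$newName'",
      Some(oldName), Some(newName), Some(user), Instant.now()
    ))
  }
}
```

The point of the sketch is how small the footprint at each call site is: one extra statement per mutation, with no changes to the surrounding logic.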

Phase 2 is the hard phase, because the question of how to meaningfully make this information available to the user(s) depends very much on your audience. But that's fine; we don't need this phase yet, at least not until we have reaped the fruits of phase 1.

For me, the most important thing is that we start logging sooner rather than later: get the information before it's lost. We can worry about what to do with it later. Meanwhile, it's not going to flood the system or cause much of a performance hit, because we're only ever appending rows to a database table…

And as to your first question about how to get started: no, we don't touch any items, or any part of the database outside our newly created table. We only start logging once we have the information. Nobody can currently discern in any meaningful way how a document arrived at a certain state; that information has been lost forever.

PS: My answer to question 2 is: yes, the goal is to have everything covered. If something happens that isn't covered yet, we insert the logging call at the point in the control path where we have the most state information.

PPS: PostgreSQL could be made to log anything, but that comes with two issues:

  1. It'd be PostgreSQL-specific, and you'd also need a solution for the other supported databases;
  2. PostgreSQL does not have all the information available, such as the authenticated user. So the log would contain less information and be correspondingly less useful.