eikek / docspell

Assist in organizing your piles of documents, resulting from scanners, e-mails and other sources, with minimal effort.
https://docspell.org
GNU Affero General Public License v3.0

Audit logging #2409

Open madduck opened 1 year ago

madduck commented 1 year ago

It would be really useful if Docspell logged every mutation of an item, i.e. change of metadata, deletion, undeletion, reprocessing, etc. For each audit log entry, it should save the new value, the previous value (where applicable), who initiated the change, any metadata about the connection (such as HTTP connection information), and, obviously, a timestamp.

v6ak commented 11 months ago

I can work on it if @eikek wants this feature. As it affects the code in many places, I suggest doing this:

  1. I implement audit logging for some specific cases (e.g. a single field, an adjustment by an addon, an adjustment by a user). It should require as few code modifications as possible outside the core audit-related code.
  2. We discuss it and adjust it.
  3. I implement it for all the other cases.
eikek commented 11 months ago

Hm, I think this is quite a big topic and lots of questions come to mind. Here are a few:

So in summary: given that it is potentially a significant change touching many places, and there is not much gain for the main audience, my feeling is to refrain from doing it for now.

madduck commented 11 months ago

Thanks @eikek for your views. I almost want to address your last question first, but I think it's better if I outline the design I had in mind when raising this issue, because I think that also answers many (if not all) of your other questions and shows that, ultimately, we're not changing that much, while the benefit is real, for everyone. Even families make mistakes, and having a log available can be very useful in such a situation.

My view of an audit logging system for Docspell is rooted in a table in the database. It's append-only. Every action (linked to an item, for now!) taken in the system yields a new row in this table. A removed tag is recorded, linked to an item, with a timestamp and possibly other information, such as the origin of the request and the authenticated user, if there is one. Same for any other change of metadata, state, or anything really.
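To make the idea concrete, here is a minimal sketch of such an append-only audit record in Scala. All names (`AuditEvent`, `AuditLog`, the field names) are hypothetical and not part of Docspell's actual code; a real implementation would write these rows to a database table instead of an in-memory buffer.

```scala
import java.time.Instant
import scala.collection.mutable.ListBuffer

// One row in the hypothetical audit table: who did what to which item, and when.
final case class AuditEvent(
  itemId: String,            // the item the mutation refers to
  action: String,            // e.g. "tag-removed", "name-changed"
  oldValue: Option[String],  // previous value, where applicable
  newValue: Option[String],  // new value, where applicable
  actor: Option[String],     // authenticated user, if there is one
  origin: Option[String],    // origin of the request, e.g. HTTP connection info
  timestamp: Instant
)

// Append-only log: rows can be added and read, never updated or deleted.
final class AuditLog {
  private val events = ListBuffer.empty[AuditEvent]

  def append(e: AuditEvent): Unit = events += e

  def forItem(itemId: String): List[AuditEvent] =
    events.filter(_.itemId == itemId).toList
}
```

The key design point is that the public surface offers only `append` and read access, which is what makes the log trustworthy as an audit trail.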

We're not looking at users, tag objects, or contacts for now. Yes, contacts would be interesting too, maybe next, but items are by far the most important. Let's only look at items.

So essentially, anywhere in the code where a database call is made relating to an item ID, we probably need to insert a call to an internal API that ultimately writes the database row. It might be a lot of places, but conceptually it's all the same: provide whatever state you have right now and describe (even in plain text) to the logging system what just happened. That's it… for phase 1.
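The shape of such an internal API might look like the sketch below. Everything here is illustrative: `Audit.record`, `Audit.Entry`, and the example mutation site `setItemName` are made-up names, not Docspell code, and the in-memory buffer stands in for the database insert.

```scala
import java.time.Instant
import scala.collection.mutable.ListBuffer

// Hypothetical internal audit API: one call per item-related mutation.
object Audit {
  final case class Entry(
    itemId: String,
    description: String,       // plain-text account of what just happened
    oldValue: Option[String],
    newValue: Option[String],
    actor: Option[String],
    at: Instant
  )

  private val rows = ListBuffer.empty[Entry]

  // In the real system this would INSERT a row into the audit table.
  def record(e: Entry): Unit = rows += e

  def entries: List[Entry] = rows.toList
}

// An example mutation site: wherever an item is changed, one audit call
// is added with whatever state is at hand at that point.
object ItemOps {
  def setItemName(itemId: String, oldName: String, newName: String, user: String): Unit = {
    // ... the actual database update of the item would happen here ...
    Audit.record(Audit.Entry(
      itemId,
      s"item name changed from '$oldName' to '$newName'",
      Some(oldName), Some(newName), Some(user), Instant.now()
    ))
  }
}
```

The point of the sketch is how small the footprint at each call site is: one extra statement per mutation, with no changes to the surrounding logic.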

Phase 2 is the hard phase, because the question of how to meaningfully make this information available to the user(s) depends very much on your audience. But that's fine; we don't need this phase yet, at least not until we have reaped the fruits of phase 1.

For me, the most important thing is that we start logging sooner rather than later: get the information before it's lost. We can worry about what to do with it later. Meanwhile, it's not going to flood the system or cause much of a performance hit, because we're only ever appending rows to a database table…

And as to your first question about how to get started: no, we don't touch any items, or any part of the database outside our newly created table. We only start logging once we have the information. Nobody can currently discern in any meaningful way how a document arrived at a certain state; that information has been lost forever.

PS: My answer to question 2 is: yes, the goal is to have everything covered. If something happens that isn't covered yet, we insert the logging call at the point in the control path where we have the most state information.

PPS: PostgreSQL could be made to log anything, but that comes with two issues:

  1. It'd be PostgreSQL-specific, and you'd also need a solution for the other supported databases;
  2. PostgreSQL does not have all the information available, such as the authenticated user. So the log would contain less information and be correspondingly less useful.