evolution: limit the size of interview logs and redesign them (move them to server-side only)

kaligrafy commented 2 years ago

Maybe drop old ones or alert...

kaligrafy commented 2 years ago

Having large logs can make the interview hang or be very slow

tahini commented 2 years ago

First, do logs even have to go to the frontend?

kaligrafy commented 2 years ago

It should not, but right now we just append to an array, so it does. This should be done server-side.

kaligrafy commented 2 years ago

and logs could be simplified a lot and keep the same information!

tahini commented 2 years ago

They are server-side only so that is not the issue, though it does seem to make the queries to update a bit slower

tahini commented 1 year ago

As part of the redesign and to ensure confidentiality, logs will go in a separate table, indexed to allow quick writing, one entry per log, with a timestamp, the survey ID, a nullable user ID (null is the participant), the field updated.

Should we log participant data and validated data in separate tables? Or add a boolean field in the single log table?
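
To make the proposal concrete, here is a minimal sketch of what one log row could look like, assuming the single-table option with a boolean flag (that choice is still open). All names (`InterviewLogEntry`, `toLogEntries`, `isValidatedData`) are hypothetical and not existing evolution code. Only the path of each updated field is recorded, never its value.

```typescript
// Hypothetical shape of one log row; illustrative only, not evolution's schema.
interface InterviewLogEntry {
    timestamp: number; // Unix time, in seconds
    surveyId: number;
    userId: number | null; // null means the participant made the edit
    field: string; // path of the updated field; the value is not stored
    isValidatedData: boolean; // assumption: one table with a flag, not two tables
}

// Turn one edit (a map of field path -> new value) into one entry per field,
// dropping the values for confidentiality.
function toLogEntries(
    surveyId: number,
    userId: number | null,
    valuesByPath: Record<string, unknown>,
    isValidatedData: boolean = false,
    timestamp: number = Math.floor(Date.now() / 1000)
): InterviewLogEntry[] {
    return Object.keys(valuesByPath).map((field) => ({
        timestamp,
        surveyId,
        userId,
        field,
        isValidatedData
    }));
}
```

One entry per field keeps each row small and append-only, which is the point of moving away from a growing array.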

tahini commented 1 year ago

In case we want to use the logs to get the exact amount of time a user spends on an interview, here's a post about sql queries that can be tweaked to return this info: https://stackoverflow.com/questions/30877926/how-to-group-following-rows-by-not-unique-value/30880137#30880137
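
As an illustration outside SQL, here is a minimal sketch that treats ordered log timestamps as runs of activity and sums the time inside those runs. The idle threshold (`maxIdleSeconds`) and the function name are assumptions; what counts as "still working on the interview" would have to be decided per survey.

```typescript
// Sum the time covered by consecutive log entries, ignoring any gap longer
// than maxIdleSeconds (assumed to mean the user walked away).
function activeSeconds(timestamps: number[], maxIdleSeconds: number): number {
    const sorted = [...timestamps].sort((a, b) => a - b);
    let total = 0;
    for (let i = 1; i < sorted.length; i++) {
        const gap = sorted[i] - sorted[i - 1];
        if (gap <= maxIdleSeconds) {
            total += gap;
        }
    }
    return total;
}
```

For example, with a 60-second threshold, entries at 0, 10, 20, 1000 and 1030 seconds count the 10 + 10 + 30 seconds of activity and skip the 980-second pause.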

tahini commented 1 year ago

@greenscientist You were mentioning that we could use tools like Prometheus for edit logging? I'd like your input on this issue.

Here are the requirements:

In order to analyze the flow of the interview after the survey, for example to study the "fardeau du répondant" (respondent burden), we need to know the sequence of actions, how much time was spent on each question, whether a given answer was changed multiple times, etc.

A "thin" version of the requirement is to be able to simply know how much time a user has spent on each section of the survey. In this case, it is not required to track each edit.
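
For this "thin" version, a sketch under assumptions: if we only record when each section is started, the time per section is the difference between consecutive section starts. The `SectionStart` type, the function name and the explicit `endTimestamp` (when the interview was completed or last seen) are hypothetical.

```typescript
// Hypothetical event: the user entered a section at a given time (in seconds).
interface SectionStart {
    section: string;
    timestamp: number;
}

// Time spent in each section, assuming a section lasts until the next one
// starts, and the last one lasts until endTimestamp.
function secondsPerSection(starts: SectionStart[], endTimestamp: number): Record<string, number> {
    const ordered = [...starts].sort((a, b) => a.timestamp - b.timestamp);
    const totals: Record<string, number> = {};
    ordered.forEach((start, i) => {
        const next = i + 1 < ordered.length ? ordered[i + 1].timestamp : endTimestamp;
        // A section visited several times accumulates all its visits
        totals[start.section] = (totals[start.section] ?? 0) + (next - start.timestamp);
    });
    return totals;
}
```

This does not account for idle time inside a section, so it is an upper bound on the time actually spent.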

Another requirement is to track which users touched which interviews, either for validation or editing. That is for basic security issues (who accessed what), but also to track the work done by people. For specific roles (for example an 'interviewer' for phone surveys), one wants to be able to output, for each user, how many interviews they touched, how many they started, how many they completed, etc. (this last requirement is issue #43, but it is somewhat related to this one).
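
As a sketch of the kind of output meant here, assuming access events are available as one record per user action: the event shape, the action names and `statsByUser` are all hypothetical, and the real list of counters would come from issue #43.

```typescript
// Hypothetical access event; action names are assumptions.
interface AccessEvent {
    userId: number;
    interviewId: number;
    action: "started" | "edited" | "validated" | "completed";
}

interface UserStats {
    touched: number;
    started: number;
    completed: number;
}

// Count, per user, the distinct interviews touched, started and completed.
function statsByUser(events: AccessEvent[]): Record<number, UserStats> {
    type Seen = { touched: Record<number, true>; started: Record<number, true>; completed: Record<number, true> };
    const seen: Record<number, Seen> = {};
    for (const event of events) {
        let s = seen[event.userId];
        if (s === undefined) {
            s = { touched: {}, started: {}, completed: {} };
            seen[event.userId] = s;
        }
        s.touched[event.interviewId] = true; // any action counts as touching
        if (event.action === "started") s.started[event.interviewId] = true;
        if (event.action === "completed") s.completed[event.interviewId] = true;
    }
    const stats: Record<number, UserStats> = {};
    for (const key of Object.keys(seen)) {
        const s = seen[Number(key)];
        stats[Number(key)] = {
            touched: Object.keys(s.touched).length,
            started: Object.keys(s.started).length,
            completed: Object.keys(s.completed).length
        };
    }
    return stats;
}
```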

Note that each survey can have its own directives wrt what to track. The above requirements are examples for one ongoing survey, but other surveys may have others, or none at all. Evolution should be able to support whatever will allow the survey maintainers to properly get the information that they want/need.

Current implementation: Upon each edit, we save a timestamp, along with the fields that were changed/removed and their values. This is saved as an array in a JSON field in the interviews table. To track the time spent on each section, there's a _sections field in the responses, which has the timestamp and name of each section started and all actions done on the sections.

Problems:

This can take a lot of space and slows down saving the interview
The _sections object takes a lot of space in the responses field itself, while this is not related to the survey!
Saving the value can be a confidentiality issue, and it is not needed for the purpose of studying the program flow
It does not belong in the interviews table anyway

Possible naive solution:

Still use the database, but with separate tables, with one row per log, not indexed for easy insertion

But is there something more complete and purpose-built for this, that is not too hard to implement and integrate with evolution?

greenscientist commented 1 year ago

A quick remark on this. This is a user flow tracking "need". A tracing tool like Zipkin could be useful here. There might be better tools now, or something directly integrated with OpenTelemetry. Will investigate.

greenscientist commented 1 year ago

"Evolution should be able to support whatever" we need to stop being in the business of supporting everything in the world. We need more restricted specifications. Unless we want to spend all our time working on this.

greenscientist commented 1 year ago

Ok, we really have 3 separate requirements here. They can probably be implemented separately. One system could provide all of it, but it might complicate the solution too much, so we will try not to make it a priority.

First, the user flow: this is quite standard, and might be provided by standard tools that track user "movement" across a web application. But that implies that the user sends requests to the server; I don't know if a lot of processing is done client-side that is transparent to the server.
Second, the survey access audit trail. A few questions for this: who needs to access it? How often? Do you need a UI? Is it only for security investigations? A simple structured log could be written out and sent to a security expert as needed.
Third, the user activity statistics: we will need more details here, especially on how the information needs to be displayed, which would change the implementation method. Does the information need to be queried? Do we just need a "daily" report provided to some admin? PDF file, CSV, other? The idea is to know if we need some long-term storage of that or not.
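
For the second point, a minimal sketch of what "simple structured logging" could mean: one JSON object per line, appended to a file or stream. The record fields, the action names and both function names are assumptions for illustration, not an existing API.

```typescript
// Hypothetical audit record; field and action names are assumptions.
interface AuditRecord {
    timestamp: string; // ISO 8601, UTC
    userId: number;
    interviewId: number;
    action: "view" | "edit" | "validate";
}

// Build a record for an access that happens now (or at a given date).
function auditRecord(
    userId: number,
    interviewId: number,
    action: AuditRecord["action"],
    now: Date = new Date()
): AuditRecord {
    return { timestamp: now.toISOString(), userId, interviewId, action };
}

// One JSON object per line, easy to grep or to hand over as a file.
function formatAuditLine(record: AuditRecord): string {
    return JSON.stringify(record);
}
```

Such a file needs no UI and no queryable storage, which is why the answers to the questions above matter before choosing anything heavier.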
