Discussion: How does the correlation id work? Do we need to change it?

DSpace / dspace-angular

DSpace User Interface built on Angular.io

https://wiki.lyrasis.org/display/DSDOC8x/

BSD 3-Clause "New" or "Revised" License

135 stars 434 forks source link

Discussion: How does the correlation id work? Do we need to change it? #3335

Open pnbecker opened 1 month ago

pnbecker commented 1 month ago

In https://github.com/DSpace/DSpace/pull/3303 we improved the logging of REST requests. As part of this the backend started to log a correlation id, if it was submitted in the request in an http header called X-CORRELATION-ID. It also logs the page that triggered the request against the REST API, if a uuid is submitted in a Header X-REFERRER. While the aforementioned PR implemented this in the backend, https://github.com/DSpace/dspace-angular/pull/1255 implemented it in the frontend. In https://github.com/DSpace/dspace-angular/pull/1465 the place to store the correlation id in the frontend was changed. Furthermore we have an open issue that this is not documented in the REST contract: https://github.com/DSpace/RestContract/issues/245.

During a DSpace developer meeting questions about the correlation id came up:

when and how does it change?
do we have to see this as personal information restricted by GDPR once a user has logged in? When a user logs in the correlation id is related in the dspace.log to the eperson that logged in.
Is the cookie storing the correlation id a strictly necessary cookie, a cookie that falls under the case of legitime interest (so we must provide an opt-out mechanism) or is it something a user must actively agree to? This would decide if we must provide a way to switch it off and how we must include the cookie storing the correlation id in our cookie banner.

pnbecker commented 1 month ago

I went to demo.dspace.org and looked in to the network and storage segments of the web developer tools:

a cookie storing the correlation id is set via JavaScript (there is no Set-Cookie-Header send by the backend and the frontend includes code to set the cookie). This cookie is set as a session cookie.
I was able to login as a user, log out, login as another user, reload the browser, open a request by typing in an address into the the address bar and the correlation id never changed.
The only way I found to get a new correlation id was to either restart the whole browser, or to manually delete the cookie from my browser.

In the dspace-angular source code I found the CorrelationIdService which has a method initCorrelationId. It tries to load a correlationID and generates one if it is not able to load a previously existing one. The browserInitService and the serverInitService are calling this method. I found now user usage of it.

From my understanding: we are creating the correlation id via JavaScript if we are not able to load it from the store or via Cookies. The correlation ID is stored for as long as the browser session exists. It is not being regenerated unless we loose the stored value, which happens when the browser expires the session or someone deletes the correlation id manually.

pnbecker commented 1 month ago

The correlation id might still be a case of strictly necessary cookie. We use it to investigate and track down bugs. I just don't know how GDPR defines strictly necessary cookies and tracking.

pnbecker commented 1 month ago

@abollini Can you please take a look on this? Do you have further information about the correlation id and the gdpr?

abollini commented 2 weeks ago

The correlation id is generated by angular at the first access if one is not already available, this mean that opening multiple tab or windows will share the same correlation id. The goal is to eventually have a "session id" to understand what is going on in the log.

The application can work without this cookie (or the x-correlation-id header). We have a legitimate interest in using it as it allows better monitoring of the system and simplifies the troubleshooting (not sure if this is enough to move it to an opt-out instead of an opt-in mechanism). We should avoid to log the email address at all, better to use an opaque identifier like the eperson uuid as this would make not possible to relate actions to a person just looking to the log file (but you can still do that using also the database).

My recommendation is to include the cookie in our "klaro" consent management as an optional one to help our system administrator to better understand how to platform is used and trouble shoot issues.

BTW: Our x-correlation-id and x-request-id are custom-made solutions to the problem of distributed tracking, where the world's traceId and spanId are used. It would be nice for DSpace to move to the "standard" opentelemetry approach https://opentelemetry.io/ unfortunately, we haven't had yet the time to investigate deeper

tdonohue commented 2 weeks ago

In today's DevMtg we discussed this ticket in more detail. General consensus was that there seems to be two tasks here (also summarized in @abollini's notes above):

The CORRELATION_ID cookie is not required for DSpace to function. It should be listed in our Cookie Settings popup (currently Klaro, soon to be Orejime), and it should be possible for users to "opt out" of the CORRELATION_ID cookie. While accepting this cookie can help DSpace administrators / logs, it is not required. @abollini volunteered that his team can work on this.
We need to analyze the logs to see if we are recording the Email address in the logs when a user logs into DSpace. If so, as @abollini notes above, we should stop tracking emails in the logs and instead use the EPerson UUID or similar. As noted, this means the logs alone cannot be used to "track" a single user, but can only be used in conjunction with the database or User Interface.
- In our meeting, we also had a possible option to have a configuration to enable or disable whether the email was tracked in the logs. But, I like @abollini 's suggestion better of just switching to use the EPerson UUID, as that is only usable if you also have access to the database. So, it provides a way for Sys Admins to still determine which users did things in the system (if necessary for an audit), but it stores that information in a way that it's not identifiable unless you also have database access.

paulo-graca commented 1 week ago

I don't have the required discussion context, but I've addressed a question to our GDPR department to understand if we could store emails in logs for audit purposes and I will translate it here the answer (I will omit some parts):

The GDPR does not explicitly prevent the recording of email addresses in logs, but it does establish that personal data must be collected and processed for a specific, transparent and secure purpose, and that it is only stored when necessary. For systems auditing and administration, as mentioned, email storage may be justified if it is essential to system security and auditing, as it appears it is. However, treatment must comply with the principles of minimization and safety. In practice, best practices for compliance with GDPR requirements should look at: Pseudonymization: Instead of storing emails directly in logs, you use anonymous or pseudonymized identifiers (for example, an email hash). This way, the email can only be recovered by administrators who have access to the original decoding system or database. Restricted access and internal auditing: When email addresses need to be stored in logs, it is important to ensure that access to the logs is restricted to only necessary personnel, with audit controls that record who accesses the information. Limited retention: GDPR also requires that personal data is not kept for longer than necessary. Therefore, logs that include email addresses must comply with the Institution's conservation deadlines, and be deleted after this period, unless there is legal justification for keeping them for longer. ...

Also I would like to add that we are forced by a national law, regarding data protection, to temporally store (for audit purposes) all the user's activity when dealing with personal data (like CRUD operations).

Hope it helps in this discussion!

tdonohue commented 1 week ago

@paulo-graca : Thank you for asking this back to your GDPR department. That's extremely useful information.

It also seems to align well with the idea to possibly replace the email in logs with the EPerson UUID (mentioned in second bullet of my prior comment). Replacing emails in logs with EPerson UUID would provide the "pseudonymization" that your GDPR department mentioned. It would mean the logs by themselves are "anonymous" (meaning you cannot figure out who anyone is if you only have access to the logs). But, a System Administrator could still audit the logs by matching up the EPerson UUID from the database to their activities in the logs.

So, your feedback seems to say that we are heading in the right direction. It also does verify that storing the email in the logs isn't necessary wrong. But, DSpace sites need to be aware that emails can be found in the logs, so they can treat their logs appropriately. That said, it does seem like it'd be better to switch to using the EPerson UUID instead of the email.