JasperFx / marten

.NET Transactional Document DB and Event Store on PostgreSQL
https://martendb.io
MIT License
2.79k stars, 441 forks

Data anonymization (Feature request) #2779

Open jannikbryld opened 9 months ago

jannikbryld commented 9 months ago

Being able to anonymize or delete Personally Identifiable Information (PII) is required by GDPR. Currently the best approach is to delete the entire stream with store.Advanced.Clean.DeleteSingleEventStreamAsync(streamId), because archiving the stream is not compliant: the PII is still retained.

I want to be able to anonymize a stream, so that I can retrospectively still use it for projections.

I suggest the following syntax: session.Events.AnonymizeStream(streamId);

Attribute requirements

gfoidl commented 9 months ago

> suggest the following syntax: session.Events.AnonymizeStream(streamId);

Do you suggest that the anonymizing happens when requested under GDPR? So plain information in normal use, and when requested the stream is anonymized?

That would mean you change the stream (and thus the history), which is something I don't like.

The stream should already be created with GDPR in mind, so that no rework is needed when a GDPR removal request comes in, e.g. by storing (symmetrically) encrypted PII in the stream. Once GDPR removal is requested, you just need to drop the encryption key for the specific user (and take care of backups, etc.[^1]). This technique is known as "crypto shredding".

So that falls more into the business-logic and I guess it's hard for Marten to provide a solution that works for every needed use-case here.
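To make the crypto-shredding idea concrete, here is a minimal C# sketch. It is not a Marten feature: `InMemoryKeyStore` and `PiiCipher` are hypothetical names invented for illustration, the key store is an in-memory stand-in for an external secret store, and the code assumes .NET 8 or later for the `AesGcm(key, tagSize)` constructor. Only the PII fields of an event would be encrypted this way before the event is appended.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical key store: one AES-256 key per user, kept OUTSIDE the event store
// (and outside its backups). A real system would use an external secret store.
public sealed class InMemoryKeyStore
{
    private readonly Dictionary<string, byte[]> _keys = new();

    public byte[] GetOrCreate(string userId)
    {
        if (!_keys.TryGetValue(userId, out var key))
        {
            key = RandomNumberGenerator.GetBytes(32); // 256-bit key
            _keys[userId] = key;
        }
        return key;
    }

    public byte[]? Find(string userId) =>
        _keys.TryGetValue(userId, out var key) ? key : null;

    // "Shredding": once the key is gone, the ciphertext in the stream is unreadable.
    public bool Shred(string userId) => _keys.Remove(userId);
}

// Hypothetical helper that encrypts a single PII value with AES-GCM.
public static class PiiCipher
{
    private const int NonceSize = 12;
    private const int TagSize = 16;

    public static string Encrypt(byte[] key, string plaintext)
    {
        var nonce = RandomNumberGenerator.GetBytes(NonceSize);
        var data = Encoding.UTF8.GetBytes(plaintext);
        var cipher = new byte[data.Length];
        var tag = new byte[TagSize];

        using var aes = new AesGcm(key, TagSize);
        aes.Encrypt(nonce, data, cipher, tag);

        // Stored layout: nonce | tag | ciphertext, Base64-encoded.
        var result = new byte[NonceSize + TagSize + cipher.Length];
        nonce.CopyTo(result, 0);
        tag.CopyTo(result, NonceSize);
        cipher.CopyTo(result, NonceSize + TagSize);
        return Convert.ToBase64String(result);
    }

    // Returns null when the key has been shredded, so projections can
    // fall back to a placeholder instead of failing.
    public static string? TryDecrypt(byte[]? key, string payload)
    {
        if (key is null) return null;

        var raw = Convert.FromBase64String(payload);
        var nonce = raw.AsSpan(0, NonceSize);
        var tag = raw.AsSpan(NonceSize, TagSize);
        var cipher = raw.AsSpan(NonceSize + TagSize);
        var plain = new byte[cipher.Length];

        using var aes = new AesGcm(key, TagSize);
        aes.Decrypt(nonce, cipher, tag, plain);
        return Encoding.UTF8.GetString(plain);
    }
}
```

With this shape, the events themselves never change: a projection calls `TryDecrypt` with whatever `Find` returns and gets `null` once `Shred` has been called for that user.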

[^1]: note I didn't watch the video (I never watch such videos)

jannikbryld commented 9 months ago

Thank you for the answer. That seems like a very sensible solution when you build a system from the ground up. I don't see any doc examples that take GDPR into account, which leads me to believe many other developers will end up in a similar position, where they have already created event streams containing sensitive information and have to fix this retrospectively.

I wholeheartedly agree with you that it is against event sourcing principles to ever change the event stream, but the alternative, when having to adhere to GDPR retrospectively, is to delete the event stream and lose the information permanently.

When the scenario specifically is event streams that are already populated with PII, do you have a better suggestion than the feature request here?

gfoidl commented 9 months ago

You are right with

> which leads me to believe many other developers will end in a similar position

and that there may be (lots) of data already in streams, etc.

A solution that comes to mind is:

  1. create new infrastructure that is capable of using crypto shredding (as outlined in the comment above) -- this is in addition to current use
  2. transform the event streams that contain PII into ones that are safe with regard to GDPR (by using the infra from point 1)
  3. delete the old streams
  4. use solely the new infrastructure

So you don't lose any information, as all the events are "copied" over to a new stream, but with GDPR in mind. Strictly speaking this violates event sourcing principles too, as history is changed, but I think this is a valid trade-off here.
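As a rough illustration of step 2, the transformation itself can be a pure function over the old events. The sketch below is only an assumption about how it might look: the event types `CustomerRegisteredV1` and `CustomerRegisteredV2`, the `StreamMigration` class, and the `encryptForUser` delegate are all hypothetical, and reading the old stream and appending the new one through Marten is deliberately left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical event types: V1 stores the e-mail in plain text,
// V2 stores it encrypted with a per-user key.
public record CustomerRegisteredV1(string UserId, string Email);
public record CustomerRegisteredV2(string UserId, string EncryptedEmail);

public static class StreamMigration
{
    // Maps the events of one old stream to their GDPR-safe counterparts.
    // encryptForUser(userId, plaintext) is assumed to look up the user's key
    // in a key store outside the event store and return the ciphertext.
    public static IReadOnlyList<object> ToGdprSafe(
        IEnumerable<object> oldEvents,
        Func<string, string, string> encryptForUser)
    {
        return oldEvents.Select(e => e switch
        {
            CustomerRegisteredV1 v1 =>
                (object)new CustomerRegisteredV2(v1.UserId, encryptForUser(v1.UserId, v1.Email)),
            // Events without PII are copied over unchanged.
            _ => e
        }).ToList();
    }
}
```

The result would then be appended to a new stream, and the old stream removed (step 3), for example with the `DeleteSingleEventStreamAsync` call mentioned at the top of this issue.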

Edit: BTW I use Azure KeyVault for storing secrets and keys.


Regarding your proposal maybe it would make sense to provide two options:

I think both bullets are doable. The question is how Marten should provide / integrate the necessary infrastructure, especially the "key store", which should not be part of the event store (to keep the keys out of backups, etc.). That may require lots of extension points, hooks, etc.

Thus, right now, I believe this is very specific and something that Marten shouldn't provide out of the box, to avoid bloat. But there could certainly be some assistance, even if only in the form of documentation or links to some approaches.