[Feature request] Per key cleanup for the GDPR requirements

lanwen commented 5 years ago

Is your feature request related to a problem? Please describe. We are using Pulsar for event sourcing, as the source of truth, with an indefinite number of events. Since we store some personal information in the events (like name or email) and operate in Europe, we have to be compliant with GDPR and be able to remove all the data for a specific key.

Describe the solution you'd like A good option would be to follow the same approach as topic compaction, with the only difference that just a limited set of keys is compacted: an admin tool and/or admin API that allows running compaction for a specific, limited set of keys.

Another approach would be a way to clean up the tiered storage: for example, offload the topic to S3 after some time and clean it up there. A possible way would be a filtered offload, which also handles the case where a deletion request arrives for data that has already been offloaded, so that the data is loaded and re-offloaded properly cleaned, with no traces of the old keys.

Describe alternatives you've considered Right now we are regularly migrating the whole cluster to clean up all the events that should be removed, which is not fun at all :)

Additional context Kafka doesn't provide anything like that. Also, the recommendation to encrypt the data in Pulsar and later just throw away the encryption key doesn't really work, since it's not considered compliant by some EU governments (like Germany's).

addisonj commented 5 years ago

@lanwen This is only tangentially related, but can you provide some more info on throwing away encryption keys not being compliant? We have considered that (but not yet implemented it), and it would be helpful to have those references.

lanwen commented 5 years ago

I can't point to any specific document, and I think that depends on the principal in each case, but generally speaking, some of them argue that even when the data is encrypted, it's still user data. And who knows how powerful computers will be in the coming years and how easily they could crack the chosen algorithm. So if you want the most compliant option possible, you have to really remove the data.

Zhen-hao commented 5 years ago

This is an interesting viewpoint: https://www.thesslstore.com/blog/deleting-data-for-gdpr-could-encryption-do-the-trick/

lanwen commented 4 years ago

I wanted to try out a naive implementation and found that the current compactor is quite close to what I want in this issue.

Here is the condition that could be inverted; together with a slightly changed first phase that leaves only specific keys in the map, it could do the trick: https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/compaction/TwoPhaseCompactor.java#L245

The only concern I have is that we can't just put everything into a new compacted ledger, as it's not really compacted anymore. How can that be managed better? Where can I find proper ledger management examples? What happens if I just throw everything I have into one ledger?

Also, how do I clean up the touched ledgers and ignore the untouched ones? Is it a good idea to keep some sort of set of them to clean? (That is the final goal of this issue.)

some help? @sijie
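
The idea in this comment can be modelled without touching the broker. Regular topic compaction keeps only the latest message per key; the inverted variant keeps every message except those whose key is marked for deletion, and records which ledgers contained such messages so that only those need rewriting. Below is a minimal in-memory sketch under those assumptions. `PerKeyCleanupSketch`, `Entry`, `Result` and `dropKeys` are hypothetical names for illustration and are not part of Pulsar, BookKeeper, or the `TwoPhaseCompactor` code linked above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical in-memory model; none of these types are Pulsar or BookKeeper APIs. */
final class PerKeyCleanupSketch {

    /** Stand-in for one message: the ledger it lives in, its entry id, key and payload. */
    static final class Entry {
        final long ledgerId;
        final long entryId;
        final String key;
        final String payload;

        Entry(long ledgerId, long entryId, String key, String payload) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.key = key;
            this.payload = payload;
        }
    }

    /** Outcome of one cleanup pass: surviving entries plus the ledgers that lost data. */
    static final class Result {
        final List<Entry> kept;
        final Set<Long> touchedLedgers;

        Result(List<Entry> kept, Set<Long> touchedLedgers) {
            this.kept = kept;
            this.touchedLedgers = touchedLedgers;
        }
    }

    /**
     * Keeps every entry whose key is NOT in keysToForget (the inverse of "keep only
     * the latest per key") and remembers the ledgers that contained a forgotten key.
     */
    static Result dropKeys(List<Entry> log, Set<String> keysToForget) {
        List<Entry> kept = new ArrayList<>();
        Set<Long> touched = new TreeSet<>();
        for (Entry e : log) {
            if (keysToForget.contains(e.key)) {
                touched.add(e.ledgerId); // this ledger must be rewritten and deleted
            } else {
                kept.add(e);             // untouched data is copied through as-is
            }
        }
        return new Result(kept, touched);
    }
}
```

In this model, ledgers that never appear in `touchedLedgers` could be left alone, which is one way to read the "set of ledgers to clean" question above; how that maps onto real ledger management in the broker is left open here.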

rcusters commented 3 years ago

We have a similar requirement. The issue is already a year old. @lanwen how did you solve this in the end?

lanwen commented 3 years ago

@rcusters we offload everything to a database on an event gateway (see https://github.com/bsideup/liiklus) with a custom plugin and keep only references and meta information in Pulsar. Then, if we need to delete something, we delete it from the database directly. The same could be achieved with multiple topics with different retention settings and Pulsar Functions. We don't have huge numbers yet, and a fairly naive solution of that kind works pretty well.
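
The workaround described here (personal data in a mutable store, only references in the immutable log) can be sketched as follows. This is a minimal in-memory illustration, not the liiklus plugin or any Pulsar API: `PayloadReferenceSketch` and its methods are hypothetical, a `Map` stands in for the database, and a `List` stands in for the topic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: the log stores references, a mutable store holds the data. */
final class PayloadReferenceSketch {

    // Stand-in for the database: reference id -> personal data. Entries can be deleted.
    private final Map<String, String> payloadStore = new ConcurrentHashMap<>();

    // Stand-in for the append-only topic: each event is {key, referenceId}, never rewritten.
    private final List<String[]> topic = new ArrayList<>();

    /** Stores the personal data in the mutable store and appends only a reference to the log. */
    String publish(String key, String personalData) {
        String referenceId = UUID.randomUUID().toString();
        payloadStore.put(referenceId, personalData);
        topic.add(new String[] {key, referenceId});
        return referenceId;
    }

    /** Resolves a reference; empty once the underlying data has been deleted. */
    Optional<String> resolve(String referenceId) {
        return Optional.ofNullable(payloadStore.get(referenceId));
    }

    /** Deletes all stored data for a key; the log itself is left untouched. */
    int forget(String key) {
        int removed = 0;
        for (String[] event : topic) {
            if (event[0].equals(key) && payloadStore.remove(event[1]) != null) {
                removed++;
            }
        }
        return removed;
    }

    int topicSize() {
        return topic.size();
    }
}
```

After `forget("alice")`, the events for that key are still in the log, but their references no longer resolve to anything, so consumers have to tolerate dangling references.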

eolivelli commented 3 years ago

My colleague @dlg99 recently started work on BookKeeper (the storage layer of Pulsar) to address GDPR compliance at the storage layer.

you can take a look here https://github.com/apache/bookkeeper/pull/2730

ca-simone-chiorazzo commented 1 month ago

We have a similar requirement to satisfy with Pulsar. Given that https://github.com/apache/bookkeeper/pull/2730 was closed 2 years ago, I'd like to ask: is there any suggested way to handle GDPR and the right to be forgotten with Apache Pulsar itself, without customizations?

hpvd commented 1 month ago

@ca-simone-chiorazzo a simple method that always works is:

- encrypt all relevant data
- for forgetting: delete the keys used for encryption...
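
The two steps above can be sketched with the JDK's built-in `javax.crypto` classes. This is a minimal illustration under stated assumptions, not a production design: `CryptoShreddingSketch` and its methods are hypothetical names, keys live in an in-memory map rather than a real key management service, and one AES-GCM key per data subject is assumed.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

/** Hypothetical sketch: one AES-GCM key per data subject, kept in a deletable key store. */
final class CryptoShreddingSketch {
    private static final int IV_BYTES = 12;   // 96-bit nonce, the usual size for GCM
    private static final int TAG_BITS = 128;  // authentication tag length

    // Assumption: an in-memory map stands in for a real, mutable key store.
    private final Map<String, SecretKey> keysBySubject = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    /** Encrypts with the subject's key (created on first use); returns IV followed by ciphertext. */
    byte[] encrypt(String subjectId, String plaintext) {
        try {
            SecretKey key = keysBySubject.get(subjectId);
            if (key == null) {
                KeyGenerator generator = KeyGenerator.getInstance("AES");
                generator.init(256);
                key = generator.generateKey();
                keysBySubject.put(subjectId, key);
            }
            byte[] iv = new byte[IV_BYTES];
            random.nextBytes(iv);
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
            byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
            byte[] message = new byte[IV_BYTES + ciphertext.length];
            System.arraycopy(iv, 0, message, 0, IV_BYTES);
            System.arraycopy(ciphertext, 0, message, IV_BYTES, ciphertext.length);
            return message;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("encryption failed", e);
        }
    }

    /** Decrypts a stored message; returns empty if the subject's key has been deleted. */
    Optional<String> decrypt(String subjectId, byte[] message) {
        SecretKey key = keysBySubject.get(subjectId);
        if (key == null) {
            return Optional.empty();
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
            cipher.init(Cipher.DECRYPT_MODE, key,
                    new GCMParameterSpec(TAG_BITS, message, 0, IV_BYTES));
            byte[] plain = cipher.doFinal(message, IV_BYTES, message.length - IV_BYTES);
            return Optional.of(new String(plain, StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("decryption failed", e);
        }
    }

    /** "Forgets" a subject by deleting the key; the ciphertext in the log stays where it is. */
    void forget(String subjectId) {
        keysBySubject.remove(subjectId);
    }
}
```

The encrypted bytes returned by `encrypt` are what would be published to the topic. Note that after `forget`, the ciphertext still exists in storage, which is exactly the property questioned earlier in this thread.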

lanwen commented 1 month ago

@hpvd

> Also, the recommendation to encrypt the data in the pulsar and later just throw away the encryption key doesn't really work since it's not considered as compliant by some of the EU governments (like in Germany).

hpvd commented 1 month ago

@lanwen thanks for pointing this out (I subscribed to this issue a long time ago and didn't reread it, sorry...). Do you have a source for this? It's a practice often used where other methods are not possible at all, e.g.

- when working with versioned (git-like) databases...
- or do you know another way to handle this in database backups (which are hopefully immutable)?

lanwen commented 1 month ago

We worked closely with the Berlin Data Authority back in the day. They explained that for some dangerous cases, encryption could be considered compromised over a 10-year horizon, as multiple cases with earlier algorithms show (like RSA128, SHA-1 hashing, MD5 hashing), so with unlimited computational power and access to the encrypted data, a malicious agent can still obtain the information. I can't provide documentary proof though. I suspect that for something like emails or IPs it's fine to go the encryption route, but for document information, addresses, medical activity, or anything still relevant in 10 years, it's not. The suggestion was not to use immutable data stores for PII.

hpvd commented 1 month ago

Sure, I understand this from a technical point of view.

The question remains: if you consider this need for future-proofing, how do you handle backups in general?

IMHO, not relying on immutability for backups in a world where ransomware attacks and accidental deletion happen is... hmm.

In addition, you cannot simply be brave: there is lots of data you are forced to store for a long time (10 years) in a secure and available way, e.g. in the field of tax.

lanwen commented 1 month ago

I suspect that backups are not stored for years and are eventually rotated. Also, tax-related and financial data is something different and shouldn't be removed under a GDPR request.

But it might be wise not to rely on issue comments here and to consult a lawyer for precise advice.

This issue was about the technical possibility of completely wiping a key from the topic. Until that's implemented (or even considered for implementation), the suggested workaround is to use mutable data storage for the info requiring deletion, which should work in all cases. How to handle backups and other data processing is out of scope for this issue, as there might be no limit to the possible options and edge cases.

hpvd commented 1 month ago

Yep, it's out of scope. But if you have a solution for backup handling, you may also have a solution for this issue...

apache / pulsar

[Feature request] Per key cleanup for the GDPR requirements #5059