Closed andaag closed 4 years ago
@andaag excellent questions.
Using tags.. yes, but at the same time marking the data is only part of the job; scanning the bucket and deleting the data is probably the larger challenge. I reckon the sane way to do this is to use retention on the bucket and continuously regenerate the datasets (at least the GDPR-related ones) from a source that handles GDPR itself. That way you delegate the guarantee that it works to S3/infrastructure instead of depending on a system that scans the S3 bucket(s) looking for GDPR-related data.
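The retention idea above can be expressed with a standard S3 lifecycle rule. A minimal sketch, assuming the GDPR-related datasets live under a dedicated prefix (the prefix name and the 30-day window are illustrative, not anything from Metaflow):

```json
{
  "Rules": [
    {
      "ID": "expire-gdpr-datasets",
      "Filter": { "Prefix": "datasets/gdpr/" },
      "Status": "Enabled",
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Applied via `aws s3api put-bucket-lifecycle-configuration`, this makes S3 itself delete objects older than the window, so the regeneration job only has to keep rewriting the compliant datasets.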
Thanks for the info regarding TTLs. I reckon this can be closed from my side, but a better solution would be a documentation update somewhere. I'll leave that up to you.
Closing this issue. We will update the documentation accordingly.
In case anyone runs into this: it's actually fairly trivial to add a @gdpr step before your main run that queries historical data using the client API and cleans it up, provided you've organized your data in a way that makes this possible. For example, see https://github.com/Netflix/metaflow/issues/31 .
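A minimal sketch of the "organize your data so cleanup is possible" idea: if dataset keys are partitioned by user ID, an erasure request reduces to deleting every key under the affected prefixes. All names here (the key layout, `gdpr_cleanup`) are illustrative assumptions, not Metaflow or S3 APIs.

```python
# Hypothetical key layout: "datasets/events/user=<id>/part-N.parquet".
# Deleting a user's data then means deleting everything under that prefix.

def gdpr_cleanup(keys, erased_user_ids):
    """Split keys into (to_delete, to_keep) for users requesting erasure."""
    prefixes = tuple(f"datasets/events/user={uid}/" for uid in erased_user_ids)
    to_delete = [k for k in keys if k.startswith(prefixes)]
    to_keep = [k for k in keys if not k.startswith(prefixes)]
    return to_delete, to_keep

keys = [
    "datasets/events/user=123/part-0.parquet",
    "datasets/events/user=456/part-0.parquet",
]
delete, keep = gdpr_cleanup(keys, ["123"])
print(delete)  # ['datasets/events/user=123/part-0.parquet']
```

In a real @gdpr step, `keys` would come from listing the bucket (or from a run's artifacts via the client API), and `to_delete` would be passed to a batch delete call.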
Hi
How do you deal with GDPR in the internal data stores? They are versioned and stored over time in permanent storage, and some are likely subject to GDPR.