Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

GDPR and data storage #21

Closed: andaag closed this issue 4 years ago

andaag commented 4 years ago

Hi

How do you deal with GDPR in the internal data stores? Artifacts are versioned and kept in permanent storage over time, and some of them are likely subject to GDPR.

savingoyal commented 4 years ago

@andaag excellent questions.

andaag commented 4 years ago

Using tags, yes, but marking the data is only part of the job; scanning the bucket and deleting the data is probably the larger challenge. I reckon the sane way to do this is to set a retention policy on the bucket and continuously regenerate the datasets (at least the GDPR-related ones) from a source that handles GDPR itself. That way you delegate the deletion guarantees to S3/infrastructure instead of depending on a system that scans the S3 bucket(s) looking for GDPR-related data.
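
A minimal sketch of the retention approach with boto3; the bucket name, prefix, and 30-day window are hypothetical placeholders, not anything Metaflow configures for you:

```python
# Sketch: let S3 itself expire datastore objects after a fixed window,
# so GDPR-relevant artifacts age out without a scanner job.
# "my-metaflow-bucket" and the "metaflow/" prefix are assumptions;
# point them at wherever your Metaflow datastore actually lives.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-metaflow-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-metaflow-artifacts",
                "Filter": {"Prefix": "metaflow/"},
                "Status": "Enabled",
                # Objects older than 30 days are deleted by S3 automatically.
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```

With this in place, the only remaining obligation is to make sure newly generated runs are rebuilt from a GDPR-compliant source, so deleted users never reappear in fresh artifacts.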

Thanks for the info regarding TTLs. I reckon this can be closed from my side, though a documentation update somewhere would be a better long-term fix. I'll leave that up to you.

savingoyal commented 4 years ago

Closing this issue. We will update the documentation accordingly.

andaag commented 4 years ago

In case anyone runs into this: it's actually quite simple to add a GDPR cleanup step before your main run that queries historical data using the Client API and cleans it up, provided you've organized your data in a way that makes this possible (a sketch follows below). For example, in combination with https://github.com/Netflix/metaflow/issues/31.
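
A minimal sketch of that idea, assuming a flow whose runs store a `records` artifact keyed by `user_id`; the flow name, artifact name, and the source of deletion requests are all hypothetical, not part of Metaflow's API:

```python
# Sketch: a cleanup step at the start of the flow uses the Client API to
# rebuild the dataset without records for users who requested deletion,
# so new runs never re-persist their data. Old runs age out via the
# bucket retention rule shown earlier.
from metaflow import FlowSpec, step, Flow

class MyTrainingFlow(FlowSpec):

    @step
    def start(self):
        # Hypothetical: load deletion requests from wherever your
        # right-to-erasure process records them.
        deleted_user_ids = {"user-123"}
        cleaned = []
        # Flow(...) iterates runs newest-first via the Client API.
        for run in Flow("MyTrainingFlow"):
            if not run.successful:
                continue
            records = run.data.records  # hypothetical artifact name
            cleaned = [r for r in records if r["user_id"] not in deleted_user_ids]
            break  # the latest successful run is enough here
        self.records = cleaned
        self.next(self.train)

    @step
    def train(self):
        # ... train on self.records only ...
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyTrainingFlow()
```

Combined with bucket-level retention, this gives the guarantee andaag describes: infrastructure deletes the historical copies, and the flow itself never regenerates data for deleted users.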