Netflix / metaflow

Open Source Platform for developing, scaling and deploying serious ML, AI, and data science systems
https://metaflow.org
Apache License 2.0

GDPR and data storage #21

Closed: andaag closed this issue 4 years ago

andaag commented 4 years ago

Hi

How do you deal with GDPR in the internal data stores? Artifacts are versioned and kept in permanent storage over time, and some of them are likely subject to GDPR.

savingoyal commented 4 years ago

@andaag excellent questions.

andaag commented 4 years ago

Using tags, yes, but marking the data is only part of the job; scanning the bucket and deleting the data is probably the larger challenge. I reckon the sane way to do this is to set a retention policy on the bucket and continuously regenerate the datasets (at least the GDPR-related ones) from a source that handles GDPR itself. That way you delegate the deletion guarantees to S3/infrastructure instead of depending on a system that scans the S3 bucket(s) looking for GDPR-related data.
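
A minimal sketch of the retention approach with boto3; the bucket name, prefix, and 30-day window are hypothetical placeholders, not anything Metaflow configures for you:

```python
# Sketch: let S3 itself expire datastore objects after a fixed window,
# so GDPR-relevant artifacts age out without a scanner job.
# "my-metaflow-bucket" and the "metaflow/" prefix are assumptions;
# point them at wherever your Metaflow datastore actually lives.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-metaflow-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-metaflow-artifacts",
                "Filter": {"Prefix": "metaflow/"},
                "Status": "Enabled",
                # Objects older than 30 days are deleted by S3 automatically.
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```

With this in place, the only remaining obligation is to make sure newly generated runs are rebuilt from a GDPR-compliant source, so deleted users never reappear in fresh artifacts.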

Thanks for the info regarding TTLs. I reckon this can be closed from my side, though a documentation update somewhere would be a better long-term fix. I'll leave that up to you.

savingoyal commented 4 years ago

Closing this issue. We will update the documentation accordingly.

andaag commented 4 years ago

In case anyone runs into this: it's actually quite simple to add a GDPR cleanup step before your main run that queries historical data using the Client API and cleans it up, provided you've organized your data in a way that makes this possible (a sketch follows below). For example, in combination with https://github.com/Netflix/metaflow/issues/31.
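
A minimal sketch of that idea, assuming a flow whose runs store a `records` artifact keyed by `user_id`; the flow name, artifact name, and the source of deletion requests are all hypothetical, not part of Metaflow's API:

```python
# Sketch: a cleanup step at the start of the flow uses the Client API to
# rebuild the dataset without records for users who requested deletion,
# so new runs never re-persist their data. Old runs age out via the
# bucket retention rule shown earlier.
from metaflow import FlowSpec, step, Flow

class MyTrainingFlow(FlowSpec):

    @step
    def start(self):
        # Hypothetical: load deletion requests from wherever your
        # right-to-erasure process records them.
        deleted_user_ids = {"user-123"}
        cleaned = []
        # Flow(...) iterates runs newest-first via the Client API.
        for run in Flow("MyTrainingFlow"):
            if not run.successful:
                continue
            records = run.data.records  # hypothetical artifact name
            cleaned = [r for r in records if r["user_id"] not in deleted_user_ids]
            break  # the latest successful run is enough here
        self.records = cleaned
        self.next(self.train)

    @step
    def train(self):
        # ... train on self.records only ...
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    MyTrainingFlow()
```

Combined with bucket-level retention, this gives the guarantee andaag describes: infrastructure deletes the historical copies, and the flow itself never regenerates data for deleted users.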