galv / lingvo-copy

Apache License 2.0

Provide a way to request takedown of particular data #17

Open galv opened 3 years ago

galv commented 3 years ago

It is possible that some of our data is mislabeled as CC-BY or CC-0. In addition, it is possible that a creator may not have intended for their work to belong in a machine learning dataset.

We should provide a way to submit a "takedown" request.

Let's do a manual process for now.

A simple takedown@mlcommons.org email address would suffice. The instructions ought to state that the requester needs to provide the "primary key" for the data in question.

On the backend, we typically have two sources of truth: the audio data and transcripts, stored in a blob store, and the metadata, stored in a much more structured format.

Spark SQL should be able to easily delete the metadata for a given record. gsutil rm works for deleting the blob data.
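
To make the manual process concrete, here is a rough Python sketch of what handling one request might look like. It only builds the Spark SQL statement and the gsutil rm commands so an operator can review them before running anything. The table name `metadata`, the key column `id`, the key format, and the example bucket path are all hypothetical placeholders, not our actual schema or layout. It also assumes the metadata table is stored in a format where Spark SQL supports `DELETE FROM` (for example a v2 table format); for plain Parquet tables we would probably have to filter out the record and rewrite the table instead.

```python
import re
from dataclasses import dataclass

# Assumed key format (hypothetical): letters, digits, '_', '.', and '-'.
# Rejecting everything else keeps the key safe to embed in a SQL literal.
_KEY_PATTERN = re.compile(r"[A-Za-z0-9_.\-]+")


@dataclass(frozen=True)
class TakedownPlan:
    metadata_sql: str    # statement to run through Spark SQL
    blob_commands: list  # one argv list per blob, e.g. for subprocess.run


def plan_takedown(primary_key, blob_urls, table="metadata", key_column="id"):
    """Build (but do not execute) the deletions for one takedown request."""
    if not _KEY_PATTERN.fullmatch(primary_key):
        raise ValueError(f"unexpected primary key format: {primary_key!r}")
    for url in blob_urls:
        if not url.startswith("gs://"):
            raise ValueError(f"not a Cloud Storage URL: {url!r}")

    sql = f"DELETE FROM {table} WHERE {key_column} = '{primary_key}'"
    commands = [["gsutil", "rm", url] for url in blob_urls]
    return TakedownPlan(metadata_sql=sql, blob_commands=commands)


if __name__ == "__main__":
    # Example values are made up for illustration.
    plan = plan_takedown("abc123", ["gs://example-bucket/audio/abc123.mp3"])
    print(plan.metadata_sql)
    for argv in plan.blob_commands:
        print(" ".join(argv))
```

Keeping the planning step separate from execution means the person handling the email can sanity-check exactly what will be deleted, which seems appropriate while the process is still manual.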