datacontract / datacontract-specification

The Data Contract Specification Repository
https://datacontract.com/
MIT License

Data Classification - extended #68

Closed: pixie79 closed this issue 6 months ago

pixie79 commented 6 months ago

While the current specification includes PII (Personally Identifiable Information) and classification designators, we should extend the design to include more detailed actions and attributes related to PII handling and compliance. Here are some suggestions for enhancing the schema:

PII Handling Attributes:

{
  "fields": [
    {
      "Name": "field_name",
      "Type": "data_type",
      "PII": {
        "IsPii": true,
        "Method": "MASKED",
        "Description": "For a name, replace with 5 asterisks",
        "AnaliticsLinkedEntity": "Linked version of the entity in this record that is used for analytics premasked",
        "RTBF": {
          "Eligible": true,
          "Method": "DELETE",
          "Lifetime": 300,
          "Description": "RTBF applies 300 days after last customer activity"
        }
      },
      "Classification": "classification_string"
    }
  ]
}

Schema Explanation

This structure ensures all necessary details are captured and provides a robust framework for managing PII and compliance with data protection regulations.
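
To make the "Method": "MASKED" attribute concrete, here is a minimal Python sketch of how a consumer might act on it. It is only an illustration: the field name customer_name, the helper apply_pii_method, and the reading of "MASKED" as a fixed 5-asterisk replacement are assumptions, not part of the specification.

```python
# Hypothetical field definition mirroring the proposed JSON shape.
FIELD = {
    "Name": "customer_name",
    "Type": "string",
    "PII": {
        "IsPii": True,
        "Method": "MASKED",
        "Description": "For a name, replace with 5 asterisks",
    },
    "Classification": "sensitive",
}

# Assumption: "MASKED" means replacing the whole value with 5 asterisks.
MASK = "*" * 5


def apply_pii_method(field: dict, value: str) -> str:
    """Return the value as a reader without PII access might see it."""
    pii = field.get("PII", {})
    if not pii.get("IsPii"):
        # Fields not flagged as PII pass through unchanged.
        return value
    if pii.get("Method") == "MASKED":
        return MASK
    # Refuse to guess for methods this sketch does not know about.
    raise ValueError(f"unsupported PII method: {pii.get('Method')!r}")


print(apply_pii_method(FIELD, "Alice Example"))
```

Raising on an unknown method, rather than returning the raw value, keeps an unrecognised setting from silently exposing PII.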

pixie79 commented 6 months ago

This should probably apply mostly at the entity level. It may also be worth allowing a whole dataset to be defined as PII, but in general I believe this belongs at the entity level, especially once you get to Right to be Forgotten: some data can be forgotten straight away, while other information needs to be kept longer.

E.g., if you keep the last 5 addresses of a customer, then at the point they stop being a customer, even if you need to retain records for legal reasons, you probably have no grounds to keep more than the current address.
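
A rough Python sketch of why per-entity RTBF settings matter: each field carries its own lifetime, so the address history can be dropped immediately while the current address is kept longer. The field names, the lifetimes other than 300, and the helper fields_due_for_rtbf are hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta

# Hypothetical per-field RTBF policies. "Lifetime" is taken to mean days
# after the customer's last activity, as in the "Lifetime": 300 example.
POLICIES = {
    "previous_addresses": {"Eligible": True, "Method": "DELETE", "Lifetime": 0},
    "current_address": {"Eligible": True, "Method": "DELETE", "Lifetime": 300},
    "order_id": {"Eligible": False},
}


def fields_due_for_rtbf(policies: dict, last_activity: date, today: date) -> list:
    """List the fields whose RTBF lifetime has elapsed as of `today`."""
    due = []
    for name, policy in policies.items():
        if not policy.get("Eligible"):
            continue  # not subject to RTBF at all
        if today >= last_activity + timedelta(days=policy["Lifetime"]):
            due.append(name)
    return due


last_seen = date(2024, 1, 1)
print(fields_due_for_rtbf(POLICIES, last_seen, date(2024, 1, 31)))
print(fields_due_for_rtbf(POLICIES, last_seen, date(2025, 1, 1)))
```

With a single dataset-level setting, both address fields would have to share one lifetime, which is the case the entity-level design avoids.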

jochenchrist commented 6 months ago

Thanks, @pixie79, for this contribution.

I agree that PII and RTBF are becoming more and more important, and that data contracts should be the source of truth for configuring these attributes on individual fields or models.

They also play nicely with some Policy as Code engines.

At the moment, however, I do not feel confident enough to define a generic fieldset that fits these engines. So, for now, I'd propose to go with config fields.

E.g.:

fields:
  customer_email_address:
    description: The email address, as entered by the customer. The email address was not verified.
    type: text
    format: email
    pii: true
    classification: sensitive
    config:
      pii:
        method: MASKED
        description: For a name, replace with 5 asterisks
        analiticsLinkedEntity: Linked version of the entity in this record that is used for analytics premasked
        rightToBeForgottonPolicy:
          eligible: true
          method: DELETE
          lifetime: 300
          description: RTBF applies 300 days after last customer activity
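
Since config is free-form, a consumer would probably need to read these keys defensively. Below is a minimal Python sketch, assuming the contract has already been parsed into a dict; the helper pii_handling and the shape of its return value are hypothetical, and the key names are copied as written in the example above.

```python
# The example field above as a parsed dict (descriptions omitted).
FIELD = {
    "type": "text",
    "format": "email",
    "pii": True,
    "classification": "sensitive",
    "config": {
        "pii": {
            "method": "MASKED",
            "rightToBeForgottonPolicy": {
                "eligible": True,
                "method": "DELETE",
                "lifetime": 300,
            },
        },
    },
}


def pii_handling(field: dict):
    """Summarise PII handling for a field, or None if it is not PII."""
    if not field.get("pii"):
        return None
    # config and its nested keys are optional, so fall back to empty dicts.
    settings = field.get("config", {}).get("pii", {})
    rtbf = settings.get("rightToBeForgottonPolicy", {})
    return {
        "method": settings.get("method"),
        "rtbf_eligible": rtbf.get("eligible", False),
        "rtbf_lifetime_days": rtbf.get("lifetime"),
    }


print(pii_handling(FIELD))
```

A field that sets pii: true without any config block still yields a result here, with method set to None, so a policy engine can flag it as missing handling instructions.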

pixie79 commented 6 months ago

I think that makes sense and meets what I was looking for from it.