CGI-FR / PIMO

Private Input Masked Output - PIMO is a tool for data masking (anonymization, pseudonymization, ...).
https://cgi-fr.github.io/lino-doc/
GNU General Public License v3.0

[PROPOSAL] External masks library #240

Open adrienaury opened 1 year ago

adrienaury commented 1 year ago

Definitions

A masking definition contains the following parts: a generator, a coherence and a location.

The generator is usually defined by the mask part of the masking.yml, except for the "hash" and "hashInUri" masks, which also contain a coherence element.

The coherence is usually defined by properties added to the mask: seed, cache, or the hash part of the "hash" and "hashInUri" masks.

The location is defined by the selector part.

Only the generator part needs to be stored in a masking library. When the generator is applied in a given context, we choose where to apply it (selector) and how to handle consistency (cache, seed, hash, and which source field is used).

Note: we can allow coherence information in some dedicated masks.

Note: we can allow selector information when a generator outputs multiple fields.
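
As a minimal sketch of how the three parts map onto one masking entry, the fragment below annotates each of them. It only uses syntax that already appears in this issue; the field names "name" and "id" are illustrative.

```yaml
version: "1"
masking:
  - selector:
      jsonpath: "name"                     # location: where the value is written
    mask:
      randomChoiceInUri: "pimo://nameFR"   # generator: how the value is produced
    seed:
      field: "id"                          # coherence: the same id yields the same name
```

Only the `mask` part would be stored in the library; `selector` and `seed` stay with the caller.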

Examples

This generator :

- randomChoiceInUri: "pimo://nameFR"

Can be used in different contexts:

# synthesize new data :
- selector:
    jsonpath: "name1"
  masks:
    - add: ""
    - randomChoiceInUri: "pimo://nameFR"

# synthesize new data consistently with another field:
- selector:
    jsonpath: "name2"
  masks:
    - add: ""
    - randomChoiceInUri: "pimo://nameFR"
  seed:
    field: "id"

# pseudonymize consistently with another field:
- selector:
    jsonpath: "name3"
  mask:
    randomChoiceInUri: "pimo://nameFR"
  seed:
    field: "id"

...

How to define a mask library

The library should expose a variety of data types

This can be done by storing one file per data type, containing the list of masks to apply.

filename : person_name_fr_FR.yml

version: "1"
masking:
- selector:
    jsonpath: "."
  mask:
    randomChoiceInUri: "pimo://nameFR"

It's similar to a normal masking, except for the "." jsonpath, which writes to the current location in the JSON stream (where the mask is applied).
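
To illustrate how "." would resolve, here is a sketch of a caller. It assumes the `generate` / `using` mask proposed further down in this issue, which is not an existing PIMO feature; the field name "name" is illustrative.

```yaml
# masking.yml (caller), proposed syntax only
version: "1"
masking:
  - selector:
      jsonpath: "name"                  # "." inside person_name_fr_FR.yml would refer to this field
    mask:
      generate:
        using: "person_name_fr_FR"      # library file name, without the .yml extension
```

Under that assumption, the library file never needs to know the name of the field it is applied to.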

Some generators can take parameters

filename : nir.yml

masking:
  - selector:
      jsonpath: "gender"         # if present, gender is used as a parameter
    masks:
      - add: true                # add the parameter if not present
      - randomChoice: [1, 2]
    preserve: "value"            # preserve the parameter value if present
# other parameters ...
  - selector:
      jsonpath: "nir"
    masks:
      - add: true  #in this example, the result will be created in a new subfield
      - template: '{{if eq .gender "M" }}1{{else}}2{{end}}{{.birth_date | substr 8 10}}{{.birth_date | substr 3 5}}{{.department_code | printf "%02d"}}{{.city_code | printf "%03d"}}{{.order | printf "%03d"}}'
      - template: '{{ sub 97 (mod (int64 .nir_start)  97)}}'

How to use masks library

The library can be a folder, a git repository, a website, ...

A new property needs to be created in the masking.yml to load the library.

version: "1"
libraries:
- "http://domain.org/mylibrary"
- "pimo://internal-library"
- "https+git://github.com/repo/library.git@v0.1.0"
- "file://mylocalibrary"

A mask from the library can then be used via a new type of mask:

- selector:
    jsonpath: "nir"
  mask:
    generate:
      using: "nir" # name of the yaml file in the library

Passing parameters : option 1

- selector:
    jsonpath: "nir"
  mask:
    generate:
      using: "nir" # name of the yaml file in the library
      with:
        gender: "M"

or, if we want to use an existing field as parameter

- selector:
    jsonpath: "nir"
  mask:
    generate:
      using: "nir"
      with:
        gender: { from: "gender" }

Passing parameters : option 2

# precreate a param with a value
- selector:
    jsonpath: "gender"
  mask:
    constant: "M"
# call mask on the current document (selector: ".")
- selector:
    jsonpath: "."
  mask:
    generate:
      using: "nir" # name of the yaml file in the library

youen commented 1 year ago

Some suggestions:

In this context, "generator" is a list, so I suggest using the plural form:

generators:
-
- 

Or is it the generator (singular) that is defined with a list of masks?

Make the git support explicit in the URL scheme:

version: "1"
load:
- "http://domain.org/mylibrary"
- "https+git://github.com/repo/library.git@v0.1.0"
- "file://mylocalibrary"

youen commented 1 year ago

To embed the generator in a binary and expose it using the "pimo://" scheme, consider the following example:

version: "1"
load:
- "http://domain.org/mylibrary"
- "https+git://github.com/repo/library.git@v0.1.0"
- "file://mylocalibrary"
- "pimo://embedded_generator"

This way, you can include the generator within the pimo binary and access it using the "pimo://" scheme.

adrienaury commented 1 year ago

Or is it the generator (singular) that is defined with a list of masks?

Yes, the generator is defined by the whole list.

adrienaury commented 1 year ago

Note: first post updated

A generator could also be defined like this

filename : nir.yml

masking:
  - selector:
      jsonpath: "gender"         # if present, gender is used as a parameter
    masks:
      - add: true                # add the parameter if not present
      - randomChoice: [1, 2]
    preserve: "value"            # preserve the parameter value if present
# other parameters ...
  - selector:
      jsonpath: "nir"
    masks:
      - add: true  #in this example, the result will be created in a new subfield
      - template: '{{if eq .gender "M" }}1{{else}}2{{end}}{{.birth_date | substr 8 10}}{{.birth_date | substr 3 5}}{{.department_code | printf "%02d"}}{{.city_code | printf "%03d"}}{{.order | printf "%03d"}}'
      - template: '{{ sub 97 (mod (int64 .nir_start)  97)}}'

This is a normal masking definition except for the preserve "value" option that does not exist yet.

The call to the generator :

- selector:
    jsonpath: "nir"
  mask:
    generate:
      using: "nir"
      with:
        gender:  # this field is of type MaskType
          - constant: 2

MaskType : https://github.com/CGI-FR/PIMO/blob/8daf79d7b9b389444b730aa8d2332c730cf6bf64/pkg/model/model.go#L167
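
Because the parameter would accept mask definitions, it would not be limited to constants. As a sketch under that assumption, the call below derives `gender` from another field of the input with a `template` mask. The field `civility` and its value "Mme" are hypothetical, and `generate` / `with` remain proposed syntax.

```yaml
- selector:
    jsonpath: "nir"
  mask:
    generate:
      using: "nir"
      with:
        gender:
          # hypothetical input field "civility": maps "Mme" to 2, anything else to 1
          - template: '{{if eq .civility "Mme"}}2{{else}}1{{end}}'
```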

This way, a generator can use other generators, for example:

person.yml

version: "1"
masking:
  - selector:
      jsonpath: "first_name"
    masks:
      - add: true
      - generate:
          using: "first_name_fr_FR"
  - selector:
      jsonpath: "last_name"
    masks:
      - add: true
      - generate:
          using: "last_name_fr_FR"
  - selector:
      jsonpath: "." # generate in the current document
    masks:
      - add: true
      - generate:
          using: "nir"

person-with-coherence.yml

version: "1"
masking:
  - selector:
      jsonpath: "."
    masks:
      - add: true
      - generate:
          using: "person"
    seed: "."
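
A caller could then use the composed generator like any other one. The fragment below is only a sketch of the proposed syntax; the field name "customer" is hypothetical, and how `seed: "."` would be evaluated is left open by this issue.

```yaml
- selector:
    jsonpath: "customer"                  # hypothetical field that receives the generated person
  masks:
    - add: true                           # create the field if it does not exist
    - generate:
        using: "person-with-coherence"    # wraps "person", which itself uses other generators
```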