Open tcouch opened 2 years ago
Example data_classification_config.yml
---
- name: open_generate_new
format: markdown
text: |
Will the research generate (including by selecting, or sorting
or combining) any [personal data](#personal_data)?
guidance: |
Generating personal data means creating any **new** personal data,
regardless of what data you're starting with. For example:
* Linking diseases to particular patients
* Spotting trends in tweets mentioning a particular individual
yes: substantial_threat
no: closed_personal
- name: closed_personal
format: markdown
text: |
Will any project input be [personal data](#personal_data)?
guidance: |
Will you be using any personal data at all throughout the research,
even if it's publicly available? For example:
* Personal data such as newspaper articles on celebrities,
or details of patients in a medical trial
* Facebook or Twitter posts
yes: public_and_open
no: include_commercial
- name: public_and_open
format: markdown
text: |
Is that [personal data](#personal_data) legally accessible
by the general public with no restrictions on use?
guidance: |
Data is legally accessible if it is **not** behind a
paywall, can be accessed without having to request it, and has
**no** conditions on its use. For example:
* Voter registration records are **not** available, as they have
restrictions on access and use
* Academic articles that are not open access, and so require a subscription
to a journal or a request for access from the author, are **not**
legally accessible
yes: include_commercial
no: no_reidentify
- name: no_reidentify
format: markdown
text: |
Is that [personal data](#personal_data) [pseudonymized](#pseudonymized_data)?
guidance: |
Have steps been taken so that information can no longer be directly
linked to a particular individual, without additional information?
For example:
* Replacing names of patients with patient ID numbers
* Customer records with all name and address details removed
yes: no_reidentify_absolute
no: substantial_threat
- name: no_reidentify_absolute
format: markdown
text: |
Do you have absolute confidence that it is not possible to identify
individuals from the data, either at the point of entry or as a
result of any analysis that may be carried out?
guidance: |
Any data pseudonymised to this degree cannot be connected back to
individuals through analysis, even in combination with other datasets.
For example:
* Research results with generated fake names, where the
pseudonymisation key is deleted, never to be used again
* Anonymous responses to a public survey without any identifying
information
yes: include_commercial
no: no_reidentify_strong
- name: include_commercial
format: markdown
text: |
Will you be working with [commercial-in-confidence
information](#commercial_data) or private third-party intellectual property, or
legally or politically sensitive data?
guidance: |
This is any information that the data provider would not be
comfortable with you publishing, including purchased or requested data.
For example:
* Pay-to-view news articles are private third-party intellectual
property
* Plans for marketing campaigns, or purchasing strategies, for
companies, are commercial-in-confidence data
yes: financial_low
no: open_publication
- name: open_publication
format: markdown
text: |
Will releasing any of the datasets or results impact on the competitive
advantage of the research team?
guidance: |
This includes data that may be planned for publication in the future,
or could be published without any issue, but is not yet publicly
available. For example:
* Results from a study, that a research team hopes to submit to
Nature.
* Visualisations of existing publicly available data
yes: tier_1
no: tier_0
- name: substantial_threat
format: markdown
text: |
Would disclosure pose a substantial threat to the personal safety,
health or security of the data subjects?
guidance: |
Could this data be used to blackmail, target or persecute individuals?
Is it likely that motivated teams might try to access this data
illegally? For example:
* Linking location data to members of a controversial group
* Information on the sexuality of individuals in a region where this
may lead to arrest or abuse
yes: tier_4
no: tier_3
- name: financial_low
format: markdown
text: |
Do you have high confidence that the commercial, legal, reputational or
political consequences of unauthorised disclosure of this data will be
low?
guidance: |
Is there **no risk** that the reputation of the
researcher or data provider will be damaged by this data being made
public, or that legal action can be taken as a result? For example:
* Financial reports that an organisation sells to businesses for
commercial profit
* Anonymised non-controversial user research
yes: publishable
no: sophisticated_attack
- name: publishable
format: markdown
text: |
Do you have high confidence that the commercial, legal, reputational or
political consequences of unauthorised disclosure of this data will be
so low as to be trivial?
guidance: |
Would the data providers be prepared to release their data
(accidentally or deliberately)? For example:
* Results that a data provider has indicated they are happy to go
into a research publication
* Fully anonymised data on trends not linked to a company or
commercial interests
yes: tier_1
no: tier_2
- name: no_reidentify_strong
format: markdown
text: |
Do you have strong confidence that it is not possible to identify
individuals from the data, either at the point of entry or as a
result of any analysis that may be carried out?
guidance: |
Any data pseudonymised to this degree cannot be connected with
individuals, unless combined with data not publicly available, **or**
the effort required to de-pseudonymise would be too high to
be feasible for a person acting on their own. For example:
* Medical test results with generated fake names, where only the
pseudonymisation key can be used to identify the patients in this one
study
* Anonymous responses to a public survey, where questions may lead
to identifying information in combination with purchasable IP address
data
yes: include_commercial_personal
no: sophisticated_attack
- name: include_commercial_personal
format: markdown
text: |
Will you also be working with [commercial-in-confidence
information](#commercial_data) or private third-party intellectual property, or
legally or politically sensitive data?
guidance: |
This is any information that the data provider would not be comfortable
with you publishing, including purchased or requested data.
For example:
* Pay-to-view news articles are private third-party intellectual
property
* Plans for marketing campaigns, or purchasing strategies, for
companies, are commercial-in-confidence data
yes: financial_low_personal
no: tier_2
- name: financial_low_personal
format: markdown
text: |
Do you have high confidence that the commercial, legal, reputational
or political consequences of unauthorised disclosure of this data will
be low?
guidance: |
Is there **no risk** that the reputation of the
researcher or data provider will be damaged by this data being made
public, or that legal action can be taken as a result? For example:
* Financial reports that an organisation sells to businesses for
commercial profit
* Anonymised non-controversial user research
yes: tier_2
no: sophisticated_attack
- name: sophisticated_attack
format: markdown
text: |
Do likely attackers include sophisticated, well-resourced and determined
threats, such as highly capable serious organised crime groups and state
actors?
guidance: |
Could this data be used to blackmail, target or persecute individuals?
For example:
* Linking location data to members of a controversial group
* Information on the sexuality of individuals in a region where this
may lead to arrest or abuse
yes: tier_4
no: tier_3
- name: commercial_data
format: markdown
guidance: |
**Commercial-in-confidence data** is information which,
if disclosed, may result in damage to a party’s commercial interest,
intellectual property, or trade secrets.
- name: personal_data
format: markdown
guidance: |
**Personal data** is any information relating to an
identified or identifiable [living individual](#living_individual); an
'identifiable' living individual is one who can be
identified, directly or indirectly, in particular by reference to an
identifier such as a name, an identification number, location data,
an online identifier or to one or more factors specific to the physical,
physiological, genetic, mental, economic, cultural or social identity
of that natural person.
The term 'indirectly' here indicates that this includes data where
identification is made possible by combining one or more sets of data,
including synthetic data or trained models.
- name: pseudonymized_data
format: markdown
guidance: |
**Pseudonymised data** is personal data that has been
processed in such a manner that it can no longer be attributed to a
specific living individual without the use of additional information,
which is kept separately and subject to technical and organisational
measures that ensure that the personal data are not attributed to an
identified or identifiable living individual.
Two important things to note are that pseudonymised data:
* is still personal data - it becomes anonymised data, and is no
longer personal data, only if *both* the key data connecting
pseudonyms to real numbers is securely destroyed, *and* no
other data exists in the world which could be used statistically to
re-identify individuals from the data
* depending on the method used, it normally includes synthetic data
and models that have been trained on personal data. Expert review is
needed to determine the degree to which such datasets could allow
individuals to be identified.
It is important that both researchers and Dataset Providers consider
the level of confidence they have in the likelihood of identifying
individuals from data. Anonymised data is data which under no
circumstances can be used to identify an individual, and this is less
common than many realise ([Rocher et al.,
2019](https://doi.org/10.1038/s41467-019-10933-3)).
Our model specifies three levels of confidence that classifiers can
have about the likelihood of reidentification, with each pointing to a
different tier - absolute confidence, where no doubt is involved, strong
confidence, or weak confidence. Classifiers should give sufficient
thought to this question to ensure they are classifying data to the
appropriate sensitivity.
- name: living_individual
format: markdown
guidance: |
A **living individual** is an individual for whom you do
not have reasonable evidence that they are deceased. If you’re unsure if
the data subject is alive or dead, assume they have a lifespan of 100
years and act accordingly. If you’re unsure of their age, assume 16 for
any adult and 0 for any child, unless you have contextual evidence that
allows you to make a reasonable assumption otherwise ([National Archives,
2018](https://www.nationalarchives.gov.uk/documents/information-management/guide-to-archiving-personal-data.pdf)).
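We can validate that the questions form a DAG with graphlib.TopologicalSorter:
import graphlib
import yaml

with open('data_classification_config.yml') as f:
    questions = yaml.safe_load(f)

ts = graphlib.TopologicalSorter()
for q in questions:
    # PyYAML (YAML 1.1) loads the unquoted keys `yes` and `no` as True and False.
    # Glossary entries have neither key, so drop the resulting None values.
    predecessors = [p for p in (q.get(True), q.get(False)) if p is not None]
    ts.add(q['name'], *predecessors)
# static_order() raises graphlib.CycleError if the questions contain a cycle
list(ts.static_order())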
We'd need to pull out "tier_N" predecessors, and go through the list backwards to add questions in reverse order.
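As a rough sketch of that ordering step (not a finished implementation): the function name `import_order` and the three simplified questions below are made up for illustration, and the dicts use `True`/`False` keys only because that is what PyYAML produces for unquoted `yes`/`no`.

```python
import graphlib

# Hypothetical, simplified questions in the shape PyYAML would return them.
# The unquoted YAML keys `yes` and `no` arrive as the booleans True and False.
questions = [
    {"name": "open_generate_new", True: "substantial_threat", False: "closed_personal"},
    {"name": "closed_personal", True: "tier_1", False: "tier_0"},
    {"name": "substantial_threat", True: "tier_4", False: "tier_3"},
]


def import_order(questions):
    """Return question names so that every answer target precedes its question."""
    ts = graphlib.TopologicalSorter()
    for q in questions:
        targets = [t for t in (q.get(True), q.get(False)) if t is not None]
        ts.add(q["name"], *targets)
    # static_order() yields targets before the questions that point at them,
    # and raises graphlib.CycleError if the questions do not form a DAG.
    ordered = ts.static_order()
    # Tiers are outcomes rather than questions, so leave them out.
    return [name for name in ordered if not name.startswith("tier_")]


order = import_order(questions)
print(order)
```

Because the yes/no targets are registered as predecessors, the entry question comes out last, so each question can be created after the questions it points to already exist.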
We can allow people to write the guidance in markdown like this:
- name: markdown_example
guidance: |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim
veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea
commodo consequat.
This guidance has been written in [markdown](https://en.wikipedia.org/wiki/Markdown).
It includes a list:
* Item 1
* Item 2
* Item 3 is **very important**
And we can link to internal guidance like [personal_data](#personal_data) too.
Then convert it into html like this:
import markdown
html = markdown.markdown(questions[18]['guidance']).replace("\n","")
Producing:
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.</p><p>This guidance has been written in <a href="https://en.wikipedia.org/wiki/Markdown">markdown</a>.It includes a list:</p><ul><li>Item 1</li><li>Item 2</li><li>Item 3 is <strong>very important</strong></li></ul><p>And we can link to internal guidance like <a href="#personal_data">personal_data</a> too.</p>
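Since the internal links only work if the anchor matches the name of another entry, it may be worth checking them at import time. A minimal sketch, assuming anchors are always written as `](#entry_name)`; the helper `broken_internal_links` and the sample entries are hypothetical:

```python
import re

# Matches markdown links to in-page anchors, e.g. [personal data](#personal_data)
INTERNAL_LINK = re.compile(r"\]\(#([A-Za-z0-9_]+)\)")


def broken_internal_links(questions):
    """Map each entry name to the anchors it links to that match no entry name."""
    names = {q["name"] for q in questions}
    broken = {}
    for q in questions:
        # Links can appear in either the question text or its guidance.
        body = q.get("text", "") + "\n" + q.get("guidance", "")
        missing = sorted(set(INTERNAL_LINK.findall(body)) - names)
        if missing:
            broken[q["name"]] = missing
    return broken


# Hypothetical entries: one link resolves, the other does not.
questions = [
    {"name": "markdown_example",
     "guidance": "See [personal_data](#personal_data) and [other](#not_defined)."},
    {"name": "personal_data", "guidance": "Glossary entry."},
]
result = broken_internal_links(questions)
print(result)
```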
It might be a good idea to have the app automatically import questions from somewhere like config/default-questions.yml
when it's built so someone who just wants to spin up the app and try it out has something to work with straight away.
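If the import runs automatically at build time, a typo in a `yes`/`no` target would otherwise only surface when someone reaches that question. Here is a sketch of a check the import step could run first. It assumes the outcomes are exactly `tier_0` to `tier_4` (the ones used in the example config), and `dangling_targets` is a made-up helper name:

```python
# Assumption: the only valid outcomes are the five tiers used in the example config.
TIERS = {f"tier_{n}" for n in range(5)}


def dangling_targets(questions):
    """List (question, answer, target) triples whose target is not defined anywhere."""
    known = {q["name"] for q in questions} | TIERS
    problems = []
    for q in questions:
        # PyYAML loads the unquoted keys `yes` and `no` as True and False.
        for answer in (True, False):
            target = q.get(answer)
            if target is not None and target not in known:
                problems.append((q["name"], "yes" if answer else "no", target))
    return problems


# Hypothetical question set with a misspelt `no` target.
questions = [
    {"name": "open_publication", True: "tier_1", False: "tier_0"},
    {"name": "publishable", True: "tier_1", False: "tier_22"},
]
problems = dangling_targets(questions)
print(problems)
```

An empty list means every answer leads either to another question or to a tier, so the import can go ahead.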
@tcouch how do you see this relating to #408? If they are connected, which should come first?
@DavidBeavan I think it'd be easier to develop the code for supporting different question sets (as in #408) first. The processes that let system managers write, manage and import question sets could then be treated as an extension to that.
One thing we might want to look at is having a way to write the data classification questions in a more readable format such as a yaml config file and allow these to be imported.
Update
Automatically import questions when app is built (possibly via migration)