18F / privacy-tools

GSA PII Dashboard
https://cg-9341b8ea-025c-4fe2-aa6c-850edbebc499.app.cloud.gov/site/18f/privacy-dashboard/
MIT License

Research: What is the simplest way to get PII from the list of Categories of Records? #7

Open ondrae opened 4 years ago

ondrae commented 4 years ago

What: Research: What is the simplest way to get PII from the list of Categories of Records?

Depends on:

#5 and https://github.com/18F/privacy-tools/issues/6

Why: If our assumptions about OMB Circular A-130 are correct, we need a repeatable way to turn the categories of records in PIAs and SORNs into an inventory of PII. Something we can turn into code would be best. Instructions for doing it by hand would work too, but would be less enticing for a new agency wanting to use our service.

What: One example approach: find an official NIST or GSA list of PII, compare it against the categories of records, and keep only the matching PII. If no official list exists, make your own, using your expertise to decide what is PII and what isn’t. The GSA privacy office would probably love to help. They could maybe even do it for you?!?

Try to avoid anything complicated, like combinations of records that become PII.
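A minimal sketch of the list-comparison idea, assuming a hand-made reference list. `PII_TERMS`, `tokenize`, `contains_phrase`, and `find_pii` are hypothetical names used only for illustration; a real list would come from NIST, GSA, or the privacy office. It does whole-phrase matching only and deliberately ignores combinations of records that become PII.

```python
import re

# Hypothetical reference list of PII terms (an assumption, not an official list).
PII_TERMS = [
    "name",
    "social security number",
    "date of birth",
    "home address",
    "email address",
    "phone number",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(tokens, phrase):
    """Return True if `phrase` appears as a contiguous run inside `tokens`."""
    n = len(phrase)
    if n == 0:
        return False
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))


def find_pii(categories, pii_terms=PII_TERMS):
    """Keep only the reference PII terms that appear in some category of records."""
    tokenized = [tokenize(category) for category in categories]
    return [
        term
        for term in pii_terms
        if any(contains_phrase(tokens, tokenize(term)) for tokens in tokenized)
    ]


categories_of_records = [
    "Employee name and home address",
    "Social Security Number (SSN)",
    "Office building number",
]
inventory = find_pii(categories_of_records)
```

Matching on whole words rather than substrings keeps a term like "name" from matching inside an unrelated word such as "username".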

Acceptance: We will have an understanding of the suggested approach. The partners have agreed to this approach.

nikzei commented 4 years ago

@nikzei and @peterrowland to pair on tightening up AC.

peterrowland commented 4 years ago

Two examples of previous projects that used natural language processing to categorize text data into consistent categories.

https://github.com/GSA/calc/pull/997 TL;DR: This is a clever method for broader term matching. It filters out words that are uncommon in the dataset, then tries different combinations of the remaining terms to look for a category match. It requires more data, but stripping out uncommon words and trying different combinations may be worth considering as a way to do more generalized matching.
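The idea described above could be sketched roughly as follows. This is not the code from the CALC pull request; `build_vocab`, `match_category`, the `min_count` threshold, and the sample data are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_vocab(known_labels, min_count=2):
    """Collect the words that occur at least `min_count` times across known labels."""
    counts = Counter(word for label in known_labels for word in label.lower().split())
    return {word for word, count in counts.items() if count >= min_count}


def match_category(text, categories, vocab):
    """Drop uncommon words, then try combinations of the rest, longest first.

    `categories` maps a lowercase phrase to its canonical category name.
    Returns the first canonical name found, or None if nothing matches.
    """
    words = [word for word in text.lower().split() if word in vocab]
    for size in range(len(words), 0, -1):
        for combo in combinations(words, size):
            candidate = " ".join(combo)
            if candidate in categories:
                return categories[candidate]
    return None


# Hypothetical sample data.
known_labels = [
    "home address",
    "home phone number",
    "work phone number",
    "work address",
    "mailing address",
]
vocab = build_vocab(known_labels)
categories = {"home address": "Home address", "phone number": "Phone number"}

match = match_category("current home mailing address", categories, vocab)
```

The number of combinations grows exponentially with the number of surviving words, so this only suits short inputs such as individual category labels.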

https://github.com/18F/10x-ssp-parse-prototype TL;DR: This project scrapes the narrative controls text in SSP documents and uses popular natural language processing (NLP) libraries to quantify the similarity between texts. This method isn't applicable to matching short terms like categories, but it would be useful for comparing fields in SORNs and PIAs.
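For a rough sense of what "quantifying similarity between texts" means, here is a minimal sketch using bag-of-words cosine similarity with only the standard library. It is a stand-in, not what the SSP prototype does, which relies on NLP libraries; `vectorize` and `cosine_similarity` are hypothetical names.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words vector of lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts, from 0.0 (no shared words) to 1.0."""
    vec_a, vec_b = vectorize(text_a), vectorize(text_b)
    dot = sum(vec_a[word] * vec_b[word] for word in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(count * count for count in vec_a.values()))
    norm_b = math.sqrt(sum(count * count for count in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical field text from a SORN and a PIA.
sorn_field = "Records include employee name, home address, and phone number."
pia_field = "The system collects the employee name and home address."
score = cosine_similarity(sorn_field, pia_field)
```

A higher score means the two fields share more of their vocabulary. The score says nothing about meaning, which is where the NLP libraries used in the prototype would do better.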

peterrowland commented 4 years ago

Marcela referred me to the Commodity Futures Trading Commission as an example of plain-language terms for PII. https://www.cftc.gov/Privacy/cftcpia/index.htm https://www.cftc.gov/media/2001/piaems051019/download

peterrowland commented 4 years ago

The question surfaced: Do SORN Categories of Records == PII?

Privacy Act defines a record as: "any item, collection, or grouping of information about an individual that is maintained by an agency..."

and a System of Record as:

a group of any records under the control of any agency from which information is retrieved by the name of the individual or by some identifying number, symbol, or other identifying particular assigned to the individual;

https://www.law.cornell.edu/uscode/text/5/552a

If Privacy Act 'records' are personal information, should we consider personal information PII?

GAO report (08-536) uses 'Personal Information' and 'Personally Identifiable Information' interchangeably, and uses this definition:

includ[es] (1) any information that can be used to distinguish or trace an individual’s identity, such as name, Social Security number, date and place of birth, mother’s maiden name, or biometric records; and (2) any other information that is linked or linkable to an individual, such as medical, educational, financial, and employment information. https://www.gao.gov/assets/gao-08-536.pdf - p.1

NIST's guidance on protecting PII (SP 800-122) references this definition and goes into detail on what information can be used to distinguish or trace an individual, and on what "linked or linkable" means. https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-122.pdf - p. 2-1

We should ask Richard or Marcela to confirm if GSA also uses this definition.

ondrae commented 4 years ago

@peterrowland Thank you for this research.

I was wrong in my assumption that Categories of Records meant something different enough from PII that we should treat them differently. Based on what you found above, I'm going to start talking about them as pretty much the same thing, and will use the terms interchangeably.

Is there anything else you want to do before we close this issue?