Police-Data-Accessibility-Project / scrapers

Code relating to scraping public police data.
https://pdap.io
GNU General Public License v3.0

Add constructs for safe handling of PII. #28

Closed mcsaucy closed 4 years ago

mcsaucy commented 4 years ago

This change centralizes PII handling in a series of wrapper classes. These classes selectively store or redact data based on the --collect_pii flag. This lets the actual scraping code just grab everything, throw it into the PII wrappers, and trust that they'll Do The Right Thing, without materially changing how the PII-wrapped data is used. This is accomplished by subclassing the data types we wish to encapsulate: a Pii.String is a subclass of str, for example.
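For illustration, here is a minimal sketch of how such a wrapper could work. The Pii.String name and the --collect_pii flag come from this PR's description; the body below is an assumption, not the actual implementation:

class Pii:
    collect_pii = False  # would be flipped on by the --collect_pii flag

    class String(str):
        """A str subclass that keeps its value only when PII collection is on."""

        def __new__(cls, value: str = ""):
            # Redact the payload unless the operator opted in to collecting PII.
            return super().__new__(cls, value if Pii.collect_pii else "<REDACTED>")

# Since Pii.String is still a str, scrapers can use it like any other string:
name = Pii.String("Jane Doe")  # stores "<REDACTED>" unless collect_pii is on
print(name.upper())            # ordinary str methods keep working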

PII-handling subclasses are cool and all, but they aren't really useful if they don't get used consistently. To that end, this change alters our Record dataclass to include complete type hints for all fields and ensures that field values match their declared types at runtime. Attempting to shove a plain string into a Pii.String-typed field will raise a TypeError, for example.
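One plausible way to get that runtime enforcement (the field names and the check below are illustrative; the PR's actual code may differ) is a __post_init__ hook that walks the dataclass fields:

import dataclasses

class PiiString(str):
    """Stand-in for the Pii.String wrapper sketched above."""

@dataclasses.dataclass
class Record:
    # Illustrative fields only; the real Record defines the full schema.
    case_number: str
    defendant_name: PiiString

    def __post_init__(self):
        # Reject any field whose value doesn't match its declared type.
        for f in dataclasses.fields(self):
            if not isinstance(getattr(self, f.name), f.type):
                raise TypeError(f"{f.name} expects a {f.type.__name__}")

Record("2020-CF-0001", PiiString("Jane Doe"))  # OK
Record("2020-CF-0001", "Jane Doe")             # TypeError: plain str rejected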

While we're here, this change also alters our Record and Charge definitions to call out optional fields and exposes builder objects for them. Given the number of fields these have, this is less error-prone than setting a bunch of individual variables and plugging them into our dataclass constructors at the end.
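The PR doesn't spell the builders out here, so the following is only a sketch of the general idea, reusing the hypothetical Record and PiiString from above: accumulate fields by name, then construct the dataclass in one validated step.

import dataclasses

class RecordBuilder:
    """Hypothetical builder; the PR's actual builder API may differ."""

    def __init__(self):
        self._values = {}

    def set(self, name, value):
        # Fail fast on typos instead of silently dropping a field.
        valid = {f.name for f in dataclasses.fields(Record)}
        if name not in valid:
            raise AttributeError(f"Record has no field named {name!r}")
        self._values[name] = value
        return self  # enable chained .set(...) calls

    def build(self):
        # Record.__post_init__ re-validates every field's type here.
        return Record(**self._values)

record = (RecordBuilder()
          .set("case_number", "2020-CF-0001")
          .set("defendant_name", PiiString("Jane Doe"))
          .build())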

salter31 commented 4 years ago

@mcsaucy I haven't done a deep review of the entire repo, but from a code-maintenance perspective it would be safer and easier to maintain if the PII checks were done at the shared record level instead of the scraper level. I don't think that class is shared across scrapers yet, but ideally there would be one layer that handles anonymization, so we don't have to integrate the PII class into every single integration that gets completed. That's how leaks and security holes are created.

mcsaucy commented 4 years ago

We don't currently have "1 layer" to do this. Would moving our record definition into common so other scrapers can use it satisfy that requirement? Are you cool with this approach otherwise?

OscarVanL commented 4 years ago

The only issue I can foresee is that the fields collected might vary by scraper.

To give one example, for the Benchmark portal I felt it was helpful to store both the Florida "Uniform Case Numbering System" Case ID and the portal's Case ID, and there was an "Agency Report Number" field on the website, which I thought could be helpful for cross-referencing cases to police report numbers in case we ever get our hands on a police report dataset.

These were not originally listed in the minimum scraped fields list.

Basically, we cannot necessarily be sure Record and Charge will have the same fields for each state.

In principle a shared common definition has advantages, but the ability to extend the dataclasses with extra fields may be necessary.

mcsaucy commented 4 years ago

So do we think there's value in a dataclass with the minimal fields (and PII types) in common that scrapers can extend as needed?

OscarVanL commented 4 years ago

Yes, that sounds like a good idea.

In practice can you extend dataclasses?

So could I go:

@dataclass
class BenchmarkRecord(Record):
    new_field: type

to add new fields on top of those minimal fields/PII types?

mcsaucy commented 4 years ago

In practice can you extend dataclasses?

Looks like it. I've made a base record in common and extended it with the extra fields we're tracking.
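As a rough illustration of the pattern (the field names here are made up; the real common Record defines the actual schema), dataclass inheritance composes like this, with one caveat worth knowing:

import dataclasses
from typing import Optional

@dataclasses.dataclass
class Record:
    # Minimal shared fields (with PII types) would live in common;
    # these names are illustrative, not the real schema.
    case_number: str
    charge: Optional[str] = None

@dataclasses.dataclass
class BenchmarkRecord(Record):
    # Portal-specific extras. Caveat: because the base class has a
    # field with a default, new fields added here need defaults too.
    agency_report_number: Optional[str] = None

record = BenchmarkRecord(case_number="2020-CF-0001",
                         agency_report_number="ARN-42")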

OscarVanL commented 4 years ago

LGTM!

Looks like it does everything we want it to do: it provides a shared record as suggested by @salter31, it enforces all required DB keys, it allows for the protection of PII fields, and it allows for extensibility with extra portal-specific fields.

Let's squash and merge :)

mcsaucy commented 4 years ago

Turns out GitHub's "squash and merge" option works well.