@mcsaucy I haven't done a deep review of the entire repo, but from a code-maintenance perspective it would be safer and easier to maintain if the PII checks were done at the shared record level instead of the scraper level. I don't think that class is shared across scrapers, but ideally there would be one layer that handles anonymization so we don't have to integrate the PII class into every single integration that gets completed. That's how leaks and security holes are created.
We don't currently have "1 layer" to do this. Would moving our record definition into `common` so other scrapers can use it satisfy that requirement? Are you cool with this approach otherwise?
The only issue I can foresee is that the fields collected might vary by scraper.
To give one example, for the Benchmark portal I felt it was helpful to store both the Florida "Uniform Case Numbering System" case ID and the portal's own case ID. There was also an "Agency Report Number" field on the website, which I thought could be useful for cross-referencing cases against police report numbers if we ever get our hands on a police report dataset. These were not originally listed in the minimum scraped fields list.
Basically, we cannot necessarily be sure `Record` and `Charge` will have the same fields for each state.
In principle a shared `common` definition has advantages, but the ability to extend the dataclasses with extra fields may be necessary.
So do we think there's value in a dataclass with the minimal fields (and PII types) in `common` that scrapers can extend as needed?
Yes, that sounds like a good idea.
In practice, can you extend dataclasses?
So could I go:

```python
@dataclass
class BenchmarkRecord(Record):
    new_field: type
```

to add new fields on top of those minimal fields/PII types?
> In practice, can you extend dataclasses?
Looks like it. I've made a base record in `common` and extended it with the extra fields we're tracking.
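Roughly this shape (field names here are illustrative, not the exact ones committed):

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Base record in common/ with the minimum scraped fields."""
    case_num: str
    state: str


@dataclass
class BenchmarkRecord(Record):
    """Portal-specific extension carrying extra Benchmark fields."""
    # Derived fields need defaults so they sort after the base's
    # non-default fields in the generated __init__.
    portal_case_id: str = ""
    agency_report_num: str = ""
```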
LGTM!
Looks like it does everything we want it to do: it provides a shared record as suggested by @salter3, it requires all mandatory DB keys, it allows for the protection of PII fields, and it allows extensibility with extra portal-specific fields.
Let's squash and merge :)
Turns out GitHub's "squash and merge" option works well.
This change centralizes PII handling with a series of wrapper classes. These classes selectively store or redact data based upon the `--collect_pii` flag. This lets the actual scraping code just grab everything, throw it in the PII wrappers, and trust that they'll Do The Right Thing, without materially changing the usage of the PII-wrapped data. This is accomplished by subclassing the data we wish to encapsulate. A `Pii.String` is a subclass of `str`, for example.
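As a minimal sketch of how such a wrapper might work, assuming a module-level switch set from the `--collect_pii` flag (the class name, flag plumbing, and redaction marker here are illustrative, not the repo's actual implementation):

```python
# Illustrative only: in practice this would be set from --collect_pii.
_COLLECT_PII = False


class PiiString(str):
    """A str subclass that keeps its value only when PII collection is on."""

    def __new__(cls, value: str) -> "PiiString":
        # Redact at construction time so downstream code can't leak
        # the original value by accident.
        return super().__new__(cls, value if _COLLECT_PII else "<REDACTED>")
```

Because the wrapper subclasses `str`, scraper code can keep treating the wrapped value as an ordinary string.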
PII-handling subclasses are cool and all, but they aren't really useful if they don't get consistently used. To that end, this change alters our `Record` dataclass to include complete type hints for all fields and ensures the field values match the type at runtime. Attempting to shove a normal string into a `Pii.String`-typed field will result in a `TypeError`, for example.
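One way such a runtime check could be implemented is a `__post_init__` hook that compares each field's value against its annotation. This is a sketch reusing the `PiiString` stand-in above, assuming annotations resolve to real classes (no postponed/string annotations); the field names are made up:

```python
from dataclasses import dataclass, fields


@dataclass
class Record:
    # Illustrative subset of fields; the real Record has many more.
    case_num: str
    defendant_name: PiiString  # must be the PII wrapper, not a bare str

    def __post_init__(self) -> None:
        # Enforce declared types at construction time. Assumes f.type
        # holds the actual class object rather than a string.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, "
                    f"got {type(value).__name__}"
                )
```

Under this sketch, `Record("2020-CF-000123", "Jane Doe")` raises `TypeError`, while `Record("2020-CF-000123", PiiString("Jane Doe"))` passes; the reverse direction is fine because `PiiString` is itself a `str`.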
While we're here, alter our `Record` and `Charge` definitions to call out optional fields and expose builder objects for them. Given the number of fields these have, this is less error-prone than setting a bunch of individual variables and plugging them into our dataclass constructors at the end.
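A plausible shape for such a builder is a chainable object that accumulates fields by name and defers validation to the dataclass constructor. Method names here are assumed, and it reuses the `Record` sketch above:

```python
class RecordBuilder:
    """Accumulates Record fields by name, validating only on build()."""

    def __init__(self) -> None:
        self._values: dict = {}

    def case_num(self, value: str) -> "RecordBuilder":
        self._values["case_num"] = value
        return self  # chainable

    def defendant_name(self, value: PiiString) -> "RecordBuilder":
        self._values["defendant_name"] = value
        return self

    def build(self) -> Record:
        # The dataclass constructor (and its runtime type checks)
        # runs here, so a mistyped or incomplete Record fails loudly.
        return Record(**self._values)
```

Usage would look like `RecordBuilder().case_num("2020-CF-000123").defendant_name(PiiString("Jane Doe")).build()`, keeping each field assignment next to its name instead of relying on positional constructor arguments.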