digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

Enable datasource-specific fields for pseudonymisation #438

Open · sal-uva opened this issue 4 months ago

sal-uva commented 4 months ago

Datasource scripts should be able to register what csv/ndjson fields are sensitive and should be considered when pseudonymising a dataset. Simply hard-coding author fields doesn't cut it; e.g. the Tumblr data source has a field called post_url that can easily be used for de-anonymisation.

dale-wahl commented 4 months ago

The remove author processor was modified to allow users to choose which fields to address. It lets you choose any field for a CSV, but with JSON you have to list possible fields in the pseudonymise_fields attribute of the datasource. You can add fields there (Tumblr is CSV, so you can already use this processor to anonymize post_url).
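
For illustration, declaring such fields could look roughly like this (the class name and field values are placeholders, not taken from an actual 4CAT datasource):

```python
# Rough sketch only: the thread establishes that a datasource can expose a
# pseudonymise_fields attribute which the remove author processor reads for
# NDJSON data. The class name and field values below are illustrative.
class SearchExampleDatasource:  # would subclass 4CAT's Search datasource class
    type = "example-search"

    # Key (sub)strings whose values the processor should offer to pseudonymise.
    pseudonymise_fields = ["author", "post_url"]
```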

It is non-trivial to actually update the JSONs by field. map_item does not work in reverse (we would need to write a reverse version of the function for each datasource and maintain it). You can get all the field names via map_item, but you cannot remove the underlying original data without their actual keys. You could also get the actual keys, but the user would need to view the JSONs themselves to really understand what the fields contain, as they may be named differently from the map_item variant. I am not sure of a good solution or implementation for this.
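
To illustrate the asymmetry (field names here are made up, not from an actual datasource):

```python
# A hypothetical map_item flattens nested JSON into friendly column names,
# so the mapped name alone does not tell you which original keys to redact.
def map_item(item):
    return {
        "id": item["id"],
        "author": item["blog"]["name"],  # "author" actually lives at blog.name
        "subject": item["summary"],
    }

# Going the other way ("author" -> blog.name) would need a hand-written
# reverse mapping per datasource, which is the maintenance burden noted above.
```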

I think the most important thing is to ensure data is secure in 4CAT and that researchers are well aware of what is contained in their data before they publish it. It is, unfortunately, trivial in most cases to de-anonymize a post with its URL, ID, or even the combination of text body and timestamp. And if only pseudonymise was used, you can use the author hash to find everything else from the author (this is why I added the option to redact information instead of just hashing it).
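
For context, the difference between the two modes is roughly this (a simplified sketch, not the processor's actual code):

```python
import hashlib

def pseudonymise(value, salt="per-dataset-salt"):
    # Hashing is deterministic: the same author always yields the same hash,
    # so their posts stay linkable (and potentially traceable with outside data).
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()

def redact(value):
    # Redaction severs that link entirely, at the cost of per-author aggregation.
    return "REDACTED"
```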

sal-uva commented 4 months ago

Thanks for the suggestions. I meant pseudonymisation when the dataset is created, though; 4CAT should handle pseudonymisation as a default instead of relying on users to jump through the (sometimes complicated) hoops of doing so themselves. And not offering the possibility to pseudonymise data from the NDJSON is also a problem.

I.e. I'm suggesting that a pseudonymise_fields attribute should enable data source scripts to set both JSON keys and CSV column names to pseudonymise once the dataset is uploaded (e.g. all keys or columns containing the word blog for Tumblr; I guess you'll have to loop through the nested dict once to find where these are, or alternatively let the developer point to their exact location). There should probably be a hook somewhere here in search.py that takes this into account.
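
A very rough sketch of such a hook (everything except the pseudonymise_fields name is hypothetical, including where exactly in search.py it would live):

```python
# Hypothetical post-collection hook: hash any value whose key matches one of
# the datasource's pseudonymise_fields. A real version would need to recurse
# into nested dicts for NDJSON items rather than only checking top-level keys.
def apply_default_pseudonymisation(items, datasource, hash_value):
    fields = getattr(datasource, "pseudonymise_fields", [])
    if not fields:
        return items
    for item in items:
        for key in list(item.keys()):
            if any(field in key for field in fields):
                item[key] = hash_value(item[key])
    return items
```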

Another alternative is to run the remove authors processor straight after the dataset is created.

dale-wahl commented 4 months ago

Pseudonymise is already the default option for 4CAT-created datasets, but it looks like the hook for Zeeschuimer works differently from the one for 4CAT-created datasets. The Zeeschuimer hook works by running the remove authors info processor after creation. I guess the hook for 4CAT-created datasets was never updated to use the pseudonymise_fields attribute and only looks for author keys.

The existing code searches through the nested dicts for any key containing the provided term (e.g., author removes any nested key containing the substring author). The remove author info processor works the same way. So to accomplish what you are looking for, I think all you would need to do is update the 4CAT-created datasources to run the remove authors info processor in the same way Zeeschuimer datasources do, and update the pseudonymise_fields attribute of the datasources in question.
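
Roughly, the matching described here amounts to the following (a simplified stand-in for dict_search_and_update, which also handles lists and other cases):

```python
def search_and_update(data, search_terms, update_function):
    # Any key containing one of the search terms has its entire value replaced;
    # otherwise recurse into nested dictionaries and keep looking.
    for key, value in data.items():
        if any(term in key for term in search_terms):
            data[key] = update_function(value)
        elif isinstance(value, dict):
            search_and_update(value, search_terms, update_function)
    return data
```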

dale-wahl commented 4 months ago

One final consideration: dict_search_and_update is very destructive. In many cases, datasources would have objects like "author": {"name": "jim", "hometown": "somewhere"}. I wrote it to pseudonymise or redact all nested data when finding a match. So in that example, author would address both jim and somewhere. All that to say, be careful what keys you feed it!
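
With the simplified stand-in from the previous comment, the caveat plays out like this:

```python
item = {"author": {"name": "jim", "hometown": "somewhere"}, "body": "hello"}
search_and_update(item, ["author"], lambda value: "REDACTED")
# item is now {"author": "REDACTED", "body": "hello"}: both "jim" and
# "somewhere" are gone, because the match happens on the parent key.
```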

sal-uva commented 4 months ago

Thanks! I'll try to integrate this soon.

And being greedy in what to pseudonymise is definitely the way to go here imo; if non-personal but relevant data is mistakenly hashed, we can always specify the fields more precisely later.

stijn-uva commented 4 months ago

Should we want to anonymise JSONs after all?

Often we only have a vague idea of what the JSON will look like; we assume some part of its structure is guaranteed (the parts map_item wants) but more or less ignore the rest. This has been useful, because that flexibility affords the combination of Zeeschuimer and 4CAT that has been quite productive. But it also means we are essentially trying to anonymise data of which we don't actually fully know which parts should and should not be anonymised. This will be an increasingly large maintenance burden as we add more data sources to Zeeschuimer, and it arguably gives users a false sense of security if we promise their data has been anonymised when we aren't actually entirely sure that it is. We then re-assign some of that blame to users by allowing them to specify which fields should and should not be anonymised, but most users will not know the data structure well enough to do that properly, especially for complicated structures like LinkedIn's.

Some options:

sal-uva commented 4 months ago

For me, ethically and legally speaking, not allowing pseudonymisation of all data that is collected is a real no-go, even if encrypted (though that feature should also be implemented). Just personally speaking, I'm not comfortable with my phone number being stored in a JSON by what should be an "ethics by design" research tool.

Option 2 could indeed work -- I can't think of many realistic scenarios where users would want to retain the NDJSON. Then the pseudonymise option would become something like "Pseudonymise to new CSV and discard original JSON/CSV".

But we more or less have already accepted the maintenance cost of checking changes in the NDJSON data with the introduction of map_item, right? So why not slightly expand the scope of this? I also don't think that user data will shift around that much; in my experience most API changes concern stuff like new post content.

Allowing both Option 2 and Option 3 is also a possibility.

stijn-uva commented 4 months ago

But is it ethics by design to promise anonymised data when we cannot guarantee that it actually is anonymised?

There is indeed a maintenance cost already in keeping map_item up to date, but that cost is more clearly demarcated: there is an explicit list of fields we want to keep and we know which fields in the underlying JSON data they map to. Whether updates are needed is also relatively easy to detect, because the function will stop working if the parts of the JSON data it relies on change. We can effectively ignore the rest of the JSON; the part that we do not need for mapping is not our concern, and thus we also don't really know whether there is sensitive data in there or not.

If we decide we do need to know that, to be able to effectively anonymise it, we need to understand the full data object rather than just the 'interesting' parts, and proactively check whether the data that gets uploaded from Zeeschuimer has a different structure or has added fields that we should consider. Otherwise we again cannot guarantee that our predefined list of 'anonymisable fields' is actually correct, and the promise that the data is properly anonymised cannot be relied on.

sal-uva commented 4 months ago

I agree that Option 3 is more maintenance (especially when thinking of LinkedIn), but if we don't go that way I'd vote for Option 2 as the default when creating/importing datasets; at least then we can guarantee pseudonymisation.

...which does raise the question of why we decided to save 'raw' NDJSONs in the first place, but that's another story!

sal-uva commented 4 months ago

(Option 2 would still benefit from datasource-specific sensitive fields btw, but then it can just be a simple list of column names)

stijn-uva commented 4 months ago

...which does raise the question of why we decided to save 'raw' NDJSONs in the first place, but that's another story!

Two reasons:

sal-uva commented 4 months ago

These are good reasons, but:

  1. Both map_item and pseudonymisation always push us to make the choice of which parts of the data are interesting.
  2. Decoupling data collection and data mapping is more a matter of facilitating debugging. To retain this, and to make sure long-running captures don't fail outright, I'd propose to 1) add a setting for the default pseudonymisation option (which we can set to 'none' in development contexts to make our lives easier) and 2) make sure that failures in map_item only skip that particular value but still save the others, as in the sketch below (was this not originally the case?).
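
A sketch of what point 2 could mean in practice (the wrapper and the error handling are illustrative, not existing 4CAT code):

```python
# Hypothetical wrapper: if mapping one field of an item fails, keep the rest
# of the item instead of discarding it (or failing the whole capture).
def map_item_safely(item, field_mappers):
    mapped = {}
    for field, mapper in field_mappers.items():
        try:
            mapped[field] = mapper(item)
        except (KeyError, IndexError, TypeError):
            mapped[field] = ""  # skip just this value, keep the others
    return mapped
```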

dale-wahl commented 4 months ago

I definitely think we could and should encrypt the data. Most users do not anonymize their data, even with it as the default. I was just discussing this with students over summer school: they did not understand why they did not have usernames, and I pointed out that they did not need them for what they were trying to accomplish. They re-collected their data anyway.

Decoupling data collection and mapping allows researchers to decide what part of the data is interesting and relevant to their research. None of the data is important to me; I add fields to map_item at the request of others. Still, Option 2 is more feasible, with the downside that, if additional data turns out to be interesting, data collection will have to be conducted again after it is mapped.

I think what is most important is that we provide robust tools for researchers to make the decision about what needs to be removed and anonymized. We should also make recommendations, as we are familiar with the data, and those can be the defaults. To Stijn's point, we really cannot guarantee that we are effectively removing all relevant information. In my opinion, it takes very little information to begin de-pseudonymizing a dataset.

sal-uva commented 4 months ago

I mean, we can do our best to guarantee there's no clearly identifiable data by default, and I think we should. Of course, with enough effort, almost any data can be de-anonymised, but that doesn't mean 4CAT shouldn't facilitate pseudonymisation as the standard option.

I don't see the re-collecting of data as a huge problem in case new stuff needs to be added to map_item -- researchers shouldn't be collecting enormous datasets without knowing what will be in there beforehand, anyway.

So if I'm correct, the consensus seems to be: Option 2, which 1) works with pseudonymise_fields, 2) by default deletes the original data, and 3) also allows retaining all data the way it's collected, as long as that's an active choice by the user. Agreed?
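
For concreteness, the user-facing choice could look something like this (keys and labels are purely illustrative):

```python
# Hypothetical dataset-creation option reflecting the proposed consensus:
# pseudonymise-and-discard by default; keeping raw data is an explicit opt-in.
options = {
    "pseudonymisation": {
        "type": "choice",
        "default": "pseudonymise-and-discard",
        "options": {
            "pseudonymise-and-discard": "Pseudonymise sensitive fields (per pseudonymise_fields) and discard the original NDJSON/CSV",
            "keep-original": "Keep the data exactly as collected",
        },
        "help": "How sensitive fields are handled when the dataset is created",
    },
}
```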

stijn-uva commented 4 months ago

I don't think we're agreed on all details, let's discuss this offline...