digitalmethodsinitiative / 4cat

The 4CAT Capture and Analysis Toolkit provides modular data capture & analysis for a variety of social media platforms.

Dataset encryption #423

Open stijn-uva opened 5 months ago

stijn-uva commented 5 months ago

Currently, datasets are stored without any form of encryption, allowing anyone with access to the file to view the data.

This is OK in many circumstances, since datasets can be made private (i.e. only available to owners). But even then people with filesystem access to the server, as well as server admins, still have access to the files.

For sensitive data this is potentially problematic. The solution is to run your own 4CAT, but this is not always feasible, and even then in some circumstances encrypted file storage might be preferred (because this is an organisational requirement, etc). If we go forward with the media upload datasource (#419) people might upload sensitive data collected elsewhere (e.g. from recorded interviews), and it would be useful if secure storage could be offered for such data.

Since we already use zip files to store various types of datasets, an obvious solution would be to use encrypted zip archives. For datasets not currently stored as zip files the various methods to access the data (iterate_items, etc) could be amended to transparently store in and read from encrypted zip archives.

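Python's native zipfile module can decrypt password-protected archives but cannot create them, and it only handles the weak legacy zip encryption rather than AES. A library such as pyzipper, for example, seems to be a robust and mostly drop-in alternative.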
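As a rough illustration of the idea, here is a minimal sketch of storing a dataset in an AES-encrypted zip archive with pyzipper and iterating over its items again. The function names, the `dataset.csv` member name and the CSV layout are assumptions made for this example and are not part of 4CAT. For simplicity the archive member is read into memory in one go, which would need rethinking for large datasets.

```python
import csv
import io


def write_encrypted_dataset(path, rows, fieldnames, password):
    """Store rows as a CSV file inside an AES-encrypted zip archive.

    Hypothetical helper for illustration; `password` must be bytes.
    """
    # pyzipper is a third-party package, imported here so it is only
    # required when encryption is actually used
    import pyzipper

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

    with pyzipper.AESZipFile(path, "w", compression=pyzipper.ZIP_DEFLATED,
                             encryption=pyzipper.WZ_AES) as archive:
        archive.setpassword(password)
        archive.writestr("dataset.csv", buffer.getvalue())


def iterate_encrypted_items(path, password):
    """Yield each row of the encrypted dataset as a dictionary."""
    import pyzipper

    with pyzipper.AESZipFile(path) as archive:
        archive.setpassword(password)
        # Decrypts the whole member at once; fine for a sketch
        data = archive.read("dataset.csv").decode("utf-8")

    yield from csv.DictReader(io.StringIO(data))
```

A caller would write the dataset once with `write_encrypted_dataset(...)` and later pass the same password to `iterate_encrypted_items(...)`, so code consuming the items would not need to know that the file on disk is encrypted.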

A question is how to handle access to the archive. To run a processor on encrypted data, the encryption key would need to be available on the server, at least temporarily. We already have some code in place to handle credentials for APIs et cetera, which are kept on disk as briefly as possible and deleted once no longer necessary. A similar compromise could be used here.
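One possible shape for such a compromise is sketched below, using only the standard library. It is not how 4CAT currently handles credentials; `temporary_key_file` and the `dataset-key-` prefix are hypothetical names. The key is written to a private temporary file and removed as soon as the block using it ends, including when an exception is raised. Note that deleting a file does not securely erase its contents from the disk.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temporary_key_file(key, directory=None):
    """Write `key` (bytes) to a private temporary file and delete it on exit."""
    # mkstemp creates the file readable and writable only by the current user
    fd, path = tempfile.mkstemp(prefix="dataset-key-", dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(key)
        yield path
    finally:
        # Also runs when the code inside the `with` block raises an exception
        if os.path.exists(path):
            os.remove(path)


if __name__ == "__main__":
    with temporary_key_file(b"example-key") as key_path:
        with open(key_path, "rb") as handle:
            key = handle.read()
        # ... a processor would decrypt and process the dataset here ...
        print("key available:", os.path.exists(key_path))
    print("key still on disk:", os.path.exists(key_path))
```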

sal-uva commented 5 months ago

I'd be all for this! It would allow 4CAT to be used in many different research contexts.

I guess the last point could be handled with 'dataset passwords' that allow you to both access and decrypt datasets? These could be stored as a cookie and deleted from the server after iterating over a dataset.

stijn-uva commented 5 months ago

Yes, we currently have the sensitive and cache options for processor input fields, which together make 4CAT handle input this way:

https://github.com/digitalmethodsinitiative/4cat/blob/6d8cb067bc12f8be68749f74a7291e0849494225/backend/lib/processor.py#L178

https://github.com/digitalmethodsinitiative/4cat/blob/6d8cb067bc12f8be68749f74a7291e0849494225/webtool/static/js/fourcat.js#L385