Improve data collection practices

andrewdavidwong commented 6 years ago

From https://github.com/QubesOS/qubes-doc/pull/649#discussion_r187784320:

We keep the logs for both statistics and audit reasons. I'm not sure how long we keep them, but I think at least 2 months; in any case, the statistics script does not touch logs older than 3 months.

The logs are kept unencrypted, and that includes IP addresses. They are looked at by a human only in case of an audit/postmortem (this has happened at least once).

I think it's very important that we improve our practices in this area. Thankfully, we don't have to handle much data about users, but with respect to the data about users that we do handle, we should:

Set clear written policies for data access and retention.
Publish those policies publicly (see #1624).
Store the data securely (e.g., encrypt the IP addresses and/or store only hashes of them).
Ensure that only the minimum required number of authorized personnel are allowed access to the data, and keep logs of that access.
Securely destroy the data as soon as it's no longer needed (e.g., after we're done calculating the userbase estimate for each month, or as soon as our auditing requirements allow).
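
As a rough illustration of the "store only hashes" and "destroy as soon as it's no longer needed" points above, here is a minimal Python sketch. It assumes the monthly userbase estimate only needs to count distinct addresses, so each IP address can be replaced by a keyed hash (HMAC-SHA256) under a random key that is thrown away once the month's estimate is done. A keyed hash is used because a plain hash of an IPv4 address can be reversed by trying all 2^32 addresses. The function names and sample addresses are hypothetical and are not taken from the actual statistics script.

```python
import hashlib
import hmac
import secrets


def pseudonymize_ip(ip: str, key: bytes) -> str:
    """Return a keyed hash (HMAC-SHA256, hex-encoded) of an IP address."""
    return hmac.new(key, ip.encode("utf-8"), hashlib.sha256).hexdigest()


def estimate_unique_clients(ips, key: bytes) -> int:
    """Count distinct clients using only the pseudonymized addresses."""
    return len({pseudonymize_ip(ip, key) for ip in ips})


# Hypothetical sample data (documentation-only address ranges);
# real input would come from the update server logs.
sample_ips = ["192.0.2.10", "192.0.2.10", "198.51.100.7"]

# Assumption: one fresh random key per month, held only while the
# estimate is being computed.
monthly_key = secrets.token_bytes(32)
estimate = estimate_unique_clients(sample_ips, monthly_key)
print(estimate)

# Dropping the reference stands in for destroying the key; without the key,
# the stored hashes can no longer be tested against candidate addresses.
# Real key destruction would need more care than this.
monthly_key = None
```

Whether this fits depends on the auditing requirements: once the key is gone, the stored values can no longer be tied back to an address, which is the point for statistics but may be a problem for a postmortem.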

I'm assigning this to the "Documentation/website" milestone, but this primarily applies to the data we collect from the Qubes update servers. Since we don't host the Qubes website ourselves, we don't have any access to or control over the data generated when people visit the website.

CC: @rootkovska, @marmarek, @woju, @mfc

marmarek commented 6 years ago

We don't have full control over this. The logs we have are mostly what anyone on the network path could observe (the fact that a given IP address connects to the Qubes updates server). Also, we have no direct control over what mirror providers store. Even mirrors.kernel.org does not have a privacy policy, let alone other mirrors.

Following our distrust of the infrastructure, even if we publish such rules, there is no reason to believe that they are respected. As explained there, there are multiple areas that we don't control, and multiple ways in which even the things we somewhat control could be taken over. For example, the service provider we use for the updates server could silently take snapshots of the memory and disk of that machine, and we'd never know.

Providing a statement that we don't really have the means to keep up with would be irresponsible on our part.

I propose that we not publish any statement about what data we keep and instead add an FAQ entry explaining why we don't have one. We could also hint to users that if they want to hide their IP address when downloading updates, there is an option to use Tor.

andrewdavidwong commented 6 years ago

We don't have full control over this. The logs we have are mostly what anyone on the network path could observe (the fact that a given IP address connects to the Qubes updates server). Also, we have no direct control over what mirror providers store. Even mirrors.kernel.org does not have a privacy policy, let alone other mirrors.

Following our distrust of the infrastructure, even if we publish such rules, there is no reason to believe that they are respected. As explained there, there are multiple areas that we don't control, and multiple ways in which even the things we somewhat control could be taken over. For example, the service provider we use for the updates server could silently take snapshots of the memory and disk of that machine, and we'd never know.

If I understand correctly, the argument is: Since we cannot guarantee that third parties will handle user data carefully, we should not bother handling user data carefully either. I don't think that's the right way to think about this. What matters is not whether we have total control over the data, but what we do with the data that we do have control over. In other words, we should be responsible for what we control, regardless of whether other entities are responsible for what they control. The fact that other entities could be irresponsible does not absolve us of our own responsibility.

Think about it this way: If a user asks, "What do you (i.e., the Qubes team) do to protect the data about me that you collect from the updates servers?" then the truthful answer will have to be "nothing" (or "close to nothing"). If a user asks, "Why don't you encrypt my IP address before storing it?" then our answer is, "Because someone else could store your IP address unencrypted." But that doesn't really make sense. Even if we can't solve the entire problem, we can at least refrain from exacerbating it.

Providing a statement that we don't really have the means to keep up with would be irresponsible on our part.

That's not what I'm proposing. I'm proposing that we set a policy for what we do control that we can keep up with.

It's fine to say, "Look, there are all sorts of ways in which this data could be intercepted before it gets into our hands, but once it gets into our hands, we try to handle it with care."

andrewdavidwong commented 6 years ago

We distrust the infrastructure for good reason, but insofar as we are part of the infrastructure, we should try to be trustworthy. There are some systems that we could, in principle, run in a trustworthy way ourselves (e.g., hosting the website and the mailing lists), but doing so would be prohibitively expensive (in both time and money), so we offload it to the distrusted infrastructure. If improving our data collection practices with respect to the Qubes update servers would also be prohibitively expensive, then it would be consistent to adopt the position that users would be better served by us spending our time on things that we know will significantly benefit their privacy and security, rather than on strengthening the one link we control in a very weak chain. It's an empirical question whether this is the case. It's also conceivable that we could become the weak link in the chain (e.g., if we do nothing while other parts of the infrastructure are compelled by law to adopt better privacy practices) or that we have a measure of control in the selection of the other links we associate with (e.g., choosing to use privacy-respecting service providers).

andrewdavidwong commented 6 years ago

This might be helpful for logs:

https://github.com/efforg/cryptolog
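
To make the general idea concrete, here is a minimal Python sketch of scrubbing IP addresses out of log lines before they are stored. It is not cryptolog's code or interface, and it is only a guess at the kind of processing that would help here. It assumes IPv4 addresses in plain-text log lines and a random key that is kept only in memory and replaced each day, so that hashes from different days cannot be linked. The class name, the log line format, and the 16-character truncation are all hypothetical choices.

```python
import hashlib
import hmac
import re
import secrets

# Matches dotted-quad IPv4 addresses; IPv6 is not handled in this sketch.
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class RotatingAnonymizer:
    """Replace IPv4 addresses in log lines with keyed hashes.

    The key lives only in memory and is regenerated whenever the day
    changes, so the previous day's key is lost for good.
    """

    def __init__(self):
        self._day = None
        self._key = None

    def _key_for(self, day: str) -> bytes:
        # Rotate to a fresh random key on the first line of a new day.
        if day != self._day:
            self._day = day
            self._key = secrets.token_bytes(32)
        return self._key

    def scrub(self, line: str, day: str) -> str:
        key = self._key_for(day)

        def replace(match):
            digest = hmac.new(key, match.group(0).encode("utf-8"), hashlib.sha256)
            # Truncated for readability; the length is an arbitrary choice.
            return digest.hexdigest()[:16]

        return IPV4_RE.sub(replace, line)


anonymizer = RotatingAnonymizer()
# Hypothetical log line; the real update server log format may differ.
line = '192.0.2.10 - - "GET /repo HTTP/1.1" 200'
print(anonymizer.scrub(line, "2018-05-14"))
```

Within one day the same address always maps to the same token, so per-day counting still works, while the original address never reaches the disk.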

QubesOS / qubes-issues

Improve data collection practices #3895