DNSCrypt / dnscrypt-resolvers

Lists of public DNSCrypt / DoH DNS servers and DNS relays
https://dnscrypt.info

What is considered logging? (in reference to Cloudflare) #128

Closed brainscar closed 5 years ago

brainscar commented 5 years ago

Hi!

I tried visiting the wiki here on GitHub, but I can't find what your policy is regarding logging. I'm asking because I have some concerns about Cloudflare being under the "no logging" label. According to their website, they log this:

Cloudflare will collect only the following anonymized DNS query data that is sent to the Cloudflare Resolver:

Timestamp
IP Version (IPv4 vs IPv6)
Cloudflare Resolver IP address + Destination Port
Protocol (TCP, UDP, TLS or HTTPS)
Query Name
Query Type
Query Class
Query Rd bit set
Query Do bit set
Query Size
Query EDNS enabled
EDNS Version
EDNS Requested Max Buffer Size
EDNS Nsid
Response Type (normal, timeout, blocked)
Response Code
Response Size
Records in Response
Response Time in Milliseconds
Response served from Cache
DNSSEC Validation State (secure, insecure, bogus, indeterminate)
PoP ID
Server ID
Autonomous System Number

This seems like enough information to identify someone. I do understand they remove the IP address, as seen here:

There is some telemetry information (i.e. performance related metrics), however, that Cloudflare will store indefinitely as part of its permanent logs in order to assist Cloudflare in enhancing the overall performance of Cloudflare Resolver and identifying security threats. Cloudflare will only store permanent logs of the following such information:

My point here is this: the reason people worry about their IP address being logged is because it is considered 'identifying information'. However, if you look at that list above, there are several things in there that can identify someone easily.

Which they actually admit to being able to do in the bold section here:

Total number of queries with different protocol settings (e.g. tcp/udp/dnssec) by Cloudflare PoP
Response code/time quantiles with different protocol settings by Cloudflare PoP
Total Number of Requests Processed by Cloudflare PoP
Aggregate List of All Domain Names Requested, and timestamp of first time requested
-----> Number of unique users <-----, queries over IPv4, queries over IPv6, queries with the RD bit set, queries asking for DNSSEC, number of bogus, valid, and invalid DNSSEC answers, queries by type, number of answers with each response code, response time quantiles (e.g. 50 percentile), and number of cached answers per minute, per day, per protocol (HTTPS/UDP/TCP/TLS), per Cloudflare data center, and per Autonomous System Number.
Number of queries, number of queries with EDNS, number of bytes and time in answers quantiles (e.g. 50 percentile) by day, month, Cloudflare data center, and by IPv4 vs IPv6.
Number of queries, response codes and response code quantiles (e.g. 50 percentile) by day, region, name and type.

If they can identify unique users, and keep all the information above (some of it permanently), my suggestion is to reconsider putting them under "no-logging".

Regardless, I trust your opinion.

Source: https://developers.cloudflare.com/1.1.1.1/commitment-to-privacy/privacy-policy/privacy-policy/

publicarray commented 5 years ago

Good question! I actually never read their policy, but from this I'm not sure. I suppose it depends on an individual's threat model. To play it safe, I agree we could remove the non-logging label.

Just as a reference, here is what I'm doing on my server: https://dns.seby.io/stats.html All this really shows is how the server and the clients are behaving. I'm pretty sure that it's impossible to identify someone from these graphs. This is the only data I have. I use it to see how popular the service is and whether I need to take manual action (e.g. when the graphs go down and stay at 0, or sky-rocket because someone is abusing the service).

From my graphs I could get aggregate data on the following:

Timestamp (in a few minute increments)
Query Type
Query Class
Query Rd bit set
Query Do bit set
Query EDNS enabled
Response Code
Response Time in Milliseconds
Response served from Cache
DNSSEC Validation State (secure, bogus)

I don't consider this logging, but I'm technically logging some information, so maybe I should remove the no-logging label too? I don't know. It depends on an individual's threat model.

Maybe we should define logging such that if it's possible to identify a unique user or query from the logs, it's logging; otherwise it's non-logging? That definition still doesn't help much though.

For Cloudflare, I think they may use unique identifiers to determine unique users in the 24 hour period. Then after 24h they just increment the "Number of unique users" counter. I don't know, I'm speculating. I do think they are pushing the no-logging envelope a bit though.

@jedisct1 What do you think?

jedisct1 commented 5 years ago

There has never been a formal definition of a non-logging resolver, but this is a very important topic, and something that we should define all together.

Logging the client IP address, even temporarily, should probably clear the 'non-logging' bit immediately.

Now, what about logging queries and responses?

Even without client IP addresses, this can leak sensitive information.

While a unique sequence of queries does not reveal the client IP, it reveals when that device is online.

More importantly, DNS queries, even to nonexistent names, reveal information about the network, what software is being used and more.

For example, queries for testing-secret-internal-project.bankofamerica.com could reveal the address of something that was originally not supposed to be public.

Another issue is that when a query for a nonexistent name is made, operating systems can be configured to retry using the "default" domain (or even a set of domains, e.g. with the search property in resolv.conf). So, a Bank of America employee trying to access hardcorefishrubbingfetish.com would send a query for that name first, and fall back to a second query for hardcorefishrubbingfetish.com.bankofamerica.com.

While the first query doesn't reveal much information about the identity of the client, the second does.
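To make the fallback concrete, here is a minimal sketch of how a stub resolver might expand a name using a search list. It is a simplified model, assuming glibc-like behavior with the default `ndots` of 1; the function name `candidate_names` is made up for this illustration, and a real resolver only moves on to the next candidate when the previous one fails.

```python
def candidate_names(name, search_domains, ndots=1):
    """Return the names a stub resolver might try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        # Enough dots: try the name as typed first, then the search list.
        return [absolute] + expanded
    # Too few dots: search list first, the name as typed last.
    return expanded + [absolute]


for query in candidate_names("hardcorefishrubbingfetish.com", ["bankofamerica.com"]):
    print(query)
```

The second name printed is the one that ties the query to the employer's domain.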

A third issue, similar to the previous one, is browser autocompletion, which can also trigger the default suffix, so that search queries can end up as queries for <search query>.bankofamerica.com.

Unfortunately, this information is already public. Sensors recording queries and responses sent to authoritative servers are everywhere. Companies such as Cisco and Farsight log everything they see and sell access to their databases. This data is stored forever. There are also many free services doing the same. This is very useful for security and marketing purposes.

Even data sent to a resolver that doesn't log may end up in these databases, because the sensors are placed between the resolvers and the authoritative servers, not between the client and the authoritative servers.

So, the consensus in the DNS community, maybe as a way to downplay the fact that DNSSEC doesn't provide any confidentiality, or that names can be brute-forced, has always been that "DNS data should be considered public".

If we agree with that, maybe the definition of "doesn't log" can just be "doesn't log the client IP, even temporarily".

brainscar commented 5 years ago

Thank you both so much for your responses, I really appreciate the open discussion we're having.

I think this topic goes beyond just Cloudflare, and it was not my intention to single them out.

In terms of what is considered logging I think there are at least 3 instances that we're dealing with:

Which begs the question: at what point does it become too much?

I agree with @jedisct1 about this:

Unfortunately, this information is already public.

For example, testing-secret-internal-project.bankofamerica.com could also be found by things like:

(sorry @jedisct1 no fish rubbing at github yet.)

So in that sense I would agree with "IP logging is considered logging".

However, when we look at the list of what Cloudflare logs, I do believe there is more to worry about than just queries and responses.

And that's what I would love to get your input on, @jedisct1 and @publicarray.

You see, if the query is public data but the IP address isn't, one could argue:

anyone could have made that request.

However, if we look at that list, I don't think that statement applies anymore. After all, if you narrow it down, that list is essentially a unique fingerprint, which then becomes attached to the query. And that's my concern: being able to put query and person together.
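A rough way to picture the fingerprint concern is to count how many log entries share the same combination of fields. The sketch below uses made-up records (the ASN is from the range reserved for documentation, and the field names are chosen for this illustration only); it says nothing about what any real log contains, it only shows how adding fields can shrink the group an entry hides in.

```python
from collections import Counter

# Made-up records for illustration only; this is not real log data.
records = [
    {"asn": 64500, "protocol": "UDP", "edns_bufsize": 4096},
    {"asn": 64500, "protocol": "UDP", "edns_bufsize": 1232},
    {"asn": 64500, "protocol": "HTTPS", "edns_bufsize": 1232},
    {"asn": 64500, "protocol": "HTTPS", "edns_bufsize": 1232},
]


def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values for `fields`."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return min(groups.values())


print(smallest_group(records, ["asn"]))
print(smallest_group(records, ["asn", "protocol"]))
print(smallest_group(records, ["asn", "protocol", "edns_bufsize"]))
```

When the smallest group reaches 1, that combination of fields singles out one entry, even though no IP address was recorded.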

Thank you guys again, I hope we can continue this conversation.

jedisct1 commented 5 years ago

The information Cloudflare logs doesn't seem to be enough to passively link queries to users, so the Number of unique users mention in their privacy policy is a bit concerning.

Maybe they make a rough estimate based on the number of queries, and the fact that on average, a user makes x queries per day.

Or maybe they temporarily use client IP addresses, independently from the payloads they send and receive, for throttling and DoS mitigation. That can be implemented at any layer, but a firewall rule that prevents a single client IP from sending tons of queries in a short time fits in this category. Using client IP addresses that way is probably fine and should not void the "non logging" flag.
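As an illustration of what "using the IP without logging it" could look like, here is a minimal token-bucket sketch. It is only an assumption about one possible design, not a description of what Cloudflare or any other operator runs; the class and method names are hypothetical. The point is that the limiter sees the client address but never the DNS payload, keeps its state in memory only, and forgets idle clients.

```python
import time


class ClientRateLimiter:
    """Token bucket per client IP, held in memory only and never written to a log."""

    def __init__(self, rate_per_sec, burst, idle_timeout=60.0):
        self.rate = rate_per_sec
        self.burst = burst
        self.idle_timeout = idle_timeout
        self._buckets = {}  # client IP -> (tokens left, time last seen)

    def allow(self, client_ip, now=None):
        """Return True if a packet from client_ip may pass. No DNS payload is passed in."""
        now = time.monotonic() if now is None else now
        tokens, last = self._buckets.get(client_ip, (self.burst, now))
        # Refill in proportion to the elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[client_ip] = (tokens, now)
        return allowed

    def forget_idle(self, now=None):
        """Drop state for clients that have been quiet for idle_timeout seconds."""
        now = time.monotonic() if now is None else now
        self._buckets = {
            ip: (tokens, seen)
            for ip, (tokens, seen) in self._buckets.items()
            if now - seen < self.idle_timeout
        }
```

Because `allow` never receives the query name or type, nothing in this state can be correlated with what a client asked for.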

Number of unique users in their policy may refer to this.

Rather than speculating, maybe @vavrusa can clarify what exactly gets logged and what Number of unique users refers to?

irtefa commented 5 years ago

Hi,

I am the product manager for the 1.1.1.1 team. I can see why this can be confusing. We don't store anything that can actually tell us how many unique users we have for the public DNS resolver. We do internally sometimes make rough estimates based on the number of queries.

Here's what we actually log:

We will work on making this clearer in our privacy policy.

jedisct1 commented 5 years ago

Thanks a lot for chiming in and for the clarification, Mohd!

So, shall we define "non-logging" as "doesn't log or use the client IP address, except for rate limiting, and without correlation with DNS queries"?

What do you think?

The "non-logging" bit is important, if only because by default, dnscrypt-proxy ignores resolvers that don't have that bit set (and we probably shouldn't change that).
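For reference, the bit lives in the `props` field of a server's DNS stamp, where the value 2 means "doesn't keep logs" (1 is DNSSEC, 4 is no filtering). Below is a minimal sketch of reading it, assuming a server stamp whose first byte is the protocol identifier followed by an 8-byte little-endian `props` value; relay stamps have no `props` field and are not handled. The helper names are made up for this example.

```python
import base64

PROP_DNSSEC = 1     # resolver supports DNSSEC
PROP_NO_LOGS = 2    # resolver claims not to keep logs
PROP_NO_FILTER = 4  # resolver claims not to filter or modify queries


def stamp_props(stamp):
    """Return the props bit field of a server sdns:// stamp."""
    prefix = "sdns://"
    if not stamp.startswith(prefix):
        raise ValueError("not a DNS stamp")
    encoded = stamp[len(prefix):]
    encoded += "=" * (-len(encoded) % 4)  # restore the padding that stamps omit
    raw = base64.urlsafe_b64decode(encoded)
    if len(raw) < 9:
        raise ValueError("stamp too short to carry a props field")
    # Byte 0 is the protocol identifier; bytes 1 to 8 are props, little-endian.
    return int.from_bytes(raw[1:9], "little")


def is_non_logging(stamp):
    """True if the stamp advertises the non-logging property."""
    return bool(stamp_props(stamp) & PROP_NO_LOGS)
```

A client that only wants non-logging resolvers can filter its server list with `is_non_logging`. Note that the bit is a self-declared claim in the list, which is why agreeing on its definition matters.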

irtefa commented 5 years ago

"non-logging" as "doesn't log or use the client IP address, except for rate limiting

Yes. IMO, that's fair.

publicarray commented 5 years ago

Yes I’m happy with that 👍

brainscar commented 5 years ago

@irtefa could you please confirm the end of the sentence applies to Cloudflare too?

doesn't log or use the client IP address, except for rate limiting, and without correlation with DNS queries.

Then as far as my opinion goes, I'm good with it too, as my only concern left was the one @jedisct1 mentioned here: https://github.com/DNSCrypt/dnscrypt-resolvers/issues/128#issuecomment-494768532

irtefa commented 5 years ago

That's correct. We may use IP addresses for rate limiting, but we don't log them. Furthermore, they are not associated with DNS queries.

captn3m0 commented 5 years ago

How about changing "log" to "retain"?

"doesn't retain the client IP address, except for rate limiting, and without correlation with DNS queries"?

For DoH resolvers, even things like User-Agent + ASN might be enough to identify users, so changing "client IP address" to "user identifiable information" might be better.

The Mozilla DoH resolver policy takes it up nicely: https://wiki.mozilla.org/Security/DOH-resolver-policy