caddyserver / caddy

Fast and extensible multi-platform HTTP/1-2-3 web server with automatic HTTPS
https://caddyserver.com
Apache License 2.0

Log Privacy #1769

Closed myReaper closed 6 years ago

myReaper commented 7 years ago

Hi,

I'm filing this directly as a "feature request", as I can't find any option for this in the documentation :-)

For privacy reasons it is crucial to strip the last octet, or the last two octets, of an IP address in the access.log. I'm currently using the "{combined}" format to log my websites and can't find any option in the documentation to strip them.

I think this is also mandatory for many caddy users because of privacy regulations in many countries.

Best Regards

francislavoie commented 7 years ago

I think adding an additional replacer as an alternative to {remote} would be the solution here. Maybe something like {remote_private}? That way you can just specify the log format by hand, for example:

log / access.log "{remote_private} - - [{when}] \"{method} {uri} {proto}\" {status} {size}"

Relevant spots in the code, for reference: https://github.com/mholt/caddy/blob/a6ec51b34931aed508e35563dc82641b9bd6faa9/caddyhttp/httpserver/replacer.go#L264 https://github.com/mholt/caddy/blob/f32eed1912c3a7a2f60dd5d489123ae3491586b3/caddyhttp/log/log.go#L80

myReaper commented 7 years ago

I'm no programmer, so I can't tell you what the best option would be code-wise :-)

Just from a user perspective, it would be great to still use the {combined} option and maybe just set an additional parameter like "anonymize-ip 1".

With the option to set it from 0 (default, logging all octets) to 4, depending on how many octets I would like to anonymize (1 for 123.123.123.0, 3 for 123.0.0.0).
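The octet-count idea above can be sketched in a few lines of Go. This is only an illustration of the proposed semantics for IPv4, not Caddy code: the function name zeroOctets is hypothetical, and "anonymize-ip" is the option name suggested in this comment, not an existing directive.

```go
package main

import (
	"fmt"
	"net"
)

// zeroOctets clears the last n octets of an IPv4 address (n in 0..4).
// n = 0 leaves the address unchanged. Hypothetical helper for illustration.
func zeroOctets(addr string, n int) (string, error) {
	ip := net.ParseIP(addr).To4()
	if ip == nil {
		return "", fmt.Errorf("not an IPv4 address: %q", addr)
	}
	if n < 0 || n > 4 {
		return "", fmt.Errorf("octet count %d out of range 0..4", n)
	}
	// Work on a copy so the parsed address is not modified in place.
	out := make(net.IP, len(ip))
	copy(out, ip)
	for i := len(out) - n; i < len(out); i++ {
		out[i] = 0
	}
	return out.String(), nil
}

func main() {
	for _, n := range []int{0, 1, 3} {
		masked, err := zeroOctets("123.123.123.123", n)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("anonymize-ip %d: %s\n", n, masked)
	}
}
```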

dmke commented 7 years ago

Please note that for IPv6, the last two octets are not enough to get satisfying results. To quote this SO answer:

At the very least you want to strip the EUI-64 off, i.e the last 64 bits of the address. more realistically you want to strip quite a lot more to really be private, since the remaining part will still identify only one subnet (i.e. one house possibly)

[...]

You can implement this with a bitmask exactly like you would for an IPv4 address, the question becomes a legal one though of "how much do I need to strip to comply with the specific legislation", not a technical one at that point though

Based on the suggestion, I propose the following config block options:

log path file [format] {
  rotate_...
  anonymize_v4    number-of-bits   # 0..32
  anonymize_v6    number-of-bits   # 0..128
}

This could be implemented relatively easily using net#CIDRMask and net.IP#Mask. Here's a small demo: https://play.golang.org/p/Ib6y4FwrCt

If performance is critical, this approach might not be suitable, since it requires parsing the RemoteAddr via net#ParseIP, and both net.IP#Mask and net.IP#String are not allocation-free...
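For readers who cannot open the playground link, here is a minimal sketch of the bitmask approach, written independently of the linked demo. It assumes that number-of-bits means the number of trailing bits to clear, so that 0 disables anonymization. The function name anonymize and its parameters are hypothetical, chosen to mirror the proposed anonymize_v4 and anonymize_v6 options.

```go
package main

import (
	"fmt"
	"net"
)

// anonymize clears the v4Bits (0..32) or v6Bits (0..128) low-order bits of
// addr, depending on the address family. Hypothetical helper for illustration.
func anonymize(addr string, v4Bits, v6Bits int) (string, error) {
	ip := net.ParseIP(addr)
	if ip == nil {
		return "", fmt.Errorf("invalid IP address %q", addr)
	}
	if ip4 := ip.To4(); ip4 != nil {
		if v4Bits < 0 || v4Bits > 32 {
			return "", fmt.Errorf("v4 bits %d out of range 0..32", v4Bits)
		}
		// CIDRMask takes the number of leading one bits, i.e. the bits to keep.
		return ip4.Mask(net.CIDRMask(32-v4Bits, 32)).String(), nil
	}
	if v6Bits < 0 || v6Bits > 128 {
		return "", fmt.Errorf("v6 bits %d out of range 0..128", v6Bits)
	}
	return ip.Mask(net.CIDRMask(128-v6Bits, 128)).String(), nil
}

func main() {
	for _, addr := range []string{
		"192.168.111.222",
		"2001:db8:85a3:8d3:1319:8a2e:370:7348",
	} {
		masked, err := anonymize(addr, 16, 64)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(addr, "->", masked)
	}
}
```

With 16 bits cleared for IPv4 and 64 for IPv6, the two sample addresses become 192.168.0.0 and 2001:db8:85a3:8d3::. Each call parses the address and allocates a new one, which is the overhead mentioned above.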

tobya commented 7 years ago

Excuse my ignorance in this debate, but is it simply possible to leave out the ip address from the logs entirely or do you need the anonymized IP address for some reason?

myReaper commented 7 years ago

I think the {combined} format is standardized (or a de facto standard), so many tools (like GoAccess, Webalizer, ...) can parse it directly without having to be configured for a special log file syntax.

Also it might be interesting to even just have the first two octets instead of none for statistic reasons.

tobya commented 7 years ago

The IP address can be left out and - inserted and should still be able to be read by Log analysis software.

Otherwise, if we are to implement it, my view is that we should keep it as simple as possible.

{remote_anon} should be the industry standard anonymous IP without configuration options

perhaps 231.234.XXX.XXX or whatever is the standard then

{combined} would remain the same but {combined_anon} would replace {remote} with {remote_anon}.

This, I think, would be a tidy way of doing it.

dmke commented 7 years ago

The beauty of a bitmask is that setting it to 0 disables anonymization, so we wouldn't need {remote_anon} and/or {combined_anon} replacements.

[...] the industry standard [...] [...] perhaps 231.234.XXX.XXX or whatever is the standard then

This is a legislative problem, not a technical one. There simply cannot be a "standard" value (except maybe "don't log the remote address at all").

tobya commented 7 years ago

This seems to me like something that could be a 3rd party plugin if anyone was interested in implementing it. It may however run into similar issues as #1542 with getting access to the replacer from the correct context.

If someone could look at a creative solution to that problem it would make plugins that modify placeholders able to be implemented.

myReaper commented 7 years ago

@tobya To me this is some basic functionality of a webserver, also because it is mandatory by law in many countries. I would not be willing to add any 3rd party plugin into caddy for this. I think this should get integrated directly into caddy.

mholt commented 7 years ago

Is this feature mainly being requested by employees of companies that require this kind of redaction?

myReaper commented 7 years ago

I think this feature is generally required when using caddy in Germany (or most of the EU), regardless of whether you are running a private website or a business website. It's the law that's forcing every website to anonymize the logged IP addresses.

mholt commented 7 years ago

@myReaper Do you have a link to the specific law you are referring to? I want to make sure I'm understanding the right thing.

magikstm commented 7 years ago

@myReaper may be able to provide the exact law he needs to abide by.

I read these some weeks ago: https://arstechnica.co.uk/tech-policy/2016/10/eu-dynamic-static-ip-personal-data/ https://www.bna.com/ip-addresses-protected-n57982079024/

Judgment here: https://curia.europa.eu/jcms/upload/docs/application/pdf/2016-10/cp160112en.pdf

Full text here: http://curia.europa.eu/juris/documents.jsf?num=C-582/14

The final judgment is rather "incomplete" and isn't worldwide. There may be reasons not to store full IPs, depending on location and possible use.

myReaper commented 7 years ago

@mholt @magikstm I would have provided the same links.

Yes, it is not worldwide, but the court ruling is EU-wide from what I can tell. We generally have some pretty strong privacy laws here in the EU and Germany.

This ruling also implies (although it does not say so explicitly) that you should mask 2 bytes of the IP address (192.168.xxx.xxx) rather than only 1 byte (192.168.111.xxx), because with only 1 byte masked it is most likely still possible to identify a user.

dmke commented 7 years ago

The current law in Germany (as the GDPR is still half a year away) is the "Bundesdatenschutzgesetz" (BDSG, Federal Data Privacy Law). The exact paragraph is hard to determine, since it falls under the category of a "Verbotsgesetz" (meaning it is forbidden to process personal/identifying data unless you've got permission from the data owner to do so). The law itself defines exceptions to these limitations, § 1 Zweck und Anwendungsbereich des Gesetzes (purpose and application of the law):

(1) Zweck dieses Gesetzes ist es, den Einzelnen davor zu schützen, dass er durch den Umgang mit seinen personenbezogenen Daten in seinem Persönlichkeitsrecht beeinträchtigt wird.

(2) Dieses Gesetz gilt für die Erhebung, Verarbeitung und Nutzung personenbezogener Daten durch

  1. öffentliche Stellen des Bundes,
  2. öffentliche Stellen der Länder, […],
  3. nicht-öffentliche Stellen, soweit sie die Daten unter Einsatz von Datenverarbeitungsanlagen verarbeiten, nutzen oder dafür erheben oder die Daten in oder aus nicht automatisierten Dateien verarbeiten, nutzen oder dafür erheben, es sei denn, die Erhebung, Verarbeitung oder Nutzung der Daten erfolgt ausschließlich für persönliche oder familiäre Tätigkeiten. […]

This roughly translates to (IANAL and not a native English speaker):

(1) The purpose of this law is to protect individuals from having their personal rights impaired through the handling of their personal data.

(2) This law applies to the collection, processing and use of personal data by

  1. public authorities at federal level,
  2. public authorities at non-federal level, […],
  3. non-public bodies, insofar as they process, use or collect data by means of data processing systems, or process, use or collect data in or from non-automated files, unless the collection, processing or use is solely for personal or familial purposes. […]

(Background: the German Supreme Court declared data privacy a constitutional right in the 1980s.)

Now, as to whether or not an IP address is personally identifying data: there are AFAIK no court decisions or clear statements in the law. However, the general consensus is that an IP address on its own does not identify a person (and, by implication, collection and usage is fine), but the moment you are able to link the address (either by legal or technical means) with other identifying data, such as a name or email address, the IP address becomes identifying data.

An ISP, for example, can perform a database lookup to map an IP address to a name, address, ..., so the ISP needs a special agreement with its customers (you'll find such a clause in the contracts).

A personal homepage owner, on the other hand, might be fine logging all IP addresses (unobfuscated), as long as he/she only serves static pages (i.e. no user login when proxying to an application server, wiki, forum, ...). If the personal homepage contains a contact form, the user must be informed about data processing (including the exact data points processed), even if this is just a simple email forwarder. The moment a user submits the form, the owner can use its timestamp to associate the information entered with her/his logs, and the IP address becomes person-identifiable.

mholt commented 7 years ago

Thanks for the info everyone! This is helpful. I'll review when I come back around to this!

tobya commented 7 years ago

I have a working proof of concept for this. I will try to push it tomorrow to see if I'm heading in the right direction.

Varbin commented 6 years ago

@tobya Does your patch only work with the access log, or will IP addresses in the error log be masked as well?

tobya commented 6 years ago

@Varbin The ipmask directive will only mask IP addresses output by the {remote} placeholder in the access logs. The error directive doesn't actually do any replacement of anything; it just outputs the error as it comes from your server process. If you wished to mask IP addresses in errors, you would need to run some other sort of filter. This could be a directive, but at the moment this does not happen.
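As a rough idea of what such an external filter could look like, the Go sketch below reads log lines from standard input and masks anything that parses as an IPv4 address. It is not part of Caddy or of the ipmask patch; the name newMasker and the choice to keep the leading 16 bits are assumptions made for illustration. IPv6 addresses are not handled, and other dotted quads (version strings, for example) would be masked too.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"os"
	"regexp"
)

// ipv4Pattern matches candidate dotted quads; net.ParseIP validates them later.
var ipv4Pattern = regexp.MustCompile(`\b(?:\d{1,3}\.){3}\d{1,3}\b`)

// newMasker returns a function that keeps the leading keepBits bits (0..32)
// of every IPv4 address found in a line and zeroes the rest.
// Hypothetical helper for illustration.
func newMasker(keepBits int) (func(string) string, error) {
	mask := net.CIDRMask(keepBits, 32)
	if mask == nil {
		return nil, fmt.Errorf("keepBits %d out of range 0..32", keepBits)
	}
	return func(line string) string {
		return ipv4Pattern.ReplaceAllStringFunc(line, func(m string) string {
			ip := net.ParseIP(m).To4()
			if ip == nil {
				return m // not a valid address, leave the text untouched
			}
			return ip.Mask(mask).String()
		})
	}, nil
}

func main() {
	maskLine, err := newMasker(16)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		fmt.Println(maskLine(sc.Text()))
	}
	if err := sc.Err(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```

Such a program could be run over an existing error log by piping the file through it and writing the result to a new file.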

Varbin commented 6 years ago

Thank you for your fast answer.