edicl / hunchentoot

Web server written in Common Lisp
https://edicl.github.io/hunchentoot/
698 stars 123 forks source link

Memory leaks by interning external strings #24

Open hanshuebner opened 12 years ago

hanshuebner commented 12 years ago

Hunchentoot uses CHUNGA's as-keyword method to convert strings to keywords. In several cases, the strings are coming from external sources (e.g. method in incoming requests), and by interning arbitrary strings, memory can be leaking.

sletner commented 12 years ago

I wrote this hack as a work-around. It seems to solve the problem. It should not break most applications, since keywords are usually already interned from the source, but I included an earmuff-override for when it does.

(in-package :chunga)

(defvar *intern-unsafely*)

(defun as-keyword (string &key (destructivep t))
  (or (gethash string +string-to-keyword-hash+)
      (find-symbol (string-upcase string) (find-package "KEYWORD"))
      (if (and (boundp '*intern-unsafely*) *intern-unsafely*)
          (make-keyword string destructivep)
          string)))
hanshuebner commented 12 years ago

I have pondered a similar approach - The problem that I see is that it kind of breaks the representation of headers in Hunchentoot. Unknown headers no longer have a keyword as they key in the headers alist, but a string. While this might not be a big problem, it also does look hackish to me.

I would like to see this problem solved by completely replacing the intern invocations by find-symbol, and change the representation of headers so that they always have a string as their key.

sletner commented 12 years ago

Yeah, changing the headers to always be strings would probably be best, but I imagine that will be rather a bigger project. Plus it will break existing software.

vindarel commented 2 years ago

re headers, I read how @susam configures her Nginx with a limited set of headers:

I have now updated the Nginx configuration to block arbitrary headers coming 
from the remote client and explicitly pass only a limited set of headers to Hunchentoot. 
So now I have something like this in the Nginx reverse proxy configuration:

  proxy_pass_request_headers off;
  proxy_set_header Accept $http_accept;
  proxy_set_header Content-Length $http_content_length;
  proxy_set_header Content-Type $http_content_type;
  proxy_set_header If-Modified-Since $http_if_modified_since;
  proxy_set_header Referer $http_referer;
  proxy_set_header User-Agent $http_user_agent;
  proxy_set_header X-Forwarded-For $remote_addr;

I will push this configuration to the GitHub repository too sometime this weekend. 

repository mentioned: Mathb. edit: https://github.com/susam/mathb/blob/main/etc/nginx/https.mathb.in

daninus14 commented 1 week ago

Yeah, changing the headers to always be strings would probably be best, but I imagine that will be rather a bigger project. Plus it will break existing software.

This could simply be an option in hunchentoot *headers-as-strings* that's set to NIL by default. If set to T then the headers will be represented with strings. That way we avoid breaking any existing applications, and it should be relatively easy to implement.

What do you guys think? Are any of you maintainers? Would a PR for this be accepted?

stassats commented 1 week ago

Maybe we should break things. But only if there's a way to fail at compile-time.

hanshuebner commented 1 week ago

Maybe add a white list of known headers and replace the interning with a lookup? Having access to normalized header names by default is very useful, and using symbols to represent headers is idiomatic.

daninus14 commented 1 week ago

Maybe add a white list of known headers and replace the interning with a lookup? Having access to normalized header names by default is very useful, and using symbols to represent headers is idiomatic.

And then treat non standard headers as strings? I for one use custom headers for things like jwt authentication, captcha tokens, and others...

Here's a list of known headers: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers

Accept
Accept-CH
Accept-Charset
Accept-Encoding
Accept-Language
Accept-Patch
Accept-Post
Accept-Ranges
Access-Control-Allow-Credentials
Access-Control-Allow-Headers
Access-Control-Allow-Methods
Access-Control-Allow-Origin
Access-Control-Expose-Headers
Access-Control-Max-Age
Access-Control-Request-Headers
Access-Control-Request-Method
Age
Allow
Alt-Svc
Alt-Used
Attribution-Reporting-Eligible
Attribution-Reporting-Register-Source
Attribution-Reporting-Register-Trigger
Authorization
Cache-Control
Clear-Site-Data
Connection
Content-Digest
Content-Disposition
Content-DPR
Content-Encoding
Content-Language
Content-Length
Content-Location
Content-Range
Content-Security-Policy
Content-Security-Policy-Report-Only
Content-Type
Cookie
Critical-CH
Cross-Origin-Embedder-Policy
Cross-Origin-Opener-Policy
Cross-Origin-Resource-Policy
Date
Device-Memory
Digest
DNT
Downlink
DPR
Early-Data
ECT
ETag
Expect
Expect-CT
Expires
Forwarded
From
Host
If-Match
If-Modified-Since
If-None-Match
If-Range
If-Unmodified-Since
Keep-Alive
Last-Modified
Link
Location
Max-Forwards
NEL
No-Vary-Search
Observe-Browsing-Topics
Origin
Origin-Agent-Cluster
Permissions-Policy
Pragma
Priority
Proxy-Authenticate
Proxy-Authorization
Range
Referer
Referrer-Policy
Refresh
Report-To
Reporting-Endpoints
Repr-Digest
Retry-After
RTT
Save-Data
Sec-Browsing-Topics
Sec-CH-Prefers-Color-Scheme
Sec-CH-Prefers-Reduced-Motion
Sec-CH-Prefers-Reduced-Transparency
Sec-CH-UA
Sec-CH-UA-Arch
Sec-CH-UA-Bitness
Sec-CH-UA-Full-Version
Sec-CH-UA-Full-Version-List
Sec-CH-UA-Mobile
Sec-CH-UA-Model
Sec-CH-UA-Platform
Sec-CH-UA-Platform-Version
Sec-Fetch-Dest
Sec-Fetch-Mode
Sec-Fetch-Site
Sec-Fetch-User
Sec-GPC
Sec-Purpose
Sec-WebSocket-Accept
Sec-WebSocket-Extensions
Sec-WebSocket-Key
Sec-WebSocket-Protocol
Sec-WebSocket-Version
Server
Server-Timing
Service-Worker-Navigation-Preload
Set-Cookie
Set-Login
SourceMap
Speculation-Rules
Strict-Transport-Security
Supports-Loading-Mode
TE
Timing-Allow-Origin
Tk
Trailer
Transfer-Encoding
Upgrade
Upgrade-Insecure-Requests
User-Agent
Vary
Via
Viewport-Width
Want-Content-Digest
Want-Digest
Want-Repr-Digest
Warning
Width
WWW-Authenticate
X-Content-Type-Options
X-DNS-Prefetch-Control
X-Forwarded-For
X-Forwarded-Host
X-Forwarded-Proto
X-Frame-Options
X-XSS-Protection
hanshuebner commented 1 week ago

And then treat non standard headers as strings?

Not quite. The documentation suggests that keywords are used for header names when calling HEADER-IN, so the lookup could actually just be FIND-SYMBOL and then find custom headers because they're interned into the keyword package by the source code. Only headers that have no corresponding keyword would end up being treated as strings.

daninus14 commented 1 week ago

I just made this in case it's useful: https://github.com/daninus14/cl-http-headers

hanshuebner commented 1 week ago

I was thinking more in the lines of changing STRING-AS-KEYWORD here:

https://github.com/edicl/hunchentoot/blob/754f7e35c8fd4867118d4fab9c6f175fe4aae2f0/acceptor.lisp#L758

As it stands now, this interns any header name into the keyword package, whereas it should really perform a lookup.

To illustrate, consider this:

* (find-symbol "FOO" :keyword)
NIL
NIL
* :foo
:FOO
* (find-symbol "FOO" :keyword)
:FOO
:EXTERNAL

Once a keyword has been used (i.e. by writing it into the source code of your program), it will be interned into :KEYWORD package and thus be found by FIND-SYMBOL.

One will have to check whether STRING-AS-KEYWORD is used in other places, and what a change in the behavior would mean to those other places. Given that interning strings received from an external source is always unsafe, those call sites should also be changed, if required. I'd also rename the function to LOOKUP-HEADER-SYMBOL. One bit complication is the reader's casing rules which can be influenced by special variables (which is what the "reader defaults" comment refers to). A proper solution will have to adhere to those rules to appease folks who insist on using non-default reader configurations.

daninus14 commented 1 week ago

Ok, I'm not acquainted with the code base to follow, but it seems string-as-keyword is only used once. I couldn't find where the headers are being parsed into being keywords

hanshuebner commented 1 week ago

STRING-AS-KEYWORD is what turns a header into a keyword. It does so by using READ-FROM-STRING instead of INTERN because that way, if a programmer chooses to reconfigure their reader settings in a non-standard way, header case would be interned correctly.

Here's a brief rundown what this is about: By default, most (but not all) lisp implementation convert symbols that are read in the REPL or from source code to upper case. This is why when you type :foo in the REPL, it echoes :FOO. Now, for historic reasons, this behavior is configurable through the Readtable. For some reason, we have at some point in time decided that we want to preserve the case of incoming headers so that if the Readtable is configured to preserve case of symbols, headers would also be interned in mixed case. This is not a good assumption to make anyway, because headers are case insensitive in the http protocol and a server could legitimately send COnTENT-Length: 2345 and still mean the Content-Length header.

For that reason, I think it would be best to replace the implementation of STRING-AS-KEYWORD by a straight call to (find-symbol (string-upcase string) :keyword). The risk of breaking existing code would exist, but I think it would be a slim chance and result in a safer and more predictable behavior.

daninus14 commented 1 week ago

It looks like here is where the headers are parsed: (defun get-request-data (stream) which calls this read-http-headers here https://github.com/edicl/hunchentoot/blob/master/headers.lisp#L285C27-L285C42

That is from chunga, and here in line 182 it calls as-keyword

From here it looks like that's the only place where as-keyword is used: https://github.com/search?q=repo%3Aedicl%2Fchunga%20as-keyword&type=code

So @hanshuebner what you suggested is basically what @sletner had originally done in terms of using keywords if they exist, just without interning them. So the actual solution is similar to what it was?

daninus14 commented 1 week ago

I just realized the code was exactly what @sletner wrote above. I just added it as a PR. The default behavior will not allow interning, but for whoever has an app that will break because of this, it allows it, per @sletner.

It probably makes sense to add this to the docs. Should I add a comment in the docs if you'll approve the PR?

daninus14 commented 1 week ago

This issue can now be closed since it was solved by https://github.com/edicl/chunga/pull/11