ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

Always send the `Host` header first #468

Open JustAnotherArchivist opened 3 years ago

JustAnotherArchivist commented 3 years ago

Currently, the Host header is always sent last because it is added automatically in `wpull.protocol.http.request.Request.prepare_for_send`, after the other headers have already been set. I propose changing this so that the Host header line is always sent first.
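
To make the proposed ordering concrete, here is a minimal sketch of the idea, not wpull's actual code. It assumes headers are held as an ordered list of `(name, value)` pairs; the helper name `host_first` is hypothetical and only meant to illustrate the reordering.

```python
from typing import List, Tuple

Header = Tuple[str, str]


def host_first(headers: List[Header], host: str) -> List[Header]:
    """Return a new header list with exactly one Host field, placed first.

    Hypothetical helper for illustration; not part of wpull.
    """
    # Header field names are case-insensitive, so compare lowercased names
    # when dropping any Host field that is already present.
    others = [(name, value) for name, value in headers if name.lower() != "host"]
    return [("Host", host)] + others


if __name__ == "__main__":
    headers = [("User-Agent", "example-agent"), ("Accept", "*/*")]
    for name, value in host_first(headers, "example.com"):
        print(f"{name}: {value}")
```

Building a new list instead of mutating the input keeps the caller's headers untouched, which may or may not match how the real request object should behave.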

Theoretically, this shouldn't matter. The order of header lines is not significant in HTTP. From RFC 7230 section 3.2.2:

> The order in which header fields with differing field names are received is not significant. However, it is good practice to send header fields that contain control data first, such as Host on requests and Date on responses, so that implementations can decide when not to handle a message as early as possible.

Unfortunately, Cloudflare appears (since recently?) to treat requests differently when the Host header doesn't come first.

Example of different header order producing different results on Cloudflare with curl

```
> curl -A 'Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0' https://bund.lkr.de/ -sv --http1.1
[snip]
> GET / HTTP/1.1
> Host: bund.lkr.de
> User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/1.1 307 Temporary Redirect
< Date: Mon, 27 Sep 2021 22:35:35 GMT
< Content-Type: text/html;charset=UTF-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< location: /start/
< CF-Cache-Status: DYNAMIC
< Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< Report-To: [snip]
< NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
< Strict-Transport-Security: max-age=0; includeSubDomains; preload
< X-Content-Type-Options: nosniff
< Server: cloudflare
< CF-RAY: [snip]
< alt-svc: h3=":443"; ma=86400, h3-29=":443"; ma=86400, h3-28=":443"; ma=86400, h3-27=":443"; ma=86400
<
* Connection #0 to host bund.lkr.de left intact

> curl -A 'Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0' -H 'Host:' -H 'Host: bund.lkr.de' https://bund.lkr.de/ -sv --http1.1
[snip]
> GET / HTTP/1.1
> User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0
> Accept: */*
> Host: bund.lkr.de
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/1.1 503 Service Temporarily Unavailable
< Date: Mon, 27 Sep 2021 22:35:45 GMT
< Content-Type: text/html; charset=UTF-8
< Transfer-Encoding: chunked
< Connection: close
< X-Frame-Options: SAMEORIGIN
< Permissions-Policy: accelerometer=(),autoplay=(),camera=(),clipboard-read=(),clipboard-write=(),fullscreen=(),geolocation=(),gyroscope=(),hid=(),interest-cohort=(),magnetometer=(),microphone=(),payment=(),publickey-credentials-get=(),screen-wake-lock=(),serial=(),sync-xhr=(),usb=()
< Cache-Control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< Expires: Thu, 01 Jan 1970 00:00:01 GMT
< Expect-CT: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< Report-To: [snip]
< NEL: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
< Strict-Transport-Security: max-age=0; includeSubDomains; preload
< X-Content-Type-Options: nosniff
< Server: cloudflare
< CF-RAY: [snip]
< alt-svc: h3=":443"; ma=86400, h3-29=":443"; ma=86400, h3-28=":443"; ma=86400, h3-27=":443"; ma=86400
<
Just a moment... [snip]
```

`-H 'Host:' -H 'Host: bund.lkr.de'` first removes the header and then adds it again, forcing it to be at the end. The 307 is the expected response for this site, the 503 is the Cloudflare JS challenge.
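
For anyone who wants to reproduce the two orderings without relying on curl's remove-and-re-add trick, the following is a rough sketch that serializes an HTTP/1.1 request with the Host line either first or last. The function name `build_request` and its `host_first` parameter are assumptions made for this illustration; nothing here is taken from wpull, and the sketch only builds the bytes, it does not send them.

```python
from typing import List, Tuple


def build_request(path: str, host: str, headers: List[Tuple[str, str]],
                  host_first: bool = True) -> bytes:
    """Serialize a bodyless GET request with Host placed first or last.

    Hypothetical helper for illustration only.
    """
    fields = list(headers)  # copy so the caller's list is not modified
    if host_first:
        fields.insert(0, ("Host", host))
    else:
        fields.append(("Host", host))
    lines = [f"GET {path} HTTP/1.1"] + [f"{name}: {value}" for name, value in fields]
    # Each line ends with CRLF; an empty line terminates the header section.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


if __name__ == "__main__":
    common = [("User-Agent", "example-agent"), ("Accept", "*/*")]
    print(build_request("/", "example.com", common, host_first=True).decode("ascii"))
    print(build_request("/", "example.com", common, host_first=False).decode("ascii"))
```

The resulting bytes could be written to a TLS socket to compare server responses, but that part is left out so the sketch stays independent of any particular site.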
JustAnotherArchivist commented 2 years ago

Another example we stumbled across in #archivebot today. Note that it only happens with HTTP/1.1. Buttflare's HTTP servers are very broken...

Example

```
> curl -sv -A 'Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0' -H 'Host:' -H 'Host: cop.unasiapacific.org' https://cop.unasiapacific.org/feed
[snip]
> GET /feed HTTP/2
> User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0
> Accept: */*
> Host: cop.unasiapacific.org
>
[snip]
< HTTP/2 200
[snip]

> curl -sv -A 'Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0' -H 'Host:' -H 'Host: cop.unasiapacific.org' --http1.1 https://cop.unasiapacific.org/feed
[snip]
> GET /feed HTTP/1.1
> User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:60.0) Gecko/20100101 Firefox/60.0
> Accept: */*
> Host: cop.unasiapacific.org
>
[snip]
< HTTP/1.1 403 Forbidden
```