elixir-mint / mint

Functional HTTP client for Elixir with support for HTTP/1 and HTTP/2 🌱
Apache License 2.0

Allow not validating target #453

Closed · PragTob closed this 1 week ago

PragTob commented 2 weeks ago

:wave:

Hi there everyone, and thank you so much for mint - it's been a joy to use via req and helps us make a significant number of HTTP requests! :green_heart:

Problem

In our context we're frequently confronted with query strings that aren't strictly standards-conformant. These contain "macros", which look something like this: referrer={encSite} and can occur multiple times per URL.

There are many different patterns of these that I know of, usually occurring in the value part of the query params.

This blows up in Mint.HTTP1.Request.validate_target!/1 (and /2): https://github.com/elixir-mint/mint/blob/50b11d668b6a240b0d9b20c67fbb41a10a7410b1/lib/mint/http1/request.ex#L49-L65

While I understand enforcing standards, sometimes the real (old) applications out there make that hard.

Small example

```
iex(1)> HttpService.get!("https://some.url.com/path?rb=51439&ca=20863020&ra=1730132562831&_o=51439&_t=20863020")
# in this example, a redirect happens
** (Req.HTTPError) invalid request target: "/path?c=bd8618c307ae9885a12561b7191e2cea&cid=5134455426882927793&referrer={encSite}&some=more"
    (req 0.5.6) lib/req.ex:1092: Req.request!/2
    iex:1: (file)
```

Solution

I'd be really happy about being able to provide an option where validating the target is skipped or made more lenient. Or some way in which I could easily accomplish that. I'm not sure of the consequences of this but yeah :)

(and then get to it through Req and Finch)
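To make the ask concrete, here is a purely hypothetical sketch of what an opt-out could look like - the option name and the fact that it's passed at connect time are made up for illustration and are not an existing Mint API:

```elixir
# Hypothetical: neither the option name nor where it is passed exist in Mint today.
{:ok, conn} =
  Mint.HTTP.connect(:https, "some.url.com", 443, skip_target_validation: true)

# With validation skipped, a non-conformant target would be sent as-is.
{:ok, conn, request_ref} =
  Mint.HTTP.request(conn, "GET", "/path?referrer={encSite}", [], "")
```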

I'm happy to implement code to do this, once aligned on a solution.

I also have a bunch of affected URLs in tests (running against bypass) or as examples lying around. So, I can definitely test solutions.

Workaround

The current workaround we use is that we implemented code to sanitize the query string (working similarly to validate_target!/1; it's fun because the URLs we get are partially encoded and partially not, so running encode on all of it is not what we want).

The topic gets a bit harder though, as sometimes redirects will also include these broken URLs, which is what you see in the example I posted above. As a workaround for this we're now implementing our own redirect following (so that we can sanitize the query strings on every hop). This is the point at which I decided to open this issue, as I'd really like to get rid of at least this part :sweat_smile: (but at best all the custom code I wrote for this).
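For the curious, here is a rough sketch of that kind of manual redirect following (illustrative only, not our exact code; it assumes Req's :redirect option and Req.Response.get_header/2, uses the SanitizeQueryString module shared further down, and glosses over details like relative Location headers):

```elixir
defmodule ManualRedirect do
  # Rough sketch: follow redirects by hand so that every hop's query string
  # can be sanitized before the request is made. Illustrative only.
  @max_hops 5

  # Raises FunctionClauseError once @max_hops is exceeded.
  def get!(url, hops \\ 0) when hops < @max_hops do
    response = url |> sanitize_url() |> Req.get!(redirect: false)

    case response.status do
      status when status in [301, 302, 303, 307, 308] ->
        # Assumes an absolute Location header for brevity.
        [location | _] = Req.Response.get_header(response, "location")
        get!(location, hops + 1)

      _other ->
        response
    end
  end

  defp sanitize_url(url) do
    uri = URI.parse(url)

    case uri.query do
      nil -> url
      query -> URI.to_string(%{uri | query: SanitizeQueryString.sanitize(query)})
    end
  end
end
```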

Behavior of other HTTP clients

For funzies I tried the URL in multiple HTTP clients, and it worked in all of them: Firefox, Chrome, curl, wget

So while I don't think it's standards-conformant, it seems to be common practice among some of the most popular/widely used "HTTP" clients. I think at least having the option to process such URLs would be nice.


Again, thanks a lot for all your work in providing mint and friends. It's much appreciated! :green_heart:

wojtekmach commented 2 weeks ago

Here is another example of this https://github.com/wojtekmach/req/issues/270#issue-1993428506 (cc @thbar):

iex> {:ok, conn} = Mint.HTTP.connect(:https, "api.oisemob.cityway.fr", 443)
iex> url = "/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"
iex> {:error, conn, e} = Mint.HTTP.request(conn, "GET", url, [], "")
iex> e
%Mint.HTTPError{
  reason: {:invalid_request_target,
   "/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"},
  module: Mint.HTTP1
}

@PragTob if you URI-encode offending characters in your requests, would they still work? It works in the example above:

iex> {:ok, conn} = Mint.HTTP.connect(:https, "api.oisemob.cityway.fr", 443)
iex> url = "/dataflow/offre-tc/download?provider=COROLIS_URB|COROLIS_INT&dataFormat=NETEX&dataProfil=OPENDATA"
iex> url = String.replace(url, "|", URI.encode("|"))
iex> {:ok, conn, ref} = Mint.HTTP.request(conn, "GET", url, [], "")

Instead of making Mint more relaxed (which I'm not super looking forward to doing), I think I can automatically URI-encode a given URL in Req. (I already do that when using params: ....)

WDYT?

PragTob commented 2 weeks ago

@wojtekmach thanks for the comment :pray:

The issue I had with URL encoding the query string was that I think it'll usually end up encoding things twice.

I.e. if I replace my query sanitizing code with just URI.encode, I get test failures like this:

  1) test sanitize/1 if it's already encoded we don't re-encode it (SanitizeQueryStringTest)
     test/sanitize_query_string_test.exs:41
     Assertion with == failed
     code:  assert "key=val%3Dval" == sanitize("key=val%3Dval")
     left:  "key=val%3Dval"
     right: "key=val%253Dval"
     stacktrace:
       test/sanitize_query_string_test.exs:42: (test)
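
To make the double-encoding concrete (this snippet isn't from the test suite, just an illustration): a plain encoder has no way to know that %3D is already an escape, so it escapes the % itself again.

```elixir
iex> URI.encode_www_form("val%3Dval")
"val%253Dval"
iex> URI.decode_www_form("val%253Dval")
"val%3Dval"
```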

And yeah, I don't think I want to double-encode values. Or maybe I'm not getting your approach :thinking:

A solution in Req would definitely work for me; I just wanted to start at mint, as that's where the error originates :)


To illustrate, I think it may be helpful to share the code & tests we're currently using for the escaping:

SanitizeQueryStringTest

```elixir
defmodule SanitizeQueryStringTest do
  use ExUnit.Case, async: true

  import SanitizeQueryString

  describe "sanitize/1" do
    test "a normal query string survives easily" do
      query_string = "a=123"
      assert query_string == sanitize(query_string)
    end

    test "an empty query string stays an empty query string" do
      assert "" == sanitize("")
    end

    test "nil returns an empty query string" do
      assert "" == sanitize(nil)
    end

    test "macros are indeed removed from the query string" do
      assert "a=b&d=%5B%25woof%25%5D&c=%24%7Bmagic%7D&b=%25%25mega%25%25" ==
               sanitize("a=b&d=[%woof%]&c=${magic}&b=%%mega%%")
    end

    test "empty values are also handled fine" do
      assert "a=" == sanitize("a=")
      assert "a=1&b=" == sanitize("a=1&b=")
    end

    test "just keys we also leave alone" do
      assert "key" == sanitize("key")
      assert "oof=first&ha" == sanitize("oof=first&ha")
    end

    test "if somehow the query has more `=`s than it should have" do
      assert "key=val%3Dval" == sanitize("key=val=val")
      assert "key=val&key2=val%3Dvaleria" == sanitize("key=val&key2=val=valeria")
    end

    test "if it's already encoded we don't re-encode it" do
      assert "key=val%3Dval" == sanitize("key=val%3Dval")
    end

    test "`%ebuy!` while technically valid isn't something we want as ! is reserved" do
      # also bypass exploded
      assert "key=%25ebuy%21" == sanitize("key=%ebuy!")
    end

    test "leaves normal values alone even with lots of % encodings" do
      query_string =
        URI.encode_query(%{"a" => "1234", "magic" => "like-to-talk", "text" => "!!!!11232?%$"})

      assert query_string == sanitize(query_string)
    end

    test "handles [%...%] macros" do
      assert "key=%5B%25macro%25%5D" == sanitize("key=[%macro%]")
    end

    test "handles ${...} macros" do
      assert "key=%24%7Bmacro%7D" == sanitize("key=${macro}")
    end

    test "handles %%...%% macros" do
      assert "key=%25%25macro%25%25" == sanitize("key=%%macro%%")
    end

    # if you did a full string regex approach the eagerness may easily lead to this (see U modifier)
    test "does not overeagerly snatch up values in-between macros" do
      query_string = "a=%%magic%%&key=value&b=%%macro%%"

      assert "a=%25%25magic%25%25&key=value&b=%25%25macro%25%25" == sanitize(query_string)
    end

    @leave_alone_values [
      "${miss-closing",
      "%%abds%",
      "%huhu%%",
      "abc${dfg}",
      "abcd%%huha%%",
      "so[%this-is-not-start-of-the-value%]",
      "[%this-is-not-end-of-the-value%]so-much-more",
      "[%am-partial]",
      "[other-partial%]",
      "{missing-dollars}"
    ]

    for value <- @leave_alone_values do
      @value value
      test "#{value} is encoded but decoded it's the same again" do
        query_string = "some-key=#{@value}"

        sanitized = sanitize(query_string)
        decoded = URI.decode(sanitized)

        assert query_string == decoded
      end
    end

    test "and how about someone puts an entire URL in there" do
      assert "url=https%3A%2F%2Fbpi.rtactivate.com%2Ftag%2Fid%2FD1107" ==
               sanitize("url=https://bpi.rtactivate.com/tag/id/D1107")
    end

    test "and how about someone puts an entire URL in there, but escaped" do
      already_escaped = "url=https%3A%2F%2Fbpi.rtactivate.com%2Ftag%2Fid%2FD1107"
      assert already_escaped == sanitize(already_escaped)
    end
  end
end
```
SanitizeQueryString

```elixir
defmodule SanitizeQueryString do
  # _should_ be safe to do, see:
  # * https://url.spec.whatwg.org/#urlencoded-parsing
  # * https://www.rfc-editor.org/rfc/rfc3986#section-2.2
  def sanitize(query_string) when is_binary(query_string) and query_string != "" do
    query_string
    |> String.split("&")
    |> Enum.map_join("&", &sanitize_values/1)
  end

  def sanitize(_), do: ""

  defp sanitize_values(sequence) do
    sequence
    # We expect only one `=` - otherwise encode it
    |> String.split("=", parts: 2)
    |> escape_value()
  end

  defp escape_value([key, original_value]) do
    sanitized_value =
      if needs_sanitization?(original_value) do
        URI.encode_www_form(original_value)
      else
        original_value
      end

    "#{key}=#{sanitized_value}"
  end

  defp escape_value([key]), do: key

  defp needs_sanitization?(string) do
    # Went through many iterations before we got here.
    # Went this way because:
    # 1. wanted to stop playing whack-a-mole with ever new macro patterns and other weird chars
    # 2. it's not as easy as checking for unescaped characters and escaping, as "%%" is common, but not a valid escape
    !valid_percent_encoding?(string)
  end

  # https://url.spec.whatwg.org/#percent-decode
  defp valid_percent_encoding?(string)

  @valid_hex_characters ~c"0123456789abcdefABCDEF"

  # percent encoding is % followed by 2 hex characters
  defp valid_percent_encoding?(<<"%", hex1, hex2, rest::binary>>)
       when hex1 in @valid_hex_characters and hex2 in @valid_hex_characters do
    valid_percent_encoding?(rest)
  end

  defp valid_percent_encoding?(<<char, rest::binary>>) do
    # Technically this _could_ be `URI.char_unescaped?/1` but bypass exploded for
    # something like `%ebuy!` which isn't a great sign for its general applicability
    if URI.char_unreserved?(char) do
      valid_percent_encoding?(rest)
    else
      false
    end
  end

  # everything was valid, so guess it's valid!
  defp valid_percent_encoding?(<<>>), do: true
end
```
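For completeness, a minimal sketch of how the sanitizer might be applied to a full URL before handing it to the HTTP client (the URI juggling here is illustrative, not part of the module above):

```elixir
iex> uri = URI.parse("https://some.url.com/path?a=b&referrer={encSite}")
iex> safe_uri = %{uri | query: SanitizeQueryString.sanitize(uri.query)}
iex> URI.to_string(safe_uri)
"https://some.url.com/path?a=b&referrer=%7BencSite%7D"
```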
wojtekmach commented 2 weeks ago

Fair enough about not wanting to double-encode. Are you able to use Req params feature by any chance?

iex> r = Req.new(url: "/", params: [key: "val=val"]) ; Req.Request.prepare(r).url |> to_string()
"/?key=val%3Dval"
PragTob commented 2 weeks ago

@wojtekmach I don't think we are. Our use cases/issues are two:

One of our first attempts to fix it was to have Plug parse the params (we don't usually), but that also blew up, so I removed it and instead worked on the string level to get the bad values out as early as possible :sweat_smile: - Thinking about it, this may be worth an issue in Plug as well, but working around it was "easi-er" for us :)

wojtekmach commented 2 weeks ago

What would be the issue in Plug - that it shouldn't blow up, i.e. have more relaxed parsing?

OK, I could see an argument for skipping validation, similar to the recent :case_sensitive_headers option. Garbage in, garbage out, huh? But yeah, I'm curious if there are other solutions. Perhaps smarter sanitization in your code (or Req) that does not double-encode? Though that seems pretty fragile.
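As a reference point, a sketch of how that existing opt-in is used (assuming it's passed as an HTTP/1 connect option, which is where I'd expect a skip-validation option to live as well):

```elixir
# Sketch: the proposed skip-validation option would be opt-in in the same spirit.
{:ok, conn} =
  Mint.HTTP1.connect(:https, "example.com", 443, case_sensitive_headers: true)
```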

PragTob commented 2 weeks ago

I forget what the exception was but yeah parsing failed. I'd need to mull it over.

I sadly agree with the "garbage in, garbage out" - exactly the same as :case_sensitive_headers - I've run into that particular issue more than once :cry: (not with Req/Mint). Like, I like to be standards-conformant and compliant, so ideologically I support it - but the world out there isn't standards-conformant/compliant, and that's the world in which our applications have to live. The clinching point for me was Chrome, curl and wget all supporting these URLs without complaint. At that point it feels like a "lived" standard to me.

As for sanitization, the code I posted above does exactly that, and I'm relatively confident in the tests. I'm not 100% sure I used the correct reserved-character function (but I think Mint kept blowing up when I used the other one). So, I think it's possible (before that we were playing whack-a-mole with hard-coded patterns).

That said, that might work, but I think I'd prefer for this to be a "non-issue" given an option in Mint - unless there is some other problem with that that I don't see/understand. As long as it's opt-in, like "be careful, you may shoot yourself in the foot with this", I think that's perfectly reasonable (but I also don't have the context of a maintainer here).

whatyouhide commented 2 weeks ago

Even though the lack of bunny pictures is disappointing @PragTob, yeah I think this can make sense. The option can be on by default, but allowed to be false to skip validation. Any chance you would want to work on this? 🙃

PragTob commented 2 weeks ago

@whatyouhide 😂😂😂😂

I was contemplating it but thought maybe people get annoyed when I waste their premium screen space with bunny pictures all the time. I shall make up for my past transgressions!

(three bunny pictures: IMG_20171227_094933_Bokeh, IMG_20180603_163744, IMG_20180317_110254)

PragTob commented 2 weeks ago

@whatyouhide now that I tried to make up for my past transgressions and we have that out of the way, very happy to take a stab at implementing it!

whatyouhide commented 2 weeks ago

fantastic, thanks for the pictures and the help @PragTob!

PragTob commented 2 weeks ago

@whatyouhide thanks for a fantastic library, least I can do! :green_heart: