elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Add normalise_url parameter to Crawly.Middlewares.UniqueRequest #295

Open adonig opened 3 months ago

adonig commented 3 months ago

This parameter allows users to customize the normalization behavior by passing a unary normalization function.
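As a sketch of how the option could be wired up, using the tuple form Crawly already supports for middleware options (`MySpider.normalize_url/1` is a hypothetical helper, and the exact option name follows this PR):

```elixir
# In config/config.exs — pass a custom unary normalization function
# to the UniqueRequest middleware instead of replacing the middleware.
config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    {Crawly.Middlewares.UniqueRequest, normalise_url: &MySpider.normalize_url/1},
    Crawly.Middlewares.RobotsTxt
  ]
```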

oltarasenko commented 3 months ago

Sorry, I don't understand the idea behind this change. Could you please explain?

adonig commented 3 months ago

It allows you to use a different normalization function without having to replace the whole UniqueRequest middleware. For instance, while some search engines consider URLs with and without a trailing slash identical, others do not. Some go even further and treat /index.html the same way. Some lowercase the host part of the URL, and some even sort the query string parameters alphanumerically or resolve relative paths.

For example, I use this normalization function:

  def normalize_url(url) do
    parsed = URI.parse(url)

    # Only keep http(s) URLs that have a host; anything else is dropped.
    if parsed.scheme in ["http", "https"] and parsed.host do
      # Rebuild the URL from selected components: the host is lowercased,
      # and the fragment (as well as userinfo/port) is discarded.
      %URI{
        scheme: parsed.scheme,
        host: String.downcase(parsed.host),
        path: parsed.path,
        query: parsed.query
      }
      |> URI.to_string()
    else
      nil
    end
  end
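To illustrate what this does, here is the same function wrapped in a module together with a couple of sample calls (the example URLs are mine):

```elixir
defmodule UrlNormalizer do
  # Same logic as the function above.
  def normalize_url(url) do
    parsed = URI.parse(url)

    if parsed.scheme in ["http", "https"] and parsed.host do
      %URI{
        scheme: parsed.scheme,
        host: String.downcase(parsed.host),
        path: parsed.path,
        query: parsed.query
      }
      |> URI.to_string()
    else
      nil
    end
  end
end

# Host is lowercased and the fragment is dropped:
UrlNormalizer.normalize_url("https://Example.COM/path?q=1#section")
# => "https://example.com/path?q=1"

# Non-http(s) URLs are filtered out entirely:
UrlNormalizer.normalize_url("ftp://example.com/file")
# => nil
```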
adonig commented 3 months ago

Section 6 of RFC 3986 goes a bit deeper into the topic of URL normalization.

adonig commented 2 months ago

I found out that Erlang comes with an RFC 3986-compliant URL normalization function: :uri_string.normalize/1

I believe it's still a good idea to let people provide their own implementation, because some might want to go beyond what the RFC specifies, as Cloudflare or Kaspersky do, for example.
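For reference, :uri_string.normalize/1 (available since OTP 21) can be called directly from Elixir; it applies the case, percent-encoding, scheme-based, and path-segment normalizations from RFC 3986 Section 6 (the sample URL is mine):

```elixir
# Lowercases scheme and host, drops the default port for the scheme,
# and resolves "." / ".." path segments.
:uri_string.normalize("HTTPS://Example.COM:443/a/./b/../c")
# => "https://example.com/a/c"
```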