elixir-crawly/crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Take the RobotsTxt User-Agent from the Request #294

Closed · adonig closed this 3 months ago

adonig commented 3 months ago

This pull request updates the RobotsTxt middleware to take the User-Agent from each request dynamically instead of relying on a hardcoded value. It supersedes an earlier attempt and merges cleanly, avoiding the issues that attempt ran into.
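
For reference, a minimal sketch of the dynamic lookup is shown below. This is not necessarily the exact code in the PR: it assumes request headers are stored as {"User-Agent", value} tuples (as Crawly.Middlewares.UserAgent sets them) and that "Crawly" remains the fallback agent when no header is present.

  defmodule Crawly.Middlewares.RobotsTxt do
    @behaviour Crawly.Pipeline

    def run(request, state, _opts \\ []) do
      # Take the agent from the request headers instead of a hardcoded value.
      user_agent =
        case List.keyfind(request.headers, "User-Agent", 0) do
          {_, ua} -> ua
          nil -> "Crawly"
        end

      # Gollum.crawlable?/2 returns :crawlable, :uncrawlable, or :undefined;
      # only an explicit :uncrawlable drops the request from the pipeline.
      case Gollum.crawlable?(user_agent, request.url) do
        :uncrawlable -> {false, state}
        _otherwise -> {request, state}
      end
    end
  end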

adonig commented 3 months ago

Maybe there's a way to squash all those commits into one 😅

adonig commented 3 months ago

Do you believe this test is sufficient?

  test "Respects the User-Agent header when evaluating robots.txt" do
    :meck.expect(Gollum, :crawlable?, fn
      "My Custom Bot", _url -> :crawlable
      _ua, _url -> :uncrawlable
    end)

    middlewares = [
      {Crawly.Middlewares.UserAgent, user_agents: ["My Custom Bot"]},
      Crawly.Middlewares.RobotsTxt
    ]

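    # @valid is a request fixture defined earlier in this test module.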
    req = @valid
    state = %{spider_name: :test_spider, crawl_id: "123"}

    assert {%Crawly.Request{}, _state} =
             Crawly.Utils.pipe(middlewares, req, state)

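    # Without the UserAgent middleware the fallback agent is used, so the
    # catch-all mock clause returns :uncrawlable and the request is dropped.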
    middlewares = [Crawly.Middlewares.RobotsTxt]

    assert {false, _state} = Crawly.Utils.pipe(middlewares, req, state)
  end
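
The first pipeline runs Crawly.Middlewares.UserAgent before RobotsTxt, so the request reaches the mocked Gollum as "My Custom Bot" and passes through as a %Crawly.Request{}. The second pipeline omits the UserAgent middleware, so the fallback agent hits the catch-all :uncrawlable clause and the request is dropped with {false, _state}. Unloading the mock (e.g. :meck.unload/0 in a setup or on_exit callback) is assumed to happen elsewhere in the test module.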