elixir-crawly/crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Take the RobotsTxt User-Agent from the Request #294

Closed · adonig closed this 3 months ago

adonig commented 3 months ago

This pull request updates the RobotsTxt middleware to take the User-Agent from each request dynamically instead of relying on a hardcoded value. It supersedes an earlier attempt and merges cleanly, avoiding the issues that attempt ran into.
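
For reference, a minimal sketch of the dynamic lookup is shown below. This is not necessarily the exact code in the PR: it assumes request headers are stored as {"User-Agent", value} tuples (as Crawly.Middlewares.UserAgent sets them) and that "Crawly" remains the fallback agent when no header is present.

  defmodule Crawly.Middlewares.RobotsTxt do
    @behaviour Crawly.Pipeline

    def run(request, state, _opts \\ []) do
      # Take the agent from the request headers instead of a hardcoded value.
      user_agent =
        case List.keyfind(request.headers, "User-Agent", 0) do
          {_, ua} -> ua
          nil -> "Crawly"
        end

      # Gollum.crawlable?/2 returns :crawlable, :uncrawlable, or :undefined;
      # only an explicit :uncrawlable drops the request from the pipeline.
      case Gollum.crawlable?(user_agent, request.url) do
        :uncrawlable -> {false, state}
        _otherwise -> {request, state}
      end
    end
  end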

adonig commented 3 months ago

Maybe there's a way to squash all those commits into one 😅

adonig commented 3 months ago

Do you believe this test is sufficient?

  test "Respects the User-Agent header when evaluating robots.txt" do
    :meck.expect(Gollum, :crawlable?, fn
      "My Custom Bot", _url -> :crawlable
      _ua, _url -> :uncrawlable
    end)

    middlewares = [
      {Crawly.Middlewares.UserAgent, user_agents: ["My Custom Bot"]},
      Crawly.Middlewares.RobotsTxt
    ]

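    # @valid is a request fixture defined earlier in this test module.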
    req = @valid
    state = %{spider_name: :test_spider, crawl_id: "123"}

    assert {%Crawly.Request{}, _state} =
             Crawly.Utils.pipe(middlewares, req, state)

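    # Without the UserAgent middleware the fallback agent is used, so the
    # catch-all mock clause returns :uncrawlable and the request is dropped.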
    middlewares = [Crawly.Middlewares.RobotsTxt]

    assert {false, _state} = Crawly.Utils.pipe(middlewares, req, state)
  end
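
The first pipeline runs Crawly.Middlewares.UserAgent before RobotsTxt, so the request reaches the mocked Gollum as "My Custom Bot" and passes through as a %Crawly.Request{}. The second pipeline omits the UserAgent middleware, so the fallback agent hits the catch-all :uncrawlable clause and the request is dropped with {false, _state}. Unloading the mock (e.g. :meck.unload/0 in a setup or on_exit callback) is assumed to happen elsewhere in the test module.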