elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Proxy setup. #65

Closed: ghost closed this issue 4 years ago

ghost commented 4 years ago

Hello!

When I connect through a proxy, my IP does not change. I am using ProxyMesh. When I set the same proxy on my machine through the OS settings, HTTPS connections work fine. Does Crawly support HTTPS? Could that be the cause of the issue?

Here is my config file:

use Mix.Config
# in config.exs
config :crawly,
  proxy: "us-ca.proxymesh.com:31280",
  closespider_timeout: 10,
  concurrent_requests_per_domain: 7,
  closespider_itemcount: 1000,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:url]},
    # {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, extension: "jl", folder: "/tmp"} # NEW IN 0.7.0
  ],
  port: 4001

To check which IP address the requests come from, I used this small module:

defmodule Spider.Proxy do
  @behaviour Crawly.Spider

  require Logger

  @impl Crawly.Spider
  def base_url(), do: "https://whatismyipaddress.com/"

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [
        "https://whatismyipaddress.com/"
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    item = %{
      url:
        Floki.find(
          response.body,
          "div#section_left > div:nth-of-type(2) > div:nth-of-type(1) > a"
        )
        |> Floki.text()
    }

    %Crawly.ParsedItem{:items => [item], :requests => []}
  end
end

Ziinc commented 4 years ago

You will need to add request options for HTTPoison through the RequestOptions request middleware. The format needs to match the option format in the HTTPoison docs.

https://hexdocs.pm/crawly/Crawly.Middlewares.RequestOptions.html

https://hexdocs.pm/httpoison/HTTPoison.Request.html#content
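
For illustration, a minimal sketch of what that configuration could look like. It assumes the middleware accepts a `{module, options}` tuple and that the options are passed through to HTTPoison, where `:proxy` may be a `{host, port}` tuple. The host and port are taken from the config in the question; the commented-out credentials are placeholders, not real values.

```elixir
# config/config.exs
# On Elixir versions before 1.9, use `use Mix.Config` instead.
import Config

config :crawly,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.UserAgent,
    # Options listed here are assumed to be handed to HTTPoison for each request.
    {Crawly.Middlewares.RequestOptions,
     [
       # Placeholder credentials, only needed if the proxy requires them:
       # proxy_auth: {"username", "password"},
       proxy: {"us-ca.proxymesh.com", 31280}
     ]}
  ]
```

With a setup along these lines, the top-level `proxy:` key from the original config would no longer be needed, and the test spider above should report the proxy's IP rather than the local one.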

Ziinc commented 4 years ago

@Unumus did you search through the docs before opening this issue? If you did but didn't find the RequestOptions middleware, we might need to update the docs for it.

oltarasenko commented 4 years ago

@Ziinc I was also thinking about adding an explicit example somewhere in the docs...

ghost commented 4 years ago

@Ziinc, I searched but did not find it. I think it would be helpful to mention the RequestOptions middleware here: https://hexdocs.pm/crawly/configuration.html#proxy-binary so that the proxy configuration is complete.

Ziinc commented 4 years ago

@Unumus ah I see, the docs got out of sync. The proxy setting in the config comes from an older version and no longer applies after a breaking change. We'll need to update that.

@oltarasenko I think perhaps a few common use case examples can be shown in the middleware module doc? Besides proxy, what others would be good?

Ziinc commented 4 years ago

@oltarasenko You might need to do a patch release to update hexdocs