elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0
965 stars, 113 forks

Protocol error #238

Closed: assertnotnull closed this issue 5 months ago

assertnotnull commented 1 year ago

I followed the docs and wrote a simple spider, but running it gives me a protocol error. Versions: Elixir 1.14, Erlang/OTP 24, Crawly 0.14.

defmodule BasicSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url do
    "https://www.metal-archives.com"
  end

  @impl Crawly.Spider
  def init() do
    [
      start_urls: [
        "https://www.metal-archives.com/bands/Judas_Priest/97"
      ]
    ]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    {:ok, document} = Floki.parse_document(response.body)
    IO.inspect(document)

    items =
      document
      |> Floki.find("#band_content")
      |> Enum.map(fn x ->
        %{
          name: Floki.find(x, ".band_name") |> Floki.text()
        }
      end)

    IO.inspect(items)

    %Crawly.ParsedItem{items: items, requests: []}
  end
end

Error:

** (Protocol.UndefinedError) protocol String.Chars not implemented for %Crawly.Request{url: "https://www.metal-archives.com/bands/Judas_Priest/97", headers: [{"User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"}], prev_response: nil, options: [], middlewares: [{Crawly.Middlewares.UserAgent, [user_agents: ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"]]}, {Crawly.Pipelines.WriteToFile, [folder: "./tmp", extension: "jl"]}], retries: 0} of type Crawly.Request (a struct). This protocol is implemented for the following type(s): Atom, BitString, Date, DateTime, Decimal, Float, Floki.Selector, Floki.Selector.AttributeSelector, Floki.Selector.Combinator, Floki.Selector.Functional, Floki.Selector.PseudoClass, Hex.Solver.Assignment, Hex.Solver.Constraints.Empty, Hex.Solver.Constraints.Range, Hex.Solver.Constraints.Union, Hex.Solver.Incompatibility, Hex.Solver.PackageRange, Hex.Solver.Term, Integer, List, NaiveDateTime, Phoenix.LiveComponent.CID, Postgrex.Copy, Postgrex.Query, Time, URI, Version, Version.Requirement

oltarasenko commented 1 year ago

Hey @assertnotnull, the code above seems to work fine for me:

iex(2)>
10:45:07.962 [warning] Description: 'Server authenticity is not verified since certificate path validation is not enabled'
     Reason: 'The option {verify, verify_peer} and one of the options \'cacertfile\' or \'cacerts\' are required to enable this.'

[
  %{
    name: "Judas Priest",
    url: "https://www.metal-archives.com/bands/Judas_Priest/97"
  }
]

10:45:09.800 [debug] Stored item: %{name: "Judas Priest", url: "https://www.metal-archives.com/bands/Judas_Priest/97"}

Could you provide a bit more info about your case? To me it looks like the problem is in one of the inspect calls in the code above.

zongwu233 commented 1 year ago

I have the same problem.

Elixir 1.14.3 (compiled with Erlang/OTP 25) Crawly 0.14

zongwu233 commented 1 year ago

I think I found the reason. I checked the code of version 0.14 that my local project depends on, and, strangely, the Request module there does not contain the following implementation:

defimpl String.Chars, for: Crawly.Request do
  def to_string(s) do
    inspect(s)
  end
end

But the master branch has it. Is there a problem with the released 0.14 code?
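
For readers unfamiliar with the mechanism: string interpolation calls String.Chars.to_string/1 on the value, so a struct without an implementation of that protocol raises Protocol.UndefinedError. A minimal sketch of the same fix, using a hypothetical stand-in struct (Demo.Request is an assumption, not part of Crawly) so that it runs without Crawly installed:

```elixir
# Hypothetical stand-in for Crawly.Request; the field names are illustrative only.
defmodule Demo.Request do
  defstruct url: nil, retries: 0
end

# Same shape as the implementation found on the master branch:
# delegate to inspect/1 so the struct can be used in string interpolation.
defimpl String.Chars, for: Demo.Request do
  def to_string(request), do: inspect(request)
end

request = %Demo.Request{url: "https://example.com"}

# Interpolation now works instead of raising Protocol.UndefinedError.
IO.puts("Fetching #{request}")
```

In principle, the same defimpl could be added to a local project for Crawly.Request as a stopgap. That is an untested assumption here, and it would likely conflict with the library's own implementation after upgrading.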

zongwu233 commented 1 year ago

Yeah, I pointed the dependency in mix.exs directly at the master branch, and the error disappeared.
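
A sketch of what such a dependency entry might look like in mix.exs, assuming the repository is elixir-crawly/crawly as shown in the page header. Only the deps function is shown; the rest of the file is whatever the project already has.

```elixir
# In mix.exs: fetch Crawly from the GitHub master branch instead of the Hex release.
# Assumption: the repository path is taken from the header of this page.
defp deps do
  [
    {:crawly, github: "elixir-crawly/crawly", branch: "master"}
  ]
end
```

After changing the entry, running mix deps.get fetches the new source. Tracking a branch is not reproducible across machines, so this is best treated as a temporary measure until a fixed release is available.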

oltarasenko commented 1 year ago

Strange. @zongwu233, as far as I can see, that code was added 6 months ago. In any case, I am preparing 0.15.0, so hopefully the error will disappear soon.