elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Adding the original request to the parse_item callback #220

Closed nuno84 closed 6 months ago

nuno84 commented 2 years ago

This is based on the discussion I opened. I want to crawl, let's say, 100 websites, with the item parsing elements (Floki query selectors) configured on a webpage so the user can fine-tune them. I found Oleg's article useful. My idea is to write a generic HTTP spider that parses the data based on the request's custom_data. I think this pull request is useful, and I tried to make it backward compatible. So now you can do both:

  def parse_item(response), do: ...

OR

  def parse_item(response, request = %Crawly.Request{custom_data: req_data}), do: ...
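To make the generic-spider idea concrete, here is a minimal sketch of a two-argument parse_item that reads its extraction rules from the request. Everything in it is an assumption for illustration: `Sketch.Request` and `Sketch.Response` are hypothetical stand-ins for the real structs, `GenericSpider` is an invented name, and regular expressions stand in for Floki selectors so the block runs without dependencies. The returned map only mirrors the `items` and `requests` keys of a parsed item.

```elixir
defmodule Sketch.Request do
  # Hypothetical stand-in for a request struct with the custom_data field
  # proposed in this PR.
  defstruct url: nil, custom_data: %{}
end

defmodule Sketch.Response do
  # Hypothetical stand-in for the HTTP response; only the body is used here.
  defstruct body: ""
end

defmodule GenericSpider do
  # The per-site rules travel with the request, so a single spider can
  # parse many differently structured sites.
  def parse_item(%Sketch.Response{body: body}, %Sketch.Request{custom_data: rules}) do
    item =
      Map.new(rules, fn {field, pattern} ->
        value =
          case Regex.run(pattern, body, capture: :all_but_first) do
            [match | _] -> match
            _ -> nil
          end

        {field, value}
      end)

    # Plain map with the same keys as a parsed item (assumption: no new
    # requests are scheduled in this sketch).
    %{items: [item], requests: []}
  end
end
```

With rules such as `%{title: ~r{<h1>(.*?)</h1>}}` in custom_data, the same spider module extracts a title from any page whose request carries that rule; fields whose pattern does not match come back as nil.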

The apply function is now:

  defp do_parse(nil, spider_name, response, request) do
    if :erlang.function_exported(spider_name, :parse_item, 2) do
      spider_name.parse_item(response, request)
    else
      spider_name.parse_item(response) # This is for backward compatibility
    end
  end
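The arity-based fallback above can be tried in isolation. The sketch below is not Crawly's code: `DispatchSketch`, `OldSpider`, and `NewSpider` are hypothetical names. It uses `Kernel.function_exported?/3`, and it calls `Code.ensure_loaded/1` first because the export check returns false for a module that has not been loaded yet.

```elixir
defmodule DispatchSketch do
  # Hypothetical helper: call parse_item/2 when the spider defines it,
  # otherwise fall back to parse_item/1 for backward compatibility.
  def do_parse(spider, response, request) do
    # function_exported?/3 does not load modules, so load it explicitly.
    Code.ensure_loaded(spider)

    if function_exported?(spider, :parse_item, 2) do
      spider.parse_item(response, request)
    else
      spider.parse_item(response)
    end
  end
end

defmodule OldSpider do
  # Existing spiders keep their single-argument callback.
  def parse_item(response), do: {:arity_1, response}
end

defmodule NewSpider do
  # New spiders can opt in to receiving the original request.
  def parse_item(response, request), do: {:arity_2, response, request}
end
```

Calling `DispatchSketch.do_parse(OldSpider, resp, req)` reaches the one-argument callback, while the same call with `NewSpider` passes the request through, so existing spiders need no changes.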

The tests are passing on my computer. Please take a look. I can add some documentation if this idea goes forward. Thank you

starcraft66 commented 1 year ago

Super cool to see this, I was just about to go implement this myself after having a use-case for it (crawling image files for which the metadata is located on the previously-crawled page). I will see how well this works for me and take a look at the failing tests too. Unfortunate that development seems to be stalled on this project.

Ziinc commented 1 year ago

@nuno84 PR is appreciated. The entire Request gets copied to the Response struct, so adding in the second argument to parse_item callback is unnecessary.

I would suggest using the metadata or meta key on the Request struct instead of custom_data, which is more semantically correct.

starcraft66 commented 1 year ago

The entire Request gets copied to the Response struct

@Ziinc The HTTPoison.Response.t() contains the original HTTPoison.Request.t() struct but I don't think it is helpful in the context of the PR because that HTTPoison.Request.t() will not contain any of the metadata stored in the Crawly.Request.t() wrapping the response.
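The point can be illustrated with a small sketch. All modules below are hypothetical stand-ins (`Stub.HTTPRequest`, `Stub.HTTPResponse`, `Stub.CrawlRequest`, `Stub.Fetcher`), not the real HTTPoison or Crawly structs, and the fetch step is an assumption about how a fetcher might forward only transport fields to the HTTP client.

```elixir
defmodule Stub.HTTPRequest do
  # Stand-in for the HTTP client's request struct: transport fields only.
  defstruct url: nil, headers: [], options: []
end

defmodule Stub.HTTPResponse do
  # Stand-in for the HTTP client's response, which embeds the HTTP request.
  defstruct status_code: nil, body: "", request: nil
end

defmodule Stub.CrawlRequest do
  # Stand-in for the crawler-level request that carries extra metadata.
  defstruct url: nil, headers: [], options: [], custom_data: %{}
end

defmodule Stub.Fetcher do
  # Hypothetical fetch: only url, headers, and options reach the HTTP layer,
  # so custom_data never ends up inside the response.
  def fetch(%Stub.CrawlRequest{} = req) do
    http_req = %Stub.HTTPRequest{url: req.url, headers: req.headers, options: req.options}
    %Stub.HTTPResponse{status_code: 200, body: "", request: http_req}
  end
end
```

Under these assumptions, `resp.request` still knows the URL that was fetched, but there is no `custom_data` key to read back, which is why a spider would need the crawler-level request passed to it separately.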