elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

Adding the original request to the parse_item callback #219

Closed · nuno84 closed this 1 year ago

nuno84 commented 1 year ago

Based on the discussion I opened: discussion. I want to crawl, let's say, 100 websites, with the parse rules for the items (Floki query selectors) set per website so the user can fine-tune them. I found this article by Oleg useful: article. My idea is to write one generic HTTP spider and, inside it, parse the data based on the request's custom_data, so the parse rules for a.com can differ from the ones for b.com. I don't want to query a database for every item, so I need that info passed through to parse_item (and to subsequent requests).
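To make the idea concrete, here is a rough sketch of the generic spider I have in mind (the :custom_data field and the two-arity callback are what this PR proposes; the selector names, URLs, and the start_requests seeding are made up for illustration):

```elixir
defmodule GenericSpider do
  use Crawly.Spider

  def base_url(), do: "https://a.com"

  def init() do
    # Seeding with full requests (instead of plain start_urls) lets each
    # site carry its own parse rules in the proposed :custom_data field.
    rules_a = %{title: "h1.name", price: "span.price"}

    [
      start_requests: [
        %Crawly.Request{url: "https://a.com/catalog", custom_data: rules_a}
      ]
    ]
  end

  def parse_item(response, %Crawly.Request{custom_data: rules}) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      title: document |> Floki.find(rules.title) |> Floki.text(),
      price: document |> Floki.find(rules.price) |> Floki.text()
    }

    # Follow-up requests inherit the same rules, so subsequent pages of
    # the same site are parsed with the same selectors.
    next_requests =
      document
      |> Floki.find("a.next")
      |> Floki.attribute("href")
      |> Enum.map(fn url -> %Crawly.Request{url: url, custom_data: rules} end)

    %{items: [item], requests: next_requests}
  end
end
```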

I am not sure there is another way to do this as things stand, so I think this pull request is useful. I have tried to make it backward compatible, so now you can do either of:

```elixir
def parse_item(response), do: ...

# or

def parse_item(response, request = %Crawly.Request{custom_data: req_data}), do: ...
```
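Since custom_data is just a field on the request struct, the rules can be matched right in the function head, with a catch-all clause for requests that carry no rules (for example, ones coming from plain start_urls). A sketch (parse_with_rules/2 is a made-up helper):

```elixir
# Requests seeded with rules are parsed with them...
def parse_item(response, %Crawly.Request{custom_data: rules}) when is_map(rules) do
  parse_with_rules(response, rules)
end

# ...and requests without custom_data fall back to a default.
def parse_item(response, _request) do
  %{items: [], requests: []}
end
```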

The apply function is now:

```elixir
defp do_parse(nil, spider_name, response, request) do
  if :erlang.function_exported(spider_name, :parse_item, 2) do
    spider_name.parse_item(response, request)
  else
    # This is for backward compatibility
    spider_name.parse_item(response)
  end
end
```
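One caveat worth noting: :erlang.function_exported/3 only sees modules that are already loaded, so a slightly more defensive version could ensure the spider module is loaded before checking the arity (a sketch using the Kernel wrappers):

```elixir
defp do_parse(nil, spider_name, response, request) do
  # function_exported?/3 returns false for modules that have not been
  # loaded yet, so load the spider module before checking the arity.
  Code.ensure_loaded(spider_name)

  if function_exported?(spider_name, :parse_item, 2) do
    spider_name.parse_item(response, request)
  else
    # Backward compatibility with existing one-arity spiders
    spider_name.parse_item(response)
  end
end
```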

The tests are passing on my computer. I am still learning Elixir, so please take a good look; any feedback is appreciated. I can add some documentation if this idea goes forward. Thank you.