Closed nuno84 closed 6 months ago
Super cool to see this, I was just about to go implement this myself after having a use-case for it (crawling image files for which the metadata is located on the previously-crawled page). I will see how well this works for me and take a look at the failing tests too. Unfortunate that development seems to be stalled on this project.
@nuno84 PR is appreciated. The entire `Request` gets copied to the `Response` struct, so adding the second argument to the `parse_item` callback is unnecessary. I would suggest using a `metadata` or `meta` key on the `Request` struct instead of `custom_data`, which is more semantically correct.
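To make the suggestion concrete, a hedged sketch of what that would look like (the `:meta` field and the image-crawling values are assumptions for illustration; they are not part of the released `Crawly.Request` struct at the time of this discussion):

```elixir
# Hypothetical: a :meta key on Crawly.Request carrying per-request data,
# e.g. metadata gathered from the previously crawled page.
request = %Crawly.Request{
  url: "https://example.com/image.jpg",
  meta: %{caption: "Found on the listing page", source_page: "/gallery/42"}
}

# The spider could then read it back when parsing:
caption = request.meta[:caption]
```

The name `meta` mirrors what other crawling frameworks call this slot, which is the "more semantically correct" point being made here.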
> The entire `Request` gets copied to the `Response` struct
@Ziinc The `HTTPoison.Response.t()` contains the original `HTTPoison.Request.t()` struct, but I don't think it is helpful in the context of this PR, because that `HTTPoison.Request.t()` will not contain any of the metadata stored in the `Crawly.Request.t()` wrapping the response.
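To illustrate the distinction (the struct fields below are HTTPoison's documented ones; the `custom_data` value is this PR's proposed field):

```elixir
# An HTTPoison.Response does carry the HTTP-level request it was built from...
%HTTPoison.Response{request: %HTTPoison.Request{url: url}} = response

# ...but HTTPoison.Request only has HTTP fields (method, url, headers,
# body, params, options). Anything stored on the wrapping Crawly.Request,
# such as custom_data, is never copied into it, so a parse_item/1 callback
# that receives only the response cannot recover that metadata.
```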
Based on the discussion I opened: discussion

So, I want to crawl, let's say, 100 websites, with the item parse elements (Floki query selectors) for each site set on a webpage, so the user can fine-tune them. I found this article by Oleg useful: article

So my idea is to write a generic HTTP spider and, inside it, use that info to parse the data based on the request's `custom_data`. I think this pull request is useful, and I tried to make it backward compatible. So now you can do both:
```elixir
def parse_item(response), do: ...
```

OR

```elixir
def parse_item(response, request = %Crawly.Request{custom_data: req_data}), do: ...
```
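As a sketch of that generic-spider idea (the selector map inside `custom_data`, the example URLs, and the two-arity callback are assumptions based on this PR, not released Crawly API):

```elixir
defmodule GenericSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://example.com"

  @impl Crawly.Spider
  def init() do
    # Per-site Floki selectors travel with the request via custom_data,
    # so one spider module can parse many differently structured sites.
    [
      start_requests: [
        %Crawly.Request{
          url: "https://example.com/products",
          custom_data: %{title_selector: "h1.title", price_selector: ".price"}
        }
      ]
    ]
  end

  # Two-arity callback proposed in this PR: the wrapping Crawly.Request
  # is passed in alongside the HTTP response.
  def parse_item(response, %Crawly.Request{custom_data: selectors}) do
    {:ok, document} = Floki.parse_document(response.body)

    item = %{
      title: document |> Floki.find(selectors.title_selector) |> Floki.text(),
      price: document |> Floki.find(selectors.price_selector) |> Floki.text()
    }

    %Crawly.ParsedItem{items: [item], requests: []}
  end
end
```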
The apply function is now:
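The snippet itself did not survive in this thread, but backward-compatible dispatch of this kind is usually done by checking which arity the spider module exports, roughly (a hypothetical reconstruction, not the PR's actual code):

```elixir
# Hypothetical sketch of the worker's dispatch: call parse_item/2 when
# the spider defines it, otherwise fall back to the original parse_item/1.
parsed_item =
  if function_exported?(spider_name, :parse_item, 2) do
    spider_name.parse_item(response, request)
  else
    spider_name.parse_item(response)
  end
```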
The tests are passing on my computer. Please take a look. I can add some documentation if this idea goes forward. Thank you