elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0

custom parser callback sample #189

ziyouchutuwenwu closed this issue 1 year ago

ziyouchutuwenwu commented 3 years ago

Hi, is there any sample that shows how to use a custom parser callback instead of the default parse_item? I read the doc here, but I don't know how to use it.

Thanks for your help.

oltarasenko commented 3 years ago

@Ziinc can probably give more info here.

But could you please describe the use case? Why can't you use parse_item?

ziyouchutuwenwu commented 3 years ago

Here is my usage scenario:

For the site demo.com, I need to get some info, such as title and category, from the main page, and collect sub-URLs from some of its links. When I get a sub-URL, I send a request and then parse data from the response; there I need to get detail info such as author, price, etc.

The parsing logic for the sub pages should be different from the main page, and I don't know how to do this with Crawly.

Great thanks.
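
A minimal sketch of this scenario in Crawly, assuming the hypothetical demo.com layout described above: one common approach is to keep a single parse_item callback and branch on the request URL. The module name, selectors, URL pattern, and item fields are all illustrative, and Floki is assumed for HTML parsing:

# Hypothetical spider for the demo.com scenario above
defmodule DemoSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url, do: "https://demo.com"

  @impl Crawly.Spider
  def init, do: [start_urls: ["https://demo.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    document = Floki.parse_document!(response.body)

    # Dispatch on the URL: the main page and the detail pages are
    # parsed differently ("/items/" is an assumed detail-page pattern)
    if String.contains?(response.request_url, "/items/") do
      parse_detail(document)
    else
      parse_main(document, response.request_url)
    end
  end

  # Main page: extract title/category and follow links to detail pages
  defp parse_main(document, url) do
    requests =
      document
      |> Floki.attribute("a.item-link", "href")
      |> Crawly.Utils.build_absolute_urls(url)
      |> Crawly.Utils.requests_from_urls()

    item = %{
      title: document |> Floki.find("h1.title") |> Floki.text(),
      category: document |> Floki.find(".category") |> Floki.text()
    }

    %Crawly.ParsedItem{items: [item], requests: requests}
  end

  # Detail page: extract author/price, follow no further links
  defp parse_detail(document) do
    item = %{
      author: document |> Floki.find(".author") |> Floki.text(),
      price: document |> Floki.find(".price") |> Floki.text()
    }

    %Crawly.ParsedItem{items: [item], requests: []}
  end
end

This keeps one spider per site while giving each page type its own extraction logic; the parsers mechanism discussed later in the thread is the alternative when that logic should be shared across spiders.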

ziyouchutuwenwu commented 3 years ago

For the Python part, my demo code looks like this: [image]

oltarasenko commented 3 years ago

So... do you have different items on different pages, or the same data just structured differently?

ziyouchutuwenwu commented 3 years ago

Yes, basically I have different data structures on different pages, but from the sample code I don't know how to write this. It would be appreciated if there were some examples that could help me.

oltarasenko commented 3 years ago

Sorry, I still don't understand which of these two it is:

  1. The same item, just extracted with different selectors
  2. Two different items

Ziinc commented 3 years ago

Sorry @ziyouchutuwenwu, I only just saw this; I must have missed the ping.

Parsers are meant for commonly used logic that you want to reuse across spiders. A parser is simply a Pipeline module, and the result of each parser is passed to the next. The third positional argument, opts, lets you provide spider-specific configuration to your parser.

For example, on site 1 you want to extract all links inside h1 tags, filter them with some site-specific filter function, and build requests from the extracted links:

# spider 1
parsers: [
  {MyCustomRequestParser, [selector: "h1", filter: &my_filter_function/1]}
]

Then, in spider 2, which crawls site 2, we only want h2 tags, with no filtering:

# spider 2
parsers: [
  {MyCustomRequestParser, [selector: "h2"]}
]
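
A minimal sketch of where such a snippet could live, assuming Crawly's per-spider override_settings/0 callback (the module name is hypothetical, other callbacks omitted):

# spider 2, hypothetical module name
defmodule Spider2 do
  use Crawly.Spider

  # base_url/0, init/0, etc. omitted for brevity

  @impl Crawly.Spider
  def override_settings do
    # Spider-level settings take precedence over the global config
    [parsers: [{MyCustomRequestParser, [selector: "h2"]}]]
  end
end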

Then your MyCustomRequestParser.run/3 contains the logic required to select and build the requests.
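
A minimal sketch of such a parser, assuming (per the Crawly.Pipeline contract) that run/3 receives the accumulated Crawly.ParsedItem plus a state map carrying the fetched response; the :selector and :filter options mirror the snippets above, and the URL-based filter is an assumption:

# Hypothetical request-extracting parser matching the configs above
defmodule MyCustomRequestParser do
  @behaviour Crawly.Pipeline

  @impl Crawly.Pipeline
  def run(parsed_item, %{response: response} = state, opts \\ []) do
    selector = Keyword.get(opts, :selector, "a")
    # Assumed: the filter receives an extracted URL and returns a boolean
    filter = Keyword.get(opts, :filter, fn _url -> true end)

    requests =
      response.body
      |> Floki.parse_document!()
      |> Floki.find(selector)
      # Collect the hrefs of links nested under the selected elements
      |> Floki.attribute("a", "href")
      |> Enum.filter(filter)
      |> Crawly.Utils.build_absolute_urls(response.request_url)
      |> Crawly.Utils.requests_from_urls()

    # Append to whatever earlier parsers accumulated and pass the result on
    {%{parsed_item | requests: parsed_item.requests ++ requests}, state}
  end
end

Because each parser returns the (possibly extended) ParsedItem, several parsers can be chained, each contributing its own items or requests.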