elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.
https://hexdocs.pm/crawly
Apache License 2.0
976 stars, 115 forks

[question] scraping dynamic URLs #186

Closed mario-mazo closed 2 years ago

mario-mazo commented 3 years ago

Hello

I'm thinking about using Crawly for a project, but I'm not sure what the best way to scrape dynamic URLs is.

For example, I need to scrape www.something.com/site/AAA all the way through www.something.com/site/ZZZ. The trailing AAA-ZZZ segment is a unique identifier.

So should I pass the identifiers to start_urls, or should I fetch them inside parse_item?

Thanks

Ziinc commented 3 years ago

There are two ways:

  1. Adding them all in start_urls
  2. Incrementally adding them in parse_item

Which method to use depends on how many URL permutations you are looking at. If the number of URLs is absurdly high (in the hundreds of thousands), go with method 2.
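
As a rough illustration of both methods, here is a minimal sketch in plain Elixir. The `SiteUrls` module, its function names, the `https://` scheme, and the assumption that the identifier is exactly three uppercase letters A-Z are all hypothetical, inferred from the question rather than taken from Crawly.

```elixir
defmodule SiteUrls do
  # Hypothetical helper, not part of Crawly. The base URL and the
  # three-letter A-Z identifier scheme are assumptions from the question.
  @base "https://www.something.com/site/"

  # Method 1: build every URL up front (26 * 26 * 26 = 17,576 entries).
  def all_urls do
    for a <- ?A..?Z, b <- ?A..?Z, c <- ?A..?Z do
      @base <> <<a, b, c>>
    end
  end

  # Method 2: given the URL that was just parsed, compute the next one,
  # so only one follow-up request is scheduled at a time.
  # Returns nil once the last identifier (ZZZ) has been reached.
  def next_url(url) do
    id = url |> String.split("/") |> List.last()

    case next_id(id) do
      nil -> nil
      next -> @base <> next
    end
  end

  # Treat the identifier as a base-26 counter and add one.
  def next_id(id) do
    id
    |> String.to_charlist()
    |> Enum.reverse()
    |> increment()
    |> case do
      :overflow -> nil
      chars -> chars |> Enum.reverse() |> List.to_string()
    end
  end

  # Works on the reversed charlist so the carry moves left to right.
  defp increment([]), do: :overflow

  defp increment([?Z | rest]) do
    case increment(rest) do
      :overflow -> :overflow
      tail -> [?A | tail]
    end
  end

  defp increment([c | rest]), do: [c + 1 | rest]
end
```

In a Crawly spider, the list from `SiteUrls.all_urls/0` could be returned from `init/0` as `[start_urls: SiteUrls.all_urls()]`. For the incremental variant, `parse_item/1` could call `SiteUrls.next_url/1` on the response's URL and, when the result is not nil, place a request built with `Crawly.Utils.request_from_url/1` in the `requests` list of the returned `%Crawly.ParsedItem{}`. Under the three-letter assumption there are only 17,576 URLs, well below the "hundreds of thousands" range, so method 1 would likely be enough.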