[question] scrapping dynamic urls

elixir-crawly / crawly

Crawly, a high-level web crawling & scraping framework for Elixir.

https://hexdocs.pm/crawly

Apache License 2.0


[question] scrapping dynamic urls #186

Closed mario-mazo closed 2 years ago

mario-mazo commented 3 years ago

Hello

I'm thinking about using Crawly for a project, but I'm not sure what the best way to scrape dynamic URLs is.

For example, I need to scrape www.something.com/site/AAA all the way through www.something.com/site/ZZZ. The trailing AAA-ZZZ part is a unique identifier.

So should I pass the identifiers to start_urls, or should I fetch them inside parse_item?

thanks

Ziinc commented 3 years ago

There are two ways:

1. Adding them in start_urls
2. Incrementally adding them in parse_item

Which method to use depends on how many URL permutations you are looking at. If the number is absurdly high (hundreds of thousands of URLs), go with the second method, adding them incrementally in parse_item.
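
As a rough illustration of both methods, here is a minimal sketch in plain Elixir. The `SiteIds` module and its function names are hypothetical (they are not part of Crawly), and it assumes the identifiers are exactly three uppercase letters as in the question. `all_urls/0` produces the full list you could return from the spider's `init/0` as `start_urls`, while `next_url/1` computes the following URL, which `parse_item/1` could turn into a new request.

```elixir
defmodule SiteIds do
  # Hypothetical helper, not part of Crawly. Assumes identifiers are exactly
  # three uppercase ASCII letters, AAA through ZZZ, as in the question.
  @base "https://www.something.com/site/"
  @letters Enum.to_list(?A..?Z)

  # Method 1: build every URL up front (26 * 26 * 26 = 17,576 URLs).
  # In a spider this could be returned from init/0, for example:
  #   def init(), do: [start_urls: SiteIds.all_urls()]
  def all_urls do
    for a <- @letters, b <- @letters, c <- @letters, do: @base <> <<a, b, c>>
  end

  # Method 2: given the URL that was just parsed, return the next URL in
  # sequence, or nil once ZZZ has been reached. In a spider, parse_item/1
  # could wrap the result with Crawly.Utils.request_from_url/1 and put it in
  # the `requests` list of the %Crawly.ParsedItem{} it returns.
  def next_url(url) do
    id = url |> String.trim_trailing("/") |> String.split("/") |> List.last()

    case next_id(id) do
      nil -> nil
      next -> @base <> next
    end
  end

  # Increment a three-letter identifier like a base-26 counter.
  def next_id("ZZZ"), do: nil
  def next_id(<<a, b, c>>) when c < ?Z, do: <<a, b, c + 1>>
  def next_id(<<a, b, ?Z>>) when b < ?Z, do: <<a, b + 1, ?A>>
  def next_id(<<a, ?Z, ?Z>>) when a < ?Z, do: <<a + 1, ?A, ?A>>
end
```

With method 2 you would start from a single URL (the AAA page) and let each parsed page schedule the next one. That keeps the pending request list small, at the cost of crawling the pages one after another.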