This feature was suggested by Alvaro from Angostura
## Problem

We need a way to stop the crawler on demand; a particularly useful case is stopping once a specific number of URLs has been scraped.
## Solution

- Add an argument `should_stop: () -> bool` to the `BaseCrawler.crawl_and_process_urls` function that, if provided, is called to check whether the crawling process should stop.
- Add an argument `up_to: int` to the `BaseCrawler.crawl_urls` function that, if provided, limits crawling to that number of URLs.
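The intended semantics of these two arguments can be sketched with a minimal stand-in loop. Note that `crawl_urls_sketch` below is a hypothetical illustration of the stop conditions, not the actual `BaseCrawler` implementation:

```python
from typing import Callable, List, Optional

def crawl_urls_sketch(
    all_urls: List[str],
    up_to: Optional[int] = None,
    should_stop: Optional[Callable[[], bool]] = None,
) -> List[str]:
    """Hypothetical sketch: collect urls until either `up_to` urls
    are gathered or `should_stop()` returns True."""
    crawled: List[str] = []
    for url in all_urls:
        # User-provided stop condition takes effect between urls
        if should_stop is not None and should_stop():
            break
        # Hard cap on the number of crawled urls
        if up_to is not None and len(crawled) >= up_to:
            break
        crawled.append(url)
    return crawled

print(crawl_urls_sketch(["a", "b", "c", "d"], up_to=2))  # ['a', 'b']
```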
## Examples and how to use it

### Crawling as much as you can

```python
from c4v.scraper.crawler.crawlers.primicia_crawler import PrimiciaCrawler

crawler = PrimiciaCrawler()
print(l := crawler.crawl_urls())  # Won't print anything for a while, as there's a huge amount of urls
print(len(l))
```
### Crawling for a specific condition

The following code will crawl until a random draw yields an even number:

```python
from random import randint

from c4v.scraper.crawler.crawlers.primicia_crawler import PrimiciaCrawler

crawler = PrimiciaCrawler()

def should_stop() -> bool:
    # Stop as soon as a random even number comes up
    return randint(0, 9) % 2 == 0

crawler.crawl_and_process_urls(
    post_process_data=print,
    should_stop=should_stop,
)
```
Note that `should_stop` is a closure, so it can capture variables from its enclosing scope and check any condition you track there.
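For example, a `should_stop` closure can count processed items and halt after a threshold. The loop below is a stand-in simulation of the crawler, and `MAX_ITEMS` is an arbitrary value chosen for illustration:

```python
processed = 0
MAX_ITEMS = 3

def should_stop() -> bool:
    # The closure reads `processed` from the enclosing scope,
    # so the stop condition can depend on anything tracked outside it
    return processed >= MAX_ITEMS

# Simulated crawl loop standing in for crawl_and_process_urls
for item in ["a", "b", "c", "d", "e"]:
    if should_stop():
        break
    processed += 1

print(processed)  # 3
```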
## Relevant files

- `src/c4v/scraper/crawler/crawlers/base_crawler.py`: all changes were made to this file