code-for-venezuela / c4v-py


added arguments to stop scraping on demand #88

Closed LDiazN closed 3 years ago

LDiazN commented 3 years ago

This feature was suggested by Alvaro from Angostura

Problem

We need a way to stop the crawler on demand; an especially useful case is stopping once a specific number of URLs has been scraped.

Solution

Examples and how to use it

Crawling up to 6 elements:

from c4v.scraper.crawler.crawlers.primicia_crawler import PrimiciaCrawler

crawler = PrimiciaCrawler()

print(l := crawler.crawl_urls(6)) 
print(len(l))

Possible expected output:

['https://primicia.com.ve/sucesos/un-muerto-y-cuatro-heridos-durante-protestas/', 'https://primicia.com.ve/mas/salud/the-lancet-global-health-aumentan-muertes-infantiles-en-venezuela/', 'https://primicia.com.ve/deportes/abre-estadal-en-el-polideportivo%e2%80%88/', 'https://primicia.com.ve/nacion/regresan-79-venezolanos-desde-brasil-con-el-plan-vuelta-a-la-patria/', 'https://primicia.com.ve/deportes/i-valida-de-ciclismo-arranca-el-17-2/', 'https://primicia.com.ve/placeres/critican-a-mimi-lazo-por-chavista/']
6

Crawling as much as you can

from c4v.scraper.crawler.crawlers.primicia_crawler import PrimiciaCrawler

crawler = PrimiciaCrawler()

print(l := crawler.crawl_urls())  # Won't print anything for a while, since there's a huge number of URLs
print(len(l))

Crawling for a specific condition

The following code will keep crawling until a randomly drawn number happens to be even:

from c4v.scraper.crawler.crawlers.primicia_crawler import PrimiciaCrawler
from random import randint

crawler = PrimiciaCrawler()

def should_stop() -> bool:
    # Stop as soon as a random digit turns out to be even
    return randint(0, 9) % 2 == 0

crawler.crawl_and_process_urls(
    post_process_data= print,
    should_stop=should_stop
)

Note that should_stop can close over variables from its enclosing scope, so you can check arbitrary conditions at any point during the crawl.
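For instance, here is a minimal sketch of that closure pattern: a factory that builds a should_stop callback which stops after a fixed number of checks. It doesn't touch the crawler at all (stop_after is a hypothetical helper, not part of c4v); it only demonstrates how state captured in the enclosing scope can drive the stop condition you pass to crawl_and_process_urls.

```python
def stop_after(limit: int):
    """Return a should_stop callback that fires after `limit` invocations."""
    calls = {"count": 0}  # state captured by the closure below

    def should_stop() -> bool:
        calls["count"] += 1
        return calls["count"] >= limit

    return should_stop

# The crawler would call this callback between batches; we simulate 4 calls:
should_stop = stop_after(3)
print([should_stop() for _ in range(4)])  # → [False, False, True, True]
```

The same idea works for any condition: a deadline, a count of matching URLs, or a flag flipped from another thread.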

Relevant files

dieko95 commented 3 years ago

Approving to unblock you @LDiazN