code-for-venezuela / c4v-py

Create toy scraper for main sources #67

Open marianelamin opened 3 years ago

marianelamin commented 3 years ago

To better define what we can and cannot retrieve during scraping, we need to experiment with a toy scraper.

Sources potentially needed:

[image: sources potentially needed]

Proposed Solution

By Luis,

Scraper Creation

In this guide, we will go through the process of creating a new scraper, which can be summed up in the following steps:

  1. Select an output data format
  2. Implement a BaseScraper subclass
  3. Wire the new scraper into the installed ones

Selecting an output data format


Each page may expose different scrapable information, such as hashtags on Twitter or the section name on a news site. In any case, we don't want to lose such valuable information. If one of the available data formats fits your needs, select it.

If none of the existing data formats in scraper/scraped_data_classes fits your scrapable data, you can write a new one by creating a file in scraper/scraped_data_classes that implements the base class BaseDataFormat, located in scraper/scraped_data_classes/base_scraped_data.py. Your class should implement the method to_scraped_data : (self) -> ScrapedData, which maps your data format to our currently supported database schema (represented by the ScrapedData class).

This is needed because scrapers may vary in their needs and in the data they scrape. If, for instance, you require extra clean-up logic, you can write it in your custom data format and test it more easily.
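As a rough illustration of the idea, here is a minimal, self-contained sketch of a custom data format. The ScrapedData and BaseDataFormat classes below are simplified stand-ins written for this example, not the project's real classes. The fields url, title and content, and the NewsArticleData class itself, are assumptions made for illustration only.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScrapedData:
    """Stand-in for the project's ScrapedData; these fields are assumed."""
    url: str
    title: str
    content: str


class BaseDataFormat(ABC):
    """Stand-in for the base class in base_scraped_data.py."""

    @abstractmethod
    def to_scraped_data(self) -> ScrapedData:
        ...


@dataclass
class NewsArticleData(BaseDataFormat):
    """Hypothetical format that keeps a site-specific field (the news section)."""
    url: str
    title: str
    section: str
    paragraphs: List[str] = field(default_factory=list)

    def to_scraped_data(self) -> ScrapedData:
        # Clean-up logic lives with the format, so it can be tested in isolation:
        # strip whitespace and drop empty paragraphs before joining them.
        body = "\n".join(p.strip() for p in self.paragraphs if p.strip())
        return ScrapedData(url=self.url, title=self.title.strip(), content=body)


article = NewsArticleData(
    url="https://example.com/news/1",
    title="  Sample headline ",
    section="politics",
    paragraphs=[" First paragraph. ", "", "Second paragraph."],
)
scraped = article.to_scraped_data()
```

Keeping the mapping inside the data format means a spider can return the richer object, and only the conversion step needs to know about the database schema.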

Implementing BaseScraper subclass


This step depends on the kind of scraper you want to write. If you want to write a scrapy-based scraper, we provide a utility class to make it easier. Otherwise, we also provide a base class whose methods you implement to add a new scraper.

Scrapy based scrapers:

1) Create a scrapy spider as you usually would and save it in scraper/spiders. Its parse method should return the data format selected in the previous step.
2) Create a file/module in scraper/scrapers implementing a class that inherits from BaseScrapyScraper, located in scraper/scrapers/base_scrapy_scraper.py.
3) The only thing that class should add is two class variables:

From scratch

1) Define a new file/module in scraper/scrapers with a class that inherits from and implements the BaseScraper class, located in scraper/scrapers/base_scraper.py.
2) That class should implement, at least, the following methods:

Note that every other method can still be overridden.
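The list of required methods is not reproduced in this comment, so the sketch below only shows the general shape of a from-scratch scraper. BaseScraper and ScrapedData here are stand-ins defined for this example. The method name parse and its (url, html) signature are hypothetical placeholders, not the project's actual interface. The HTML is passed in as a string to keep the example free of network calls.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class ScrapedData:
    """Stand-in for the project's ScrapedData; these fields are assumed."""
    url: str
    title: str


class BaseScraper(ABC):
    """Stand-in base class; `parse` is a hypothetical required method."""

    @abstractmethod
    def parse(self, url: str, html: str) -> ScrapedData:
        ...


class _TitleParser(HTMLParser):
    """Collects the text found inside the <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class ToyTitleScraper(BaseScraper):
    """Toy scraper that extracts only the page title."""

    def parse(self, url: str, html: str) -> ScrapedData:
        parser = _TitleParser()
        parser.feed(html)
        return ScrapedData(url=url, title=parser.title.strip())


page = "<html><head><title> Hello </title></head><body>text</body></html>"
result = ToyTitleScraper().parse("https://example.com", page)
```

Because the required method is declared abstract in the stand-in base class, a subclass that forgets to implement it fails at instantiation time rather than in the middle of a crawl.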

Wiring the new scraper


Go to the scraper/settings.py file, import your new scraper, and add it to the INSTALLED_SCRAPERS list.
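A sketch of what the wiring could look like, and of how a registry like this might be consulted. Only the INSTALLED_SCRAPERS name comes from the instructions above. The intended_domain attribute, the ExampleNewsScraper class and the find_scraper helper are hypothetical and exist only to make the example runnable.

```python
from typing import List, Optional, Type
from urllib.parse import urlparse


class BaseScraper:
    """Stand-in base class; `intended_domain` is a hypothetical attribute."""
    intended_domain: str = ""


class ExampleNewsScraper(BaseScraper):
    """Hypothetical scraper for a single site."""
    intended_domain = "news.example.com"


# What the list in scraper/settings.py might look like after adding a scraper.
INSTALLED_SCRAPERS: List[Type[BaseScraper]] = [
    ExampleNewsScraper,
]


def find_scraper(url: str) -> Optional[Type[BaseScraper]]:
    """Return the first installed scraper whose domain matches the URL, if any."""
    domain = urlparse(url).netloc
    for scraper_cls in INSTALLED_SCRAPERS:
        if scraper_cls.intended_domain == domain:
            return scraper_cls
    return None


match = find_scraper("https://news.example.com/articles/1")
no_match = find_scraper("https://other.example.org/")
```

With a single list as the registry, adding a scraper is a one-line change and nothing else needs to know which scrapers exist.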

marianelamin commented 3 years ago

Prioritization of this list is pending a conversation with our Vertical and the NGO.

marianelamin commented 3 years ago

@LDiazN Thanks for the instructions. Excellent work.