To better define what we can and cannot retrieve during scraping, we need to experiment with a toy scraper.
Sources potentially needed:
[x] elpitazo.com
[ ] twitter.com
[ ] primicia.com.ve
[ ] efectotocuyo.com
[ ] laprensalara.com.ve
[ ] diariolosandes
[ ] elimpulso.com
[ ] el-carabobeno.com
[ ] cronica.uno
[ ] elnacional.com
[ ] eluniversal.com
Proposed Solution
By Luis
Scraper Creation
In this guide, we will go through the process of creating a new scraper, which can be summed up
in the following steps:
Select an output data format
Implement a BaseScraper subclass
Wire the new scraper into the list of installed ones
Selecting an output data format
Every page may expose different scrapable information: hashtags on Twitter, the news section name on a news site, and so on. In any case, we don't want to lose such valuable information. Select one of the available data formats if it fits your needs.
If no existing data format in scraper/scraped_data_classes fits your scrapable data, you can write a new one by creating a file in scraper/scraped_data_classes that implements the base class BaseDataFormat, located in scraper/scraped_data_classes/base_scraped_data.py. This class should implement the method
to_scraped_data(self) -> ScrapedData
That method maps from your data format to our currently supported database schema (represented by the ScrapedData class).
This is needed since scrapers may vary in their needs and scraped data. If, for instance, you require extra clean-up logic, you can write it over your custom data format and test it more easily.
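As a sketch, a custom data format might look like the following. Both the ScrapedData stand-in and the TwitterData class (including its fields) are illustrative assumptions; the real definitions live under scraper/scraped_data_classes and may differ:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative stand-in for the project's ScrapedData class; the real one
# lives in scraper/scraped_data_classes/base_scraped_data.py and may differ.
@dataclass
class ScrapedData:
    url: str
    title: str = ""
    content: str = ""

# Hypothetical custom data format for tweets.
@dataclass
class TwitterData:
    url: str
    text: str
    hashtags: List[str] = field(default_factory=list)

    def to_scraped_data(self) -> ScrapedData:
        # Map twitter-specific fields into the shared schema, folding the
        # hashtags into the record so no information is lost.
        tags = " ".join(f"#{h}" for h in self.hashtags)
        return ScrapedData(url=self.url, title=tags, content=self.text)
```

Keeping the mapping inside the format class means any extra clean-up logic can be tested against TwitterData alone, without touching the scraper.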
Implementing a BaseScraper subclass
This step depends on the kind of scraper you want to write.
If you want a scrapy-based scraper, we provide a utility class to make it easier. Otherwise, we also provide a base class whose methods should be implemented to easily add a new scraper.
Scrapy based scrapers:
1) Create a scrapy spider as you would usually do, save it in scraper/spiders. Its parse method should return the data format selected in the previous step.
2) Create a file/module in scraper/scrapers implementing a class inheriting BaseScrapyScraper located in scraper/scrapers/base_scrapy_scraper.py.
3) The only thing that class should add is two class variables:
intended_domain : str = the domain intended to be scraped by this scraper
spider : Type[Spider] = the spider defined in step 1
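Putting steps 1–3 together, a scrapy-based scraper might look like the sketch below. The Spider and BaseScrapyScraper stubs stand in for scrapy.Spider and the class in scraper/scrapers/base_scrapy_scraper.py, and ElPitazoSpider/ElPitazoScraper are hypothetical names:

```python
from typing import Type

# Stand-ins for scrapy.Spider and for BaseScrapyScraper
# (scraper/scrapers/base_scrapy_scraper.py); the real classes do more.
class Spider:
    name: str = ""

class BaseScrapyScraper:
    intended_domain: str = ""
    spider: Type[Spider] = Spider

# Hypothetical spider from step 1; its parse method would return the
# data format chosen in the previous step.
class ElPitazoSpider(Spider):
    name = "el_pitazo"

class ElPitazoScraper(BaseScrapyScraper):
    # The subclass only needs to point at its domain and its spider.
    intended_domain = "elpitazo.com"
    spider = ElPitazoSpider
```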
From scratch
1) Define a new file/module in scraper/scrapers with a class inheriting and implementing BaseScraper class located in scraper/scrapers/base_scraper.py
2) Such a class should implement, at least, the following methods:
parse(self, response : Any) -> ScrapedData : extracts data from a successful page response (whose type may be arbitrary, depending on implementation details)
scrape(self, url : str) -> ScrapedData : retrieves the page at the given url and returns its scraped data
Note that every other method is still overridable.
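A from-scratch scraper could be sketched as follows. The ScrapedData and BaseScraper stubs and the PrimiciaScraper name are assumptions; the real base class lives in scraper/scrapers/base_scraper.py, and a real scrape would fetch the page over HTTP:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Tuple

# Stand-ins for the repo's classes; the real ScrapedData and BaseScraper
# live under scraper/ and may carry more fields and methods.
@dataclass
class ScrapedData:
    url: str
    content: str = ""

class BaseScraper(ABC):
    @abstractmethod
    def parse(self, response: Any) -> ScrapedData: ...

    @abstractmethod
    def scrape(self, url: str) -> ScrapedData: ...

# Hypothetical scraper for primicia.com.ve.
class PrimiciaScraper(BaseScraper):
    def parse(self, response: Tuple[str, str]) -> ScrapedData:
        # 'response' is whatever scrape() fetched; here, a (url, body) pair.
        url, body = response
        return ScrapedData(url=url, content=body.strip())

    def scrape(self, url: str) -> ScrapedData:
        # A real implementation would fetch the page (e.g. with requests);
        # the body is faked here to keep the sketch self-contained.
        body = "  <html>...</html>  "
        return self.parse((url, body))
```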
Wiring the new scraper
Just go to the scraper/settings.py file, import your new scraper, and add it to the INSTALLED_SCRAPERS list.
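The wiring itself is roughly a one-line change to the settings module (module and class names below are hypothetical):

```python
# scraper/settings.py (fragment; the real file may contain other settings)
from scraper.scrapers.el_pitazo_scraper import ElPitazoScraper  # hypothetical

INSTALLED_SCRAPERS = [
    ElPitazoScraper,
    # YourNewScraper,  # <- import it above and append it here
]
```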