To better define what we can and cannot retrieve during scraping, we need to experiment with a toy scraper.
Sources potentially needed:
[x] elpitazo.com
[ ] twitter.com
[ ] primicia.com.ve
[ ] efectotocuyo.com
[ ] laprensalara.com.ve
[ ] diariolosandes
[ ] elimpulso.com
[ ] el-carabobeno.com
[ ] cronica.uno
[ ] elnacional.com
[ ] eluniversal.com
Proposed Solution
By Luis
Scraper Creation
In this guide, we will go through the process of creating a new scraper, which can be summed up
in the following steps:
Select an output data format
Implement a BaseScraper subclass
Wire the new scraper into the list of installed ones
Selecting an output data format
Every page may expose different scrapable information: hashtags on Twitter, the news section name on a news site, and so on. In any case, we don't want to lose such valuable information. Select one of the available data formats if it fits your needs.
If no existing data format in scraper/scraped_data_classes fits your scrapable data, you can write a new one by creating a file in scraper/scraped_data_classes that implements the base class BaseDataFormat, located in scraper/scraped_data_classes/base_scraped_data.py. This class should implement the method
to_scraped_data(self) -> ScrapedData
That method maps from your data format to our currently supported database schema (represented by the ScrapedData class).
This is needed since scrapers may vary in their needs and scraped data. If, for instance, you require extra clean-up logic, you can write it over your custom data format and test it more easily.
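As a sketch, a custom data format might look like the following. Both the ScrapedData stand-in and the TwitterData class (including its fields) are illustrative assumptions; the real definitions live under scraper/scraped_data_classes and may differ:

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative stand-in for the project's ScrapedData class; the real one
# lives in scraper/scraped_data_classes/base_scraped_data.py and may differ.
@dataclass
class ScrapedData:
    url: str
    title: str = ""
    content: str = ""

# Hypothetical custom data format for tweets.
@dataclass
class TwitterData:
    url: str
    text: str
    hashtags: List[str] = field(default_factory=list)

    def to_scraped_data(self) -> ScrapedData:
        # Map twitter-specific fields into the shared schema, folding the
        # hashtags into the record so no information is lost.
        tags = " ".join(f"#{h}" for h in self.hashtags)
        return ScrapedData(url=self.url, title=tags, content=self.text)
```

Keeping the mapping inside the format class means any extra clean-up logic can be tested against TwitterData alone, without touching the scraper.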
Implementing a BaseScraper subclass
This step depends on the kind of scraper you want to write.
If you want a scrapy-based scraper, we provide a utility class to make it easier. Otherwise, we also provide a base class whose methods should be implemented to easily add a new scraper.
Scrapy based scrapers:
1) Create a scrapy spider as you would usually do, save it in scraper/spiders. Its parse method should return the data format selected in the previous step.
2) Create a file/module in scraper/scrapers implementing a class inheriting BaseScrapyScraper located in scraper/scrapers/base_scrapy_scraper.py.
3) The only thing that class should add is two class variables:
intended_domain : str = the domain intended to be scraped by this scraper
spider : Type[Spider] = the spider defined in step 1
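Putting steps 1–3 together, a scrapy-based scraper might look like the sketch below. The Spider and BaseScrapyScraper stubs stand in for scrapy.Spider and the class in scraper/scrapers/base_scrapy_scraper.py, and ElPitazoSpider/ElPitazoScraper are hypothetical names:

```python
from typing import Type

# Stand-ins for scrapy.Spider and for BaseScrapyScraper
# (scraper/scrapers/base_scrapy_scraper.py); the real classes do more.
class Spider:
    name: str = ""

class BaseScrapyScraper:
    intended_domain: str = ""
    spider: Type[Spider] = Spider

# Hypothetical spider from step 1; its parse method would return the
# data format chosen in the previous step.
class ElPitazoSpider(Spider):
    name = "el_pitazo"

class ElPitazoScraper(BaseScrapyScraper):
    # The subclass only needs to point at its domain and its spider.
    intended_domain = "elpitazo.com"
    spider = ElPitazoSpider
```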
From scratch
1) Define a new file/module in scraper/scrapers with a class inheriting and implementing BaseScraper class located in scraper/scrapers/base_scraper.py
2) Such a class should implement, at least, the following methods:
parse(self, response : Any) -> ScrapedData : extracts data from a successful page response (whose type may be arbitrary, depending on implementation details)
scrape(self, url : str) -> ScrapedData : retrieves the page at the given url and returns its scraped data
Note that every other method is still overridable.
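A from-scratch scraper could be sketched as follows. The ScrapedData and BaseScraper stubs and the PrimiciaScraper name are assumptions; the real base class lives in scraper/scrapers/base_scraper.py, and a real scrape would fetch the page over HTTP:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Tuple

# Stand-ins for the repo's classes; the real ScrapedData and BaseScraper
# live under scraper/ and may carry more fields and methods.
@dataclass
class ScrapedData:
    url: str
    content: str = ""

class BaseScraper(ABC):
    @abstractmethod
    def parse(self, response: Any) -> ScrapedData: ...

    @abstractmethod
    def scrape(self, url: str) -> ScrapedData: ...

# Hypothetical scraper for primicia.com.ve.
class PrimiciaScraper(BaseScraper):
    def parse(self, response: Tuple[str, str]) -> ScrapedData:
        # 'response' is whatever scrape() fetched; here, a (url, body) pair.
        url, body = response
        return ScrapedData(url=url, content=body.strip())

    def scrape(self, url: str) -> ScrapedData:
        # A real implementation would fetch the page (e.g. with requests);
        # the body is faked here to keep the sketch self-contained.
        body = "  <html>...</html>  "
        return self.parse((url, body))
```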
Wiring the new scraper
Just go to the scraper/settings.py file, import your new scraper, and add it to the INSTALLED_SCRAPERS list.
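The wiring itself is roughly a one-line change to the settings module (module and class names below are hypothetical):

```python
# scraper/settings.py (fragment; the real file may contain other settings)
from scraper.scrapers.el_pitazo_scraper import ElPitazoScraper  # hypothetical

INSTALLED_SCRAPERS = [
    ElPitazoScraper,
    # YourNewScraper,  # <- import it above and append it here
]
```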