We need a specific output data format for every scraper. Since each scraper may have its own set of scraped fields and its own cleaning logic, we need a place to put that logic so it is well organized and easy to test. At the same time, the database schema may be static, so we need a way to map each specific data format to a general one.
Proposed solution
Creation of 2 Base Classes: ScrapedData, BaseDataFormat
Every scraper should return a BaseDataFormat subclass
Every BaseDataFormat subclass should implement a to_scraped_data(self) -> ScrapedData method that maps its specific data format to the generic one
All those classes are dataclasses
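
A minimal sketch of how the two base classes and one subclass could fit together. The field names (url, title, content, headline, body) and the ExampleNewsData class are hypothetical and only illustrate the pattern; the actual fields live in the files listed under "Relevant files".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScrapedData:
    """Generic, database-facing format (fields here are assumed for illustration)."""
    url: str
    title: Optional[str] = None
    content: Optional[str] = None


@dataclass
class BaseDataFormat:
    """Interface that every scraper-specific data format implements."""

    def to_scraped_data(self) -> ScrapedData:
        raise NotImplementedError("subclasses must implement to_scraped_data")


@dataclass
class ExampleNewsData(BaseDataFormat):
    """Hypothetical scraper-specific format; its cleaning logic lives here."""
    url: str
    headline: str
    body: str

    def to_scraped_data(self) -> ScrapedData:
        # Clean scraper-specific fields, then map them to the generic format.
        return ScrapedData(
            url=self.url,
            title=self.headline.strip(),
            content=self.body.strip(),
        )
```

Keeping the cleaning inside to_scraped_data means each format can be unit tested without running a scraper.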
Relevant files:
scraper/scraped_data_classes/ : folder with output data related classes
scraper/scraped_data_classes/scraped_data.py : file with ScrapedData class, describing final data format
scraper/scraped_data_classes/base_scraped_data.py : file with BaseDataFormat class, describing interface to be implemented by specific data formats
scraper/scrapers/base_scraper.py : Signatures in BaseScraper class changed so they reflect this new behavior
scraper/scraper.py : changed scrape functions to convert output items into ScrapedData objects
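
The conversion step in the scrape functions could look roughly like the sketch below. The convert_items helper and the DummyData class are hypothetical names used for illustration; they are not taken from scraper/scraper.py, and the two base classes are reduced to stubs so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ScrapedData:
    # Stub of the generic format; the fields are assumed.
    url: str
    title: str


@dataclass
class BaseDataFormat:
    def to_scraped_data(self) -> ScrapedData:
        raise NotImplementedError


@dataclass
class DummyData(BaseDataFormat):
    # Hypothetical scraper-specific format used only for this example.
    url: str
    title: str

    def to_scraped_data(self) -> ScrapedData:
        return ScrapedData(url=self.url, title=self.title)


def convert_items(items: Iterable[BaseDataFormat]) -> List[ScrapedData]:
    """Map every scraper-specific item to the generic ScrapedData format."""
    return [item.to_scraped_data() for item in items]
```

With this shape, code downstream of the scrape functions only ever sees ScrapedData, regardless of which scraper produced the items.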
Additional work:
Moved tests to their corresponding folder in tests/scraper
Created a resource folder to store project-wide resources
Added primicia scraper based on this branch made by @marianelamin
Further work:
Given that there are already a few scrapers that share the same format, we should think about writing generic dataclasses for some common cases, such as news sites
Additional notes:
I apologize in advance for the size of this PR. The nox issue with my previous PR took longer than expected, so a lot of work stacked up in this branch. I'll try to keep PRs smaller in the future.