Persistency Scheme

Problem Description

We need a way to persist data computed with our library in some persistent storage, and it should be flexible enough that users can plug in their own custom persistency methods and store the data as they need.
Proposed solution
Create a new base class for storage manager objects that can be easily overridden. We propose a class with the following common operations on the stored data:
- get_matching(ScrapedData -> bool) -> [ScrapedData] - given a predicate, retrieve all stored objects matching it
- filter_scraped_urls([str]) -> [str] - filter out URLs already scraped from the given list, leaving only URLs still to be scraped
- was_scraped(str) -> bool - tell whether the given URL has already been scraped (so there's no need to scrape it again for new data)
- save([ScrapedData]) - save the provided data instances to disk
- delete([str]) - delete data related to the provided URLs from disk
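As a rough sketch, the base class for these operations might look like the following. The names mirror the operations listed above, but this is only an illustration; the actual class lives in c4v/scraper/persistency_manager/base_persistency_manager.py and may differ in details:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ScrapedData:
    # Minimal illustrative record; the real ScrapedData holds every scraped field
    url: str
    content: str = ""

class BasePersistencyManager(ABC):
    """Base class for storage managers; subclasses choose the actual backend."""

    @abstractmethod
    def get_matching(self, predicate: Callable[[ScrapedData], bool]) -> List[ScrapedData]:
        """Retrieve all stored objects matching the given predicate."""

    @abstractmethod
    def filter_scraped_urls(self, urls: List[str]) -> List[str]:
        """Return only the urls that have not been scraped yet."""

    @abstractmethod
    def was_scraped(self, url: str) -> bool:
        """Tell whether the given url was already scraped."""

    @abstractmethod
    def save(self, data: List[ScrapedData]) -> None:
        """Save the provided data instances to persistent storage."""

    @abstractmethod
    def delete(self, urls: List[str]) -> None:
        """Delete data related to the provided urls from storage."""
```

A user-provided storage backend would then just subclass this and implement the five methods.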
We also provide a JSON-based implementation, for local storage and testing purposes.
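A simplified, self-contained sketch of how a JSON-backed manager could implement these operations, storing records as a url -> record mapping in a single file. The class name and constructor here are illustrative assumptions, not the real JsonManager API:

```python
import json
import os
from typing import Callable, Dict, List

class SimpleJsonManager:
    """Toy JSON-backed storage: keeps a url -> record dict in one JSON file."""

    def __init__(self, file_path: str):
        self._file_path = file_path

    def _load(self) -> Dict[str, dict]:
        # Missing file means nothing has been stored yet
        if not os.path.exists(self._file_path):
            return {}
        with open(self._file_path) as f:
            return json.load(f)

    def _dump(self, data: Dict[str, dict]) -> None:
        with open(self._file_path, "w") as f:
            json.dump(data, f)

    def get_matching(self, predicate: Callable[[dict], bool]) -> List[dict]:
        return [record for record in self._load().values() if predicate(record)]

    def filter_scraped_urls(self, urls: List[str]) -> List[str]:
        stored = self._load()
        return [url for url in urls if url not in stored]

    def was_scraped(self, url: str) -> bool:
        return url in self._load()

    def save(self, records: List[dict]) -> None:
        data = self._load()
        for record in records:
            data[record["url"]] = record
        self._dump(data)

    def delete(self, urls: List[str]) -> None:
        data = self._load()
        for url in urls:
            data.pop(url, None)
        self._dump(data)
```

Note that this toy version re-reads the whole file on every call, which is exactly the kind of inefficiency the "Further work" section below proposes to address.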
Important changes:
- add a base class for persistency manager objects
- add a JSON-based implementation
How to test it
Import the JsonManager class:

from c4v.scraper.persistency_manager.json_storage_manager import JsonManager

Create an instance of this class and use it to test its corresponding methods.

Relevant files:
- c4v/scraper/persistency_manager/base_persistency_manager.py - base persistency manager class implementation
- c4v/scraper/persistency_manager/json_storage_manager.py - JSON-based implementation
Further work
- Write a more efficient implementation of the JSON manager
- Create a file manager class so we can use it instead of opening files manually; this way we can create fake files to improve testability of the JSON manager and future classes that work with files
- Gather ideas about other common operations that could be part of a persistency manager
- Include the current work in the CLI utility, so scraped data can be saved to disk
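One possible shape for the fake-files idea above: the JSON manager would receive an object with an open() method instead of calling the builtin directly, and tests would pass an in-memory stand-in. All names here are hypothetical:

```python
import io
from typing import Dict

class FakeFileManager:
    """In-memory stand-in for the filesystem, to be injected in unit tests."""

    def __init__(self):
        self._files: Dict[str, str] = {}

    def open(self, path: str, mode: str = "r") -> io.StringIO:
        if "w" in mode:
            # Return a buffer that copies itself back into our dict on close
            return _WriteBackBuffer(self, path)
        return io.StringIO(self._files.get(path, ""))

class _WriteBackBuffer(io.StringIO):
    """StringIO that saves its contents to the owning manager when closed."""

    def __init__(self, manager: FakeFileManager, path: str):
        super().__init__()
        self._manager, self._path = manager, path

    def close(self) -> None:
        self._manager._files[self._path] = self.getvalue()
        super().close()
```

A real file manager would have the same interface but delegate to the builtin open(), so production code and tests differ only in which manager gets injected.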