Persistency Scheme

Problem Description

We need a way to persist data computed with our library in some persistent storage, and it should be flexible enough that users can plug in their own custom persistency methods and store the data as they need.
Proposed solution
Create a new base class for storage manager objects that can be easily overridden. We propose a class with the following common operations on the stored data:
- get_matching(ScrapedData -> bool) -> [ScrapedData] - given a predicate, retrieve all stored objects matching it
- filter_scraped_urls([str]) -> [str] - filter out URLs already scraped from the given list, leaving only URLs still to be scraped
- was_scraped(str) -> bool - tell whether the given URL has already been scraped (so there's no need to scrape it again for new data)
- save([ScrapedData]) - save the provided data instances to disk
- delete([str]) - delete data related to the provided URLs from disk
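As a rough sketch, the base class for these operations might look like the following. The names mirror the operations listed above, but this is only an illustration; the actual class lives in c4v/scraper/persistency_manager/base_persistency_manager.py and may differ in details:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ScrapedData:
    # Minimal illustrative record; the real ScrapedData holds every scraped field
    url: str
    content: str = ""

class BasePersistencyManager(ABC):
    """Base class for storage managers; subclasses choose the actual backend."""

    @abstractmethod
    def get_matching(self, predicate: Callable[[ScrapedData], bool]) -> List[ScrapedData]:
        """Retrieve all stored objects matching the given predicate."""

    @abstractmethod
    def filter_scraped_urls(self, urls: List[str]) -> List[str]:
        """Return only the urls that have not been scraped yet."""

    @abstractmethod
    def was_scraped(self, url: str) -> bool:
        """Tell whether the given url was already scraped."""

    @abstractmethod
    def save(self, data: List[ScrapedData]) -> None:
        """Save the provided data instances to persistent storage."""

    @abstractmethod
    def delete(self, urls: List[str]) -> None:
        """Delete data related to the provided urls from storage."""
```

A user-provided storage backend would then just subclass this and implement the five methods.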
We also provide a JSON-based implementation, for local storage and testing purposes.
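A simplified, self-contained sketch of how a JSON-backed manager could implement these operations, storing records as a url -> record mapping in a single file. The class name and constructor here are illustrative assumptions, not the real JsonManager API:

```python
import json
import os
from typing import Callable, Dict, List

class SimpleJsonManager:
    """Toy JSON-backed storage: keeps a url -> record dict in one JSON file."""

    def __init__(self, file_path: str):
        self._file_path = file_path

    def _load(self) -> Dict[str, dict]:
        # Missing file means nothing has been stored yet
        if not os.path.exists(self._file_path):
            return {}
        with open(self._file_path) as f:
            return json.load(f)

    def _dump(self, data: Dict[str, dict]) -> None:
        with open(self._file_path, "w") as f:
            json.dump(data, f)

    def get_matching(self, predicate: Callable[[dict], bool]) -> List[dict]:
        return [record for record in self._load().values() if predicate(record)]

    def filter_scraped_urls(self, urls: List[str]) -> List[str]:
        stored = self._load()
        return [url for url in urls if url not in stored]

    def was_scraped(self, url: str) -> bool:
        return url in self._load()

    def save(self, records: List[dict]) -> None:
        data = self._load()
        for record in records:
            data[record["url"]] = record
        self._dump(data)

    def delete(self, urls: List[str]) -> None:
        data = self._load()
        for url in urls:
            data.pop(url, None)
        self._dump(data)
```

Note that this toy version re-reads the whole file on every call, which is exactly the kind of inefficiency the "Further work" section below proposes to address.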
Important changes:
- add a base class for persistency manager objects
- add a JSON-based implementation
How to test it
Import the JsonManager class:

from c4v.scraper.persistency_manager.json_storage_manager import JsonManager

Create an instance of this class and use it to test its corresponding methods.

Relevant files:
- c4v/scraper/persistency_manager/base_persistency_manager.py - base persistency manager class implementation
- c4v/scraper/persistency_manager/json_storage_manager.py - JSON-based implementation
Further work
- Write a more efficient implementation of the JSON manager
- Create a file manager class so we can use it instead of opening files manually; this way we can create fake files to improve testability of the JSON manager and future classes that work with files
- Gather ideas about other common operations that could be part of a persistency manager
- Include the current work in the CLI utility, so scraped data can be saved to disk
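One possible shape for the fake-files idea above: the JSON manager would receive an object with an open() method instead of calling the builtin directly, and tests would pass an in-memory stand-in. All names here are hypothetical:

```python
import io
from typing import Dict

class FakeFileManager:
    """In-memory stand-in for the filesystem, to be injected in unit tests."""

    def __init__(self):
        self._files: Dict[str, str] = {}

    def open(self, path: str, mode: str = "r") -> io.StringIO:
        if "w" in mode:
            # Return a buffer that copies itself back into our dict on close
            return _WriteBackBuffer(self, path)
        return io.StringIO(self._files.get(path, ""))

class _WriteBackBuffer(io.StringIO):
    """StringIO that saves its contents to the owning manager when closed."""

    def __init__(self, manager: FakeFileManager, path: str):
        super().__init__()
        self._manager, self._path = manager, path

    def close(self) -> None:
        self._manager._files[self._path] = self.getvalue()
        super().close()
```

A real file manager would have the same interface but delegate to the builtin open(), so production code and tests differ only in which manager gets injected.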