RFC: Harvesting kit v. 2

Dziolas commented 9 years ago

We all know that harvesting kit is not perfect. Here are some problems that it has:

code replication,
weak and unstandardised error handling,
different output (record structure) per publisher,
no easy way to add a new publisher/harvester to the kit,
each package is a mix of connection/package retrieval and data extraction/data manipulation.

Solution

Convert the current harvesting packages into plugins that follow the interfaces defined by the Harvester class,
Introduce a single Harvester class that takes care of the harvesting and record generation process (Unified Record Creation Process ;p):
- It loads a plugin (elsevier, iop, etc.) that defines utility functions the Harvester can use to get metadata from an XML file or to connect correctly to the content provider,
- Connecting to and downloading data from the content provider/publisher can be defined as a workflow,
- Utility functions from the plugin will return data in an internal (intermediary) data model defined by the Harvester class:
  - types,
  - sanitization,
  - cleaning of metadata,
  - all of those can be defined as part of the data model or the Harvester, so they will not need to be implemented each time,
- The record creation process is managed and defined in the Harvester: all records will always keep the same pattern of fields and field content regardless of the harvester (currently the same configuration needs to be provided for every package),
- Logs and errors will be better structured:
  - JSON,
  - unified error handling and logs.
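A minimal sketch of what such an internal data model and JSON log line could look like. The names InternalRecord, clean_text, missing_fields and log_event, as well as the choice of fields, are assumptions made for illustration; nothing here is existing harvesting kit code.

```python
import json
import logging
from dataclasses import dataclass, field
from typing import List


def clean_text(value: str) -> str:
    """Collapse runs of whitespace and strip both ends."""
    return " ".join(value.split())


@dataclass
class InternalRecord:
    """Hypothetical intermediate record shared by all plugins."""
    doi: str = ""
    title: str = ""
    authors: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Sanitization lives in the data model, so plugins do not repeat it.
        self.doi = clean_text(self.doi)
        self.title = clean_text(self.title)
        self.authors = [clean_text(a) for a in self.authors if a.strip()]

    def missing_fields(self) -> List[str]:
        """Return the names of required fields that are still empty."""
        return [name for name in ("doi", "title") if not getattr(self, name)]


def log_event(logger, level, event, **details):
    """Emit one JSON object per log line and return the serialized line."""
    line = json.dumps({"event": event, **details}, sort_keys=True)
    logger.log(level, line)
    return line
```

With something like this, a plugin only hands over raw strings, and an incomplete record can be reported in one uniform way, e.g. log_event(logger, logging.ERROR, "incomplete_record", missing=rec.missing_fields()).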

Other ideas:

packages will become auto-detected plugins, as in other modules
the CLI will stay as it is now (harvestingkit elsevier), but the parameters that can be passed to the scripts need to be standardized (or something)
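One possible shape for these two ideas, as a rough sketch only. The explicit register decorator stands in for real auto-detection, and the --from-date and --until options are hypothetical examples of standardized parameters, not an agreed interface.

```python
import argparse

PLUGINS = {}  # hypothetical registry: plugin name -> plugin class


def register(name):
    """Class decorator that records a plugin under the given name."""
    def decorator(cls):
        PLUGINS[name] = cls
        return cls
    return decorator


@register("elsevier")
class ElsevierPlugin:
    """Placeholder plugin; a real one would implement the harvesting logic."""

    def __init__(self, from_date=None, until=None):
        self.from_date = from_date
        self.until = until


def build_parser():
    """Build a parser whose options are the same for every plugin."""
    parser = argparse.ArgumentParser(prog="harvestingkit")
    parser.add_argument("plugin", choices=sorted(PLUGINS))
    parser.add_argument("--from-date", default=None)
    parser.add_argument("--until", default=None)
    return parser


def main(argv=None):
    """Parse the command line and instantiate the selected plugin."""
    args = build_parser().parse_args(argv)
    return PLUGINS[args.plugin](from_date=args.from_date, until=args.until)
```

The call main(["elsevier", "--from-date", "2015-01-01"]) then corresponds to running harvestingkit elsevier --from-date 2015-01-01, and every plugin receives the same set of options.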

Some pseudo-code sample:

har = Harvester("elsevier", "contrast-out")
har.full_harvest()

##########
class Harvester(object):
    def __init__(self, plugin_name, *args):
        # load_harvest_plugin is a placeholder for the plugin loader
        self.harvest_plugin = load_harvest_plugin(plugin_name, args)

    def source_updated(self):
        return self.harvest_plugin.source_updated()

    def full_harvest(self):
        if self.source_updated():
            self.run_download_workflow()
            self.extract_records()
            self.upload()

    def extract_records(self):
        # create_empty_internal_record is a placeholder for the data model
        rec = create_empty_internal_record()
        rec.doi = self.harvest_plugin.get_doi()
        rec.title = self.harvest_plugin.get_title()
        ...
        # something to make sure that the record is complete; it can also be
        # done on the fly while creating the record
        self.validate(rec)

Let me know what you think about this idea.

Dziolas commented 9 years ago

@jalavik and @kaplun what do you think?

kaplun commented 9 years ago

@Dziolas thanks for this proposal. Standardizing is indeed a great idea. The package could depend on the great flask-registry functionality to allow for auto-discovery of plugins. My only doubt currently concerns also having an OAI-PMH based extension (e.g. to also support Hindawi and, why not, arXiv). Could harvesting-kit become the entry point for harvesting, with OAI-PMH as one of the specific plugins?

Regarding validation, this is offered by the invenio-records APIs, in the form of validation against JSONSchema. Shall we assume harvesting-kit is actually an Invenio component, and thus allowed to exploit Invenio facilities?

jalavik commented 9 years ago

Generally, the steps will probably look like:

Retrieve the records (OAI-PMH, FTP, FS, REST)
Convert each record from the source format to an intermediate format (lxml, dojson)
Apply the wanted transformations to turn the raw intermediate format into an "enriched" intermediate format, and return it to the client.
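The three steps above can be sketched as a small generator pipeline. This is only an illustration: the function names, the in-memory list used as a source, and the article-title key are all assumptions, and a real retrieval step would talk to OAI-PMH, FTP, the file system or a REST API.

```python
def retrieve(source):
    """Step 1: yield raw records; here the source is just an in-memory list."""
    for raw in source:
        yield raw


def convert(raw):
    """Step 2: map a source-specific record to the intermediate format."""
    return {"title": raw.get("article-title", ""), "doi": raw.get("doi", "")}


def enrich(record):
    """Step 3: apply transformations to the raw intermediate record."""
    enriched = dict(record)  # copy, so the raw intermediate record is untouched
    enriched["title"] = " ".join(enriched["title"].split())
    return enriched


def harvest_pipeline(source):
    """Chain the three steps and hand enriched records back to the client."""
    for raw in retrieve(source):
        yield enrich(convert(raw))
```

Because each step is a plain function, a client can replace or skip any of them without touching the others.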

Even though knowledge of Invenio could give us access to a bunch of goodies, I suggest having Harvesting Kit provide useful utilities and common operations as a "leaf package". Then the clients (overlays or custom invenio-modules) decide the flow of things, pass parameters and make use of what Harvesting Kit offers. By using plugins (or contribs) as @Dziolas suggests we could also share the code to harvest and convert a certain feed. For example:

In your instance overlay (this is how you glue harvesting kit and your ingestion workflows):

# overlay/config.py

APS_URL = "http://example.org"

# overlay/tasks: (e.g. scheduled periodically in Celery Beat on the server)

# this function can be generalised (import_string etc. and celery args)
@celery.task
def harvest_aps(workflow, *args, **kwargs):
    from invenio_workflows.api import start_delayed
    from harvestingkit.contrib.aps import harvest, convert

    # TODO: some way to retrieve the last harvested date
    last_harvested_date = None

    # cfg is assumed to be the overlay's configuration object.
    # "from" is a reserved word in Python, so the keyword is spelled from_date.
    for harvested_record, harvested_files in harvest(
            url=cfg.get("APS_URL"),
            from_date=last_harvested_date,
        ):
        clean_record_dict = convert(harvested_record)
        payload = {
            "files": harvested_files,
            "record": clean_record_dict
        }
        start_delayed(workflow, data=[payload])
        # now your very own ingestion workflow takes over (processing)

Then in Harvesting Kit:

# harvestingkit/contrib/aps/__init__.py

from .getter import harvest
from .converter import convert

__all__ = ["harvest", "convert"]

# harvestingkit/contrib/aps/getter.py

def harvest(url, from_date=None, until=None, **kwargs):
    # "from" is a reserved word in Python, hence from_date
    # implement core retrieval logic (could be spread over several files)
    yield record, files

# harvestingkit/contrib/aps/converter.py

def convert(some_record):
    # implement conversion logic (could be using classes in common files or functions etc.)
    return cleaned_record

The main body of harvesting kit could then be solely for common utilities:

harvestingkit/ftp.py : contains ftp retrieval functions
harvestingkit/rest.py : contains REST retrieval functions
harvestingkit/extraction.py : contains unzip/untar functions
harvestingkit/text.py : contains text manipulation functions
harvestingkit/xml.py : contains xml manipulation functions
harvestingkit/sanitize.py : contains common sanitization functions
harvestingkit/jats.py : common extraction operations related to JATS formats
etc.
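As an illustration of how small such a common utility could be, here is a sketch of what harvestingkit/extraction.py might contain. The function name extract_package and its return value are assumptions, and the archives are assumed to come from a trusted provider, since extractall does not guard against malicious member paths on every Python version.

```python
import os
import tarfile
import zipfile


def extract_package(archive_path, target_dir):
    """Unpack a zip or tar archive into target_dir.

    Returns the sorted list of all file paths found under target_dir.
    """
    os.makedirs(target_dir, exist_ok=True)
    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as archive:
            archive.extractall(target_dir)
    else:
        raise ValueError("Unsupported archive format: %s" % archive_path)

    extracted = []
    for root, _dirs, files in os.walk(target_dir):
        for name in files:
            extracted.append(os.path.join(root, name))
    return sorted(extracted)
```

A publisher plugin would then call extract_package on a downloaded package and pass the resulting XML files to its converter, instead of carrying its own unzip/untar code.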

It's a slightly different take on the initial idea, but what do you think?

EDIT: For validation we can pass a jsonschema, or simply do it on the client-side in the celery task or workflow
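To make the client-side option concrete, here is a deliberately tiny stand-in for schema validation. It only understands the "required" list and per-property "type" of a JSON-Schema-like dict; validate_record is a hypothetical helper, not the jsonschema library or the invenio-records API, and real code would more likely rely on a full validator.

```python
def validate_record(record, schema):
    """Return a list of problems found in record.

    Supports only 'required' and per-property 'type' from the schema.
    """
    type_map = {"string": str, "array": list, "object": dict}
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append("missing required field: %s" % name)
    for name, rules in schema.get("properties", {}).items():
        expected = type_map.get(rules.get("type"))
        if name in record and expected and not isinstance(record[name], expected):
            errors.append("wrong type for field: %s" % name)
    return errors
```

In the Celery task, the client could call this on the converted record and only start the ingestion workflow when the returned list is empty.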

Dziolas commented 9 years ago

I like @jalavik's idea to keep harvestingkit separate from Invenio, as it is now. I think that flask-registry, as it is used in Invenio modules, is good too.

I will start implementing things and let you know on the progress.

CERNDocumentServer / harvesting-kit

RFC: Harvesting kit v. 2 #140