CERNDocumentServer / harvesting-kit

A kit containing various utilities and scripts related to content harvesting used in Invenio Software (http://invenio-software.org) instances such as INSPIRE (http://inspirehep.net) and SCOAP3 (http://scoap3.org)
GNU General Public License v2.0

RFC: Harvesting kit v. 2 #140

Open Dziolas opened 9 years ago

Dziolas commented 9 years ago

We all know that harvesting kit is not perfect. Here are some problems that it has:

Solution

Other ideas:

Some pseudo-code sample:

har = Harvester("elsevier", "contrast-out")
har.full_harvest()

##########
class Harvester(object):
    def __init__(self, plugin_name, *args):
        # load_harvest_plugin stands for whatever plugin lookup we end up with
        self.harvest_plugin = load_harvest_plugin(plugin_name, args)

    def source_updated(self):
        return self.harvest_plugin.source_updated()

    def full_harvest(self):
        if self.source_updated():
            self.run_download_workflow()
            self.extract_records()
            self.upload()

    def extract_records(self):
        rec = create_empty_internal_record()
        rec.doi = self.harvest_plugin.get_doi()
        rec.title = self.harvest_plugin.get_title()
        ...
        # make sure that the record is complete; this could also be done
        # on the fly while creating the record
        self.validate(rec)
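
For reference, a minimal runnable sketch of the same idea could look like the following. The dict registry, the stub ElsevierPlugin, its get_records() method and the sample DOI are assumptions made purely for illustration; none of them exists in harvesting kit today.

```python
class ElsevierPlugin(object):
    """Hypothetical plugin; a real one would talk to the publisher's feed."""

    def __init__(self, *args):
        self.args = args

    def source_updated(self):
        # Pretend there is always something new to harvest.
        return True

    def get_records(self):
        # Hard-coded sample data standing in for a downloaded package.
        return [{"doi": "10.1000/example.1", "title": "An example title"}]


# A plain dict registry standing in for real plugin discovery (assumption).
PLUGINS = {"elsevier": ElsevierPlugin}


def load_harvest_plugin(plugin_name, args):
    """Look up the plugin class by name and instantiate it."""
    return PLUGINS[plugin_name](*args)


class Harvester(object):
    REQUIRED_FIELDS = ("doi", "title")

    def __init__(self, plugin_name, *args):
        self.harvest_plugin = load_harvest_plugin(plugin_name, args)
        self.records = []

    def validate(self, record):
        """Raise ValueError if a required field is missing or empty."""
        missing = [f for f in self.REQUIRED_FIELDS if not record.get(f)]
        if missing:
            raise ValueError("incomplete record, missing: %s" % ", ".join(missing))

    def extract_records(self):
        """Build internal records from the plugin output, validating each one."""
        for raw in self.harvest_plugin.get_records():
            record = {"doi": raw.get("doi"), "title": raw.get("title")}
            self.validate(record)
            self.records.append(record)

    def full_harvest(self):
        """Harvest only when the plugin reports that the source has changed."""
        if self.harvest_plugin.source_updated():
            self.extract_records()
        return self.records
```

The download and upload steps are left out here; only the plugin lookup, extraction and validation are shown.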

Let me know what you think about this idea.

Dziolas commented 9 years ago

@jalavik and @kaplun what do you think?

kaplun commented 9 years ago

@Dziolas thanks for this proposal. Standardizing is indeed a great idea. The package could depend on the flask-registry functionality to allow for the auto-discovery of plugins. My only doubt at the moment concerns also having an OAI-PMH-based extension (e.g. to also support Hindawi and, why not, arXiv). Could harvesting-kit become the entry point for harvesting, with OAI-PMH as one of the specific plugins?

Regarding validation, this is offered by the invenio-records APIs, in the form of validation against JSONSchema. Shall we assume harvesting-kit is actually an Invenio component, thus allowed to exploit Invenio facts?

jalavik commented 9 years ago

Generally, the steps will probably look like:

  1. Retrieving the records (OAI-PMH, FTP, FS, REST)
  2. Converting each record from source format to intermediate format (lxml, dojson)
  3. Applying the wanted transformations to turn the raw intermediate format into an "enriched" intermediate format, and returning it to the client.
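
The three steps above can be sketched as a chain of small functions. This is only an illustration: the in-memory sample feed, the function names and the use of the standard library's xml.etree.ElementTree (instead of lxml or dojson) are assumptions, not a proposal for the actual API.

```python
import xml.etree.ElementTree as ET

# Made-up sample records standing in for a real feed (OAI-PMH, FTP, FS, REST).
SAMPLE_FEED = [
    "<article><doi>10.1000/example.1</doi><title>  first title </title></article>",
    "<article><doi>10.1000/example.2</doi><title>second title</title></article>",
]


def retrieve():
    """Step 1: yield the raw records one by one."""
    for raw in SAMPLE_FEED:
        yield raw


def to_intermediate(raw_xml):
    """Step 2: convert a source record into a plain dict."""
    root = ET.fromstring(raw_xml)
    return {"doi": root.findtext("doi"), "title": root.findtext("title")}


def enrich(record):
    """Step 3: apply transformations and return the enriched record."""
    enriched = dict(record)
    enriched["title"] = record["title"].strip().capitalize()
    return enriched


def harvest_pipeline():
    """Chain the three steps lazily, one record at a time."""
    for raw in retrieve():
        yield enrich(to_intermediate(raw))
```

Because every step is a generator or a plain function, the client can consume records one at a time and swap any single step without touching the others.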

Even though knowledge of Invenio could give us access to a bunch of goodies, I suggest having Harvesting Kit provide useful utilities and common operations as a "leaf package". Then the clients (overlays or custom invenio-modules) decide the flow of things, pass parameters and make use of what Harvesting Kit offers. By using plugins (or contribs) as @Dziolas suggests we could also share the code to harvest and convert a certain feed. For example:

In your instance overlay (this is how you glue harvesting kit and your ingestion workflows):

# overlay/config.py

APS_URL = "http://example.org"

# overlay/tasks: (e.g. scheduled periodically in Celery Beat on the server)

# this function can be generalised (import_string etc. and celery args)
@celery.task
def harvest_aps(workflow, *args, **kwargs):
    from invenio_workflows.api import start_delayed
    from harvestingkit.contrib.aps import harvest, convert

    # placeholder: some way to retrieve the last harvested date
    last_harvested_date = get_last_harvested_date()

    for harvested_record, harvested_files in harvest(
            url=cfg.get("APS_URL"),
            from_date=last_harvested_date,  # "from" is a reserved word in Python
        ):
        clean_record_dict = convert(harvested_record)
        payload = {
            "files": harvested_files,
            "record": clean_record_dict
        }
        start_delayed(workflow, data=[payload])
        # now your very own ingestion workflow takes over (processing)

Then in Harvesting Kit:

# harvestingkit/contrib/aps/__init__.py

from .getter import harvest
from .converter import convert

__all__ = ["harvest", "convert"]

# harvestingkit/contrib/aps/getter.py

def harvest(url, from_date, until=None, **kwargs):
    # implement core retrieval logic (could be spread over several files)
    yield record, files

# harvestingkit/contrib/aps/converter.py

def convert(some_record):
    # implement conversion logic (could be using classes in common files or functions etc.)
    return cleaned_record

The main body of harvesting kit could then be solely for common utilities:

harvestingkit/ftp.py : contains ftp retrieval functions
harvestingkit/rest.py : contains REST retrieval functions
harvestingkit/extraction.py : contains unzip/untar functions
harvestingkit/text.py : contains text manipulation functions
harvestingkit/xml.py : contains xml manipulation functions
harvestingkit/sanitize.py : contains common sanitization functions
harvestingkit/jats.py : common extraction operations related to JATS formats
etc.
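
As an illustration of what one of these utility modules might hold, here is a sketch of an unpacking helper for harvestingkit/extraction.py. The name extract_package and its signature are hypothetical; the sketch assumes Python 3 and relies on shutil.unpack_archive from the standard library, which handles both zip and tar archives.

```python
import os
import shutil
import tempfile


def extract_package(archive_path, target_dir=None):
    """Unpack a zip or tar archive and return the sorted extracted file paths.

    Only meant for archives from a trusted source: no path sanitization
    is done beyond what the standard library provides.
    """
    if target_dir is None:
        # No destination given: unpack into a fresh temporary directory.
        target_dir = tempfile.mkdtemp(prefix="harvest_")
    shutil.unpack_archive(archive_path, target_dir)

    extracted = []
    for dirpath, _dirnames, filenames in os.walk(target_dir):
        for name in filenames:
            extracted.append(os.path.join(dirpath, name))
    return sorted(extracted)
```

A contrib plugin could then call extract_package() on each downloaded package and pass the resulting file paths on to its converter.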

It's a slightly different take on the initial idea, but what do you think?

EDIT: For validation we can pass a jsonschema, or simply do it on the client-side in the celery task or workflow

Dziolas commented 9 years ago

I like @jalavik's idea of keeping harvestingkit separate from Invenio, as it is now. I think that flask-registry, as it is used in Invenio modules, is good too.

I will start implementing things and let you know about the progress.