Open Dziolas opened 9 years ago
@jalavik and @kaplun what do you think?
@Dziolas thanks for this proposal. Standardizing is indeed a great idea. The package could depend on the great flask-registry functionality to allow for the auto-discovery of plugins. My only doubt at the moment concerns also having an OAI-PMH based extension (e.g. to support Hindawi as well and, why not, arXiv). Could harvesting-kit become the entry point for harvesting, with OAI-PMH as one of the specific plugins?
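To make the "entry point plus plugins" idea concrete, here is a minimal sketch. It deliberately does not use flask-registry or its auto-discovery; it is a plain decorator-based registry standing in for it, and every name in it (`PLUGINS`, `register`, `get_harvester`, the plugin functions) is hypothetical.

```python
# Stand-in for a plugin registry: explicit registration via a decorator.
# This is NOT flask-registry's API, only an illustration of the idea.
PLUGINS = {}


def register(name):
    """Return a decorator that records a harvester function under `name`."""
    def decorator(func):
        PLUGINS[name] = func
        return func
    return decorator


@register("oai-pmh")
def harvest_oai_pmh(url, **kwargs):
    # Placeholder body: a real plugin would speak OAI-PMH to `url`.
    return iter(())


@register("aps")
def harvest_aps(url, **kwargs):
    # Placeholder body: a real plugin would fetch the APS feed.
    return iter(())


def get_harvester(name):
    """Look up a registered harvester, failing loudly for unknown names."""
    try:
        return PLUGINS[name]
    except KeyError:
        raise LookupError("No harvesting plugin registered as %r" % name)
```

With something like this, OAI-PMH is just one more registered name, and the client picks a harvester by name without caring how it retrieves records.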
Regarding validation, this is offered by the invenio-records APIs, in the form of validation against JSONSchema. Shall we assume harvesting-kit is actually an Invenio component, thus allowed to exploit Invenio facts?
Generally, the steps will probably look like:
Even though knowledge of Invenio could give us access to a bunch of goodies, I suggest having Harvesting Kit provide useful utilities and common operations as a "leaf package". Then the clients (overlays or custom invenio-modules) decide the flow of things, pass parameters and make use of what Harvesting Kit offers. By using plugins (or contribs) as @Dziolas suggests we could also share the code to harvest and convert a certain feed. For example:
In your instance overlay (this is how you glue harvesting kit and your ingestion workflows):
    # overlay/config.py
    APS_URL = "http://example.org"

    # overlay/tasks: (e.g. scheduled periodically in Celery Beat on the server)
    # this function can be generalised (import_string etc. and celery args)
    @celery.task
    def harvest_aps(workflow, *args, **kwargs):
        from invenio_workflows.api import start_delayed
        from harvestingkit.contrib.aps import harvest, convert

        # placeholder: some way to retrieve the last harvested date
        last_harvested = None
        for harvested_record, harvested_files in harvest(
            url=cfg.get("APS_URL"),
            from_date=last_harvested,  # "from" is a reserved word in Python
        ):
            clean_record_dict = convert(harvested_record)
            payload = {
                "files": harvested_files,
                "record": clean_record_dict,
            }
            start_delayed(workflow, data=[payload])
            # now your very own ingestion workflow takes over (processing)

Then in Harvesting Kit:

    # harvestingkit/contrib/aps/__init__.py
    from .getter import harvest
    from .converter import convert

    __all__ = ["harvest", "convert"]

    # harvestingkit/contrib/aps/getter.py
    def harvest(url, from_date, until, **kwargs):
        # implement core retrieval logic (could be spread over several files)
        yield record, files

    # harvestingkit/contrib/aps/converter.py
    def convert(some_record):
        # implement conversion logic (could be using classes in common files or functions etc.)
        return cleaned_record
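To show the `harvest`/`convert` contract end to end without Invenio or Celery, here is a self-contained sketch. The in-memory `FAKE_FEED`, the record fields and the `run` helper are all assumptions made for illustration, and the `from` argument is spelled `from_date` because `from` is a reserved word in Python.

```python
from datetime import date

# Hypothetical in-memory feed standing in for a real publisher endpoint.
FAKE_FEED = [
    {"id": "1", "title": "  First paper ", "published": date(2015, 1, 10),
     "files": ["1.xml"]},
    {"id": "2", "title": "Second paper", "published": date(2015, 3, 2),
     "files": ["2.xml", "2.pdf"]},
]


def harvest(url, from_date=None, until=None):
    """Yield (record, files) pairs published within the given date range."""
    for entry in FAKE_FEED:  # a real plugin would fetch from `url` here
        if from_date and entry["published"] < from_date:
            continue
        if until and entry["published"] > until:
            continue
        yield entry, entry["files"]


def convert(record):
    """Return a cleaned-up copy of a harvested record."""
    return {"id": record["id"], "title": record["title"].strip()}


def run(url, from_date=None):
    """Client-side glue: build the payloads a workflow would receive."""
    return [
        {"record": convert(record), "files": files}
        for record, files in harvest(url, from_date=from_date)
    ]


payloads = run("http://example.org", from_date=date(2015, 2, 1))
```

Because `harvest` is a generator, the client can start a workflow per record as soon as it is yielded instead of waiting for the whole feed.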
The main body of harvesting kit could then be solely for common utilities:
harvestingkit/ftp.py : contains ftp retrieval functions
harvestingkit/rest.py : contains REST retrieval functions
harvestingkit/extraction.py : contains unzip/untar functions
harvestingkit/text.py : contains text manipulation functions
harvestingkit/xml.py : contains xml manipulation functions
harvestingkit/sanitize.py : contains common sanitization functions
harvestingkit/jats.py : common extraction operations related to JATS formats
etc.
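As an illustration of what one of these common utilities might look like, here is a rough sketch of an unzip/untar helper for `harvestingkit/extraction.py`, using only the standard library. The function name `extract_archive` and its signature are assumptions, not existing Harvesting Kit code.

```python
import tarfile
import tempfile
import zipfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Unpack a zip or tar archive and return the extracted file paths.

    Note: this sketch does not guard against path traversal in
    untrusted archives; real code should check member names first.
    """
    archive_path = Path(archive_path)
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    if zipfile.is_zipfile(archive_path):
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(target_dir)
    elif tarfile.is_tarfile(archive_path):
        with tarfile.open(archive_path) as archive:
            archive.extractall(target_dir)
    else:
        raise ValueError("Unsupported archive format: %s" % archive_path)

    return sorted(p for p in target_dir.rglob("*") if p.is_file())


# Usage: build a small zip in a temporary directory and unpack it.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "feed.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("article.xml", "<article/>")
    extracted = extract_archive(archive, Path(tmp) / "out")
    extracted_names = [p.name for p in extracted]
```

Each contrib package could then call the same helper regardless of whether the publisher ships zip or tar packages.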
It's a slightly different take on the initial idea, but what do you think?
EDIT: For validation we can pass a jsonschema, or simply do it on the client side in the Celery task or workflow.
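Client-side validation can stay very small. The sketch below is a stdlib-only stand-in and not JSON Schema itself: it only checks required keys and types against a hypothetical `RECORD_SCHEMA` mapping. A real client would more likely hand a proper schema to a JSON Schema validator.

```python
# Hypothetical "schema": field name -> expected Python type.
# This is a simplified stand-in, not a JSON Schema document.
RECORD_SCHEMA = {"title": str, "authors": list}


def validation_errors(record, schema=RECORD_SCHEMA):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for field, expected_type in schema.items():
        if field not in record:
            errors.append("missing field: %s" % field)
        elif not isinstance(record[field], expected_type):
            errors.append("wrong type for field: %s" % field)
    return errors
```

In the Celery task, a record with a non-empty error list could be skipped or logged before any workflow is started.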
I like @jalavik's idea of keeping harvestingkit separate from Invenio, as it is now. I think that flask-registry, as it is used in Invenio modules, is good too.
I will start implementing things and keep you posted on the progress.
We all know that harvesting kit is not perfect. Here are some problems that it has:
Solution
Other ideas:
harvestingkit elsevier
but parameters that can be passed to scripts need to be standardized (or something)
Some pseudo-code sample:
Let me know what you think about this idea.
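One possible reading of the `harvestingkit elsevier` idea is a single command-line entry point with one sub-command per publisher and the same options for each. The following is only a sketch with `argparse`; the plugin list and the option names (`--from-date`, `--until-date`, `--output-dir`) are assumptions, not an agreed interface.

```python
import argparse

# Hypothetical plugin names; the thread only mentions "elsevier" and "aps".
PLUGINS = ("aps", "elsevier")


def build_parser():
    """Build a parser with one sub-command per plugin and shared options."""
    parser = argparse.ArgumentParser(prog="harvestingkit")
    subparsers = parser.add_subparsers(dest="plugin", required=True)
    for name in PLUGINS:
        sub = subparsers.add_parser(name, help="harvest the %s feed" % name)
        # Identical options for every plugin keep the interface standardized.
        sub.add_argument("--from-date", help="start date (YYYY-MM-DD)")
        sub.add_argument("--until-date", help="end date (YYYY-MM-DD)")
        sub.add_argument("--output-dir", default=".",
                         help="where converted records are written")
    return parser


# Equivalent to running: harvestingkit elsevier --from-date 2015-01-01
args = build_parser().parse_args(["elsevier", "--from-date", "2015-01-01"])
```

Since every sub-command shares the same options, a new publisher only needs a new name in the plugin list plus its own harvesting code.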