CenterForOpenScience / scrapi

A data processing pipeline that schedules and runs content harvesters, normalizes their data, and outputs that normalized data to a variety of output streams. This is part of the SHARE project, and will be used to create a free and open dataset of research (meta)data. Data collected can be explored at https://osf.io/share/, and viewed at https://osf.io/api/v1/share/search/. Developer docs can be viewed at https://osf.io/wur56/wiki
Apache License 2.0

Develop means of metadata schema versioning #170

Open erinspace opened 9 years ago

erinspace commented 9 years ago

Same as https://github.com/CenterForOpenScience/SHARE/issues/156

When a provider's metadata schema changes, we'll need to be able to specify which version of the schema we'd like to normalize against.

For example, when pubmed central changed their metadata, we could have used a versioned schema to apply those changes retroactively, while using the new schema going forward.

fabianvf commented 9 years ago

Proposal: Create a Schema class that is initialized with 2 values: the first a function that takes a metadata record and returns a boolean, and the second a dictionary that defines the schema (what we currently have). Create a field in the harvesters called schemas, which is a list of Schema objects. When normalizing a document, iterate through that list and use the first schema that returns True when given the metadata document. The last Schema entry in the list will be considered the default (it will have a function that always returns True). @jeffspies, @chrisseto, @erinspace, thoughts?

import abc


class Schema(object):
    def __init__(self, schema, fn=lambda doc: True):
        # schema: dict defining the schema; fn: predicate that takes a
        # metadata record and returns True if this schema applies to it
        self.schema = schema
        self.matches = fn


class BaseHarvester(object):
    __metaclass__ = abc.ABCMeta

    def schema_for(self, doc):
        # Return the first Schema whose predicate matches the document.
        # The last entry in schemas acts as the default, since its
        # predicate always returns True.
        for schema in self.schemas:
            if schema.matches(doc):
                return schema
        raise ValueError('No matching schema for document')

    @abc.abstractproperty
    def schemas(self):
        raise NotImplementedError
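
A harvester could then declare an ordered list of Schema objects, with a predicate that inspects each record to pick the right version. The harvester name, the uses_new_format predicate, and the schema dicts below are hypothetical, just a sketch of how the pubmed central case mentioned above might look:

def uses_new_format(doc):
    # Hypothetical predicate: assume records produced after the provider's
    # metadata change carry some marker we can detect in the raw document.
    return 'article-meta-v2' in doc


class PubMedCentralHarvester(BaseHarvester):
    # Illustrative schema dicts; the real ones would map normalized
    # fields to the appropriate keys/paths in the provider metadata.
    new_schema = Schema({'title': 'article-meta-v2/title'}, uses_new_format)
    old_schema = Schema({'title': 'article-meta/title'})  # default, always matches

    @property
    def schemas(self):
        # Ordered: most specific first, default last.
        return [self.new_schema, self.old_schema]

Normalization would then call schema_for(doc) on each record, so pre-change pubmed central documents keep normalizing against the old schema while newer documents pick up the new one.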