CenterForOpenScience / scrapi

A data processing pipeline that schedules and runs content harvesters, normalizes their data, and outputs that normalized data to a variety of output streams. This is part of the SHARE project, and will be used to create a free and open dataset of research (meta)data. Data collected can be explored at https://osf.io/share/, and viewed at https://osf.io/api/v1/share/search/. Developer docs can be viewed at https://osf.io/wur56/wiki
Apache License 2.0

Develop means of metadata schema versioning #170

Open erinspace opened 9 years ago

erinspace commented 9 years ago

Same as https://github.com/CenterForOpenScience/SHARE/issues/156

When a provider's metadata schema changes, we'll need to be able to specify which version of the schema we'd like to normalize against.

For example, when pubmed central changed their metadata, we could have used a versioned schema to apply those changes retroactively, while using the new schema going forward.

fabianvf commented 9 years ago

Proposal: Create a Schema class that is initialized with 2 values: the first a function that takes a metadata record and returns a boolean, and the second a dictionary that defines the schema (what we currently have). Create a field in the harvesters called schemas, which is a list of Schema objects. When normalizing a document, iterate through that list and use the first schema that returns True when given the metadata document. The last Schema entry in the list will be considered the default (it will have a function that always returns True). @jeffspies, @chrisseto, @erinspace, thoughts?

import abc


class Schema(object):
    def __init__(self, schema, fn=lambda doc: True):
        # schema: dict defining the schema; fn: predicate that takes a
        # metadata record and returns True if this schema applies to it
        self.schema = schema
        self.matches = fn


class BaseHarvester(object):
    __metaclass__ = abc.ABCMeta

    def schema_for(self, doc):
        # Return the first Schema whose predicate matches the document.
        # The last entry in schemas acts as the default, since its
        # predicate always returns True.
        for schema in self.schemas:
            if schema.matches(doc):
                return schema
        raise ValueError('No matching schema for document')

    @abc.abstractproperty
    def schemas(self):
        raise NotImplementedError
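
A harvester could then declare an ordered list of Schema objects, with a predicate that inspects each record to pick the right version. The harvester name, the uses_new_format predicate, and the schema dicts below are hypothetical, just a sketch of how the pubmed central case mentioned above might look:

def uses_new_format(doc):
    # Hypothetical predicate: assume records produced after the provider's
    # metadata change carry some marker we can detect in the raw document.
    return 'article-meta-v2' in doc


class PubMedCentralHarvester(BaseHarvester):
    # Illustrative schema dicts; the real ones would map normalized
    # fields to the appropriate keys/paths in the provider metadata.
    new_schema = Schema({'title': 'article-meta-v2/title'}, uses_new_format)
    old_schema = Schema({'title': 'article-meta/title'})  # default, always matches

    @property
    def schemas(self):
        # Ordered: most specific first, default last.
        return [self.new_schema, self.old_schema]

Normalization would then call schema_for(doc) on each record, so pre-change pubmed central documents keep normalizing against the old schema while newer documents pick up the new one.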