Open erinspace opened 9 years ago
Proposal: Create a Schema class that is initialized with two values: a dictionary that defines the schema (what we currently have), and a function that takes a metadata record and returns a boolean. Add a field to the harvesters called schemas, which is a list of Schema objects. When normalizing a document, iterate through that list and use the first Schema whose function returns True for the metadata document. The last Schema in the list is treated as the default (its function always returns True). @jeffspies, @chrisseto, @erinspace, thoughts?
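import abc


class Schema(object):
    def __init__(self, schema, fn=lambda x: True):
        self.schema = schema  # dictionary that defines the schema
        self.matches = fn  # predicate: metadata record -> bool


class BaseHarvester(object):
    def schema(self, doc):
        # The first schema whose predicate accepts the document wins;
        # the last entry in self.schemas acts as the default.
        matches = [s for s in self.schemas if s.matches(doc)]
        assert matches, 'no schema matched the document'
        return matches[0]

    @abc.abstractproperty
    def schemas(self):
        raise NotImplementedError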
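A minimal usage sketch of the proposal, to show how the ordered list and the catch-all default would interact. `ExampleHarvester`, the `schema_for` method name, the `format` key on the record, and the field mappings are all hypothetical and only there for illustration; the two base classes are restated in simplified form so the snippet runs on its own.

```python
class Schema(object):
    """Pairs a schema dict with a predicate that decides whether it applies."""

    def __init__(self, schema, fn=lambda doc: True):
        self.schema = schema
        self.matches = fn


class BaseHarvester(object):
    schemas = []  # subclasses override with an ordered list of Schema objects

    def schema_for(self, doc):
        # First match wins; the last entry should be a catch-all default.
        for candidate in self.schemas:
            if candidate.matches(doc):
                return candidate
        raise LookupError('no schema matched the document')


class ExampleHarvester(BaseHarvester):  # hypothetical harvester
    schemas = [
        # Hypothetical mapping used only for records flagged as oai_dc.
        Schema({'title': 'dc:title'}, lambda doc: doc.get('format') == 'oai_dc'),
        # Default: the predicate always returns True.
        Schema({'title': 'title'}),
    ]


harvester = ExampleHarvester()
oai = harvester.schema_for({'format': 'oai_dc'})
fallback = harvester.schema_for({'format': 'other'})
print(oai.schema)
print(fallback.schema)
```

Because the default always matches, its position matters: anything listed after it would never be selected.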
Same as https://github.com/CenterForOpenScience/SHARE/issues/156
When a provider's schema changes, we'll need to be able to specify which version of the schema to normalize against.
For example, when PubMed Central changed their metadata, a versioned schema would have let us keep normalizing older records against the old schema while using the new schema for records going forward.
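One possible way to express versioning with the predicate-based Schema idea is to match on a date carried by the record. This is only a sketch: the `CUTOFF` date, the `date` key, the `pick_schema` helper, and both field mappings are invented for illustration, and the actual date of the PubMed Central change is not given here.

```python
from datetime import date


class Schema(object):
    def __init__(self, schema, fn=lambda doc: True):
        self.schema = schema
        self.matches = fn


# Hypothetical date on which the provider switched metadata formats.
CUTOFF = date(2015, 1, 1)

schemas = [
    # Newer version: applies to records dated on or after the cutoff.
    Schema({'title': 'article-title'}, lambda doc: doc['date'] >= CUTOFF),
    # Older version: the default, so earlier records fall through to it.
    Schema({'title': 'title'}),
]


def pick_schema(doc):
    """Return the first schema whose predicate accepts the record."""
    return next(s for s in schemas if s.matches(doc))


new_style = pick_schema({'date': date(2016, 3, 1)})
old_style = pick_schema({'date': date(2014, 6, 30)})
```

With this arrangement, re-normalizing old records picks up the old mapping without any per-record configuration.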