NF: Adds metadata translation functionality in dedicated class

datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata

MIT License

This PR:

- Is in response to this comment: https://github.com/datalad/datalad-catalog/issues/224#issuecomment-1401637670
- Builds on top of and is made against: https://github.com/datalad/datalad-catalog/pull/237
- Adds the abstract base class TranslatorBase, from which any extension providing a new metadata translator should inherit (this closely follows the metalad implementation of a base extractor class)
  - By overriding a number of base class definitions, translators should provide the name and version of the extractor as well as the version of the catalog schema that they are compatible with. Translators can also provide their own translation logic (which may or may not depend on e.g. jq)
- Adds a Translate class, which is instantiated with a metadata record in order to:
  - match the incoming metadata to an appropriate translator (by inspecting translators added as entry points and returning their match methods)
  - run metadata translation if an appropriate translator is found
- Adds translate as a catalog subcommand (to be refactored later in bulk via https://github.com/datalad/datalad-catalog/issues/245)
- Adds translator implementations for datacite_gin, bids_dataset, metalad_studyminimeta, and metalad_core, based on the above classes as well as @mslw's implementation here: https://github.com/mslw/datalad-wackyextra/blob/main/datalad_wackyextra/translators/datacite.py
- Updates all schemas to comply with the refactored config / metadata_sources setup (see https://github.com/datalad/datalad-catalog/pull/237)
- Updates workflows.py to use the added Translator functionality (removing old translator scripts)
- Updates all existing tests to account for the changes in schemas, translators and workflows
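The base class and matching flow described in the list above could look roughly like the sketch below. It is an illustration only, not the code from this PR: get_supported_schema_version(), DemoTranslator, find_translator() and the record keys (extractor_name, extractor_version, extracted_metadata) are assumptions made for the example, and the candidate translators are passed in as a plain list instead of being discovered via entry points.

```python
import abc


class TranslatorBase(abc.ABC):
    """Illustrative stand-in for the abstract base class described above."""

    @abc.abstractmethod
    def get_supported_extractor_name(self) -> str:
        """Name of the extractor whose output this translator handles."""

    @abc.abstractmethod
    def get_supported_extractor_version(self) -> str:
        """Version of the extractor whose output this translator handles."""

    @abc.abstractmethod
    def get_supported_schema_version(self) -> str:
        """Catalog schema version of the output (hypothetical method name)."""

    @abc.abstractmethod
    def translate(self, metadata: dict) -> dict:
        """Turn an extractor record into a catalog-compatible record."""

    def match(self, source_name: str, source_version: str) -> bool:
        # Default behaviour: a plain 1:1 comparison of name and version.
        return (
            source_name == self.get_supported_extractor_name()
            and source_version == self.get_supported_extractor_version()
        )


class DemoTranslator(TranslatorBase):
    """Hypothetical translator, used only to exercise the sketch."""

    def get_supported_extractor_name(self) -> str:
        return "demo_extractor"

    def get_supported_extractor_version(self) -> str:
        return "0.0.1"

    def get_supported_schema_version(self) -> str:
        return "1.0.0"

    def translate(self, metadata: dict) -> dict:
        # Custom translation logic lives here (it could call jq, or not).
        return {"type": "dataset", "name": metadata["extracted_metadata"]["title"]}


def find_translator(metadata: dict, translators: list):
    """Return an instance of the first translator that matches, else None."""
    for translator_class in translators:
        instance = translator_class()
        if instance.match(metadata["extractor_name"], metadata["extractor_version"]):
            return instance
    return None
```

With this layout, a caller first looks up a translator for the record and only runs translate() if one was found, which mirrors the "match, then translate" steps listed for the Translate class.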

TODO:

[ ] add translator tests for core (dataset and file), studyminimeta, bids_dataset
[x] update documentation (to be done in bulk as part of https://github.com/datalad/datalad-catalog/pull/237 once this current PR is merged)

Old (but still perhaps useful to have documented here)

Sample testing code 1:

from pathlib import Path
from datalad_catalog import (
    translate,
    utils,
)
# Load a sample metadata record from the test data
metadata_file = Path('datalad_catalog/tests/data/metadata_datacite_gin.json')
metadata_record = utils.read_json_file(metadata_file)
# Find a matching translator for the record and run it
translate.Translate(metadata_record).run_translator()

Sample testing code 2:

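from datalad_catalog.translate import get_translators
# Collect the translators registered as entry points
ts = get_translators()
ts_datacite = ts['datacite_gin_translator']
# The first call loads the entry point, the second call instantiates the result
inst = ts_datacite['loader']()()
# Check whether this translator handles the given extractor name and version
inst.match('datacite_gin', '0.0.1')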

@mslw:

This is a massive change, in the best meaning of this word ;) Thanks for preparing these changes. I did not play with the PR, so it's a code review in a literal sense.

In the proposed form, TranslatorBase.match() takes the source name and version. It would be good to also have the source ID as an optional argument, so that we could utilise this feature of MetaLad extractors.

Agree 👍

If a translator wanted to implement more complex logic than a 1:1 match of version and name (and I think this should be left to the specific translator, since extractors may use whatever versioning scheme they like), it would need to override match() and still provide get_supported_extractor_version() and get_supported_extractor_name() anyway (if I understand @abc.abstractmethod correctly, a derived class needs to override all abstract methods before it can be instantiated). If the implementation of TranslatorBase were up to me, I would do without get_supported_extractor_version() and get_supported_extractor_name(), and instead make match() an abstract class method:

@classmethod
@abstractmethod
def match(cls, source_name: str, source_version: str, source_id: str | None = None) -> bool:

This way, match logic would be left entirely to the specific implementation, finding a matching translator class would not require instantiation, and splitting translator implementation into two classes (one for abstract methods, one for translation) would not be necessary.
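A minimal sketch of this proposal, to show how a translator class could be selected without being instantiated. DataciteGinTranslator and select_translator() are hypothetical names for the example, the translate() body is a placeholder, and Optional[str] is used in place of str | None only so the snippet also runs on older Python versions.

```python
from abc import ABC, abstractmethod
from typing import Optional


class TranslatorBase(ABC):
    @classmethod
    @abstractmethod
    def match(
        cls,
        source_name: str,
        source_version: str,
        source_id: Optional[str] = None,
    ) -> bool:
        """Return True if this translator can handle the given source."""

    @abstractmethod
    def translate(self, metadata: dict) -> dict:
        """Translate one metadata record."""


class DataciteGinTranslator(TranslatorBase):
    """Hypothetical translator; its matching rule is entirely its own."""

    @classmethod
    def match(cls, source_name, source_version, source_id=None):
        return source_name == "datacite_gin" and source_version == "0.0.1"

    def translate(self, metadata):
        # Placeholder output, standing in for real translation logic.
        return {"translated_from": "datacite_gin"}


def select_translator(candidates, source_name, source_version, source_id=None):
    """Return the first matching translator class (not an instance), else None."""
    for translator_class in candidates:
        # match() is called on the class itself, so nothing is instantiated here.
        if translator_class.match(source_name, source_version, source_id):
            return translator_class
    return None
```

Only the class that matched needs to be instantiated afterwards, for example with select_translator(...)() followed by a call to translate().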

I agree with this reasoning. One thing I'm hesitant about is the scenario where translators inheriting from the base class implement complex matching algorithms (in their overridden match method) that need to access instance methods. This would not be possible in a class method (e.g. I wouldn't be able to call self.get_supported_extractor_version() from it), so such logic would have to be implemented in a different class, or maybe in module-level functions in the same module. This is not necessarily a big issue and can be done, but perhaps there's a less hacky alternative? (Update: I guess these other methods could also be made class methods themselves...)
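The option mentioned in the update can be sketched as follows: if the helper methods are class methods too, a class-level match() can still call them through cls. This is an assumption-laden illustration, not the PR's code; MajorVersionTranslator and its "same major version" rule are invented for the example, as is the version string it returns.

```python
from abc import ABC, abstractmethod


class TranslatorBase(ABC):
    @classmethod
    @abstractmethod
    def get_supported_extractor_name(cls) -> str:
        """Extractor name handled by this translator."""

    @classmethod
    @abstractmethod
    def get_supported_extractor_version(cls) -> str:
        """Extractor version handled by this translator."""

    @classmethod
    def match(cls, source_name, source_version, source_id=None) -> bool:
        # Default 1:1 match, built only from class-level helpers.
        return (
            source_name == cls.get_supported_extractor_name()
            and source_version == cls.get_supported_extractor_version()
        )


class MajorVersionTranslator(TranslatorBase):
    """Hypothetical translator that accepts any version with the same major number."""

    @classmethod
    def get_supported_extractor_name(cls) -> str:
        return "metalad_core"

    @classmethod
    def get_supported_extractor_version(cls) -> str:
        return "0.0.1"  # illustrative value only

    @classmethod
    def match(cls, source_name, source_version, source_id=None) -> bool:
        # More complex logic can still reuse the helpers via cls.
        supported_major = cls.get_supported_extractor_version().split(".")[0]
        return (
            source_name == cls.get_supported_extractor_name()
            and source_version.split(".")[0] == supported_major
        )
```

The trade-off is that the helpers can no longer depend on per-instance state, which seems acceptable as long as extractor name and version are fixed properties of a translator class.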

NF: Adds metadata translation functionality in dedicated class #246
