datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License
15 stars 12 forks source link

NF: Adds metadata translation functionality in dedicated class #246

Closed jsheunis closed 1 year ago

jsheunis commented 1 year ago

This PR:

TODO:

Old (but still perhaps useful to have documented here)

Sample testing code 1:

from pathlib import Path
from datalad_catalog import (
    translate,
    utils,
)
metadata_file = Path('datalad_catalog/tests/data/metadata_datacite_gin.json')
metadata_record = utils.read_json_file(metadata_file)
translate.Translate(metadata_record).run_translator()

Sample testing code 2:

from datalad_catalog.translate import get_translators
ts = get_translators()
ts_datacite = ts['datacite_gin_translator']
inst = ts_datacite['loader']()()
inst.match('datacite_gin', '0.0.1')
jsheunis commented 1 year ago

@mslw:

This is a massive change, in the best meaning of this word ;) Thanks for preparing these changes. I did not play with the PR, so it's a code review in a literal sense.

In the proposed form, TranslatorBase.match() takes source name and version - it would be good to have source ID it as an optional argument, so that we could utilise this feature of MetaLad extractors.

Agree 👍

If a translator wanted to implement a more complex logic than 1:1 match of version & name (and I think it should be left to the specific translator to implement - extractors may use whatever versioning schema), it would need to override match(), and still provide get_supported_extractor_version() & get_supported_extractor_name() anyway (if I understand @abc.abstractmethod correctly, a derived class needs to override all extracted methods before it can be instantiated). If the implementation of a TranslatorBase was up to me, I would do without get_supported_extractor_version() & get_supported_extractor_name(), and instead make match() an abstract class method:

@classmethod @abstractmethod def match(cls, source_name: str, source_version: str, source_id: str | None = None) -> bool: This way, match logic would be left entirely to the specific implementation, finding a matching translator class would not require instantiation, and splitting translator implementation into two classes (one for abstract methods, one for translation) would not be necessary.

I agree with this reasoning. One thing that I'm hesitant about is what to do in the scenario where translators inheriting from the base class create complex matching algorithms (in their overridden match method) that need to access instance methods because otherwise the. This would not be possible (e.g. I won't be able to access the self.get_supported_extractor_version() in the new class function) and such logic would have to be implemented in a different class or maybe in class-less functions in the same module get_supported_extractor_version(). This is not necessarily a big issue and can be done, but perhaps there's a less hacky alternative? (Update: I guess these other methods could also be more class methods themselves...)