datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License
15 stars 12 forks source link

Define ideal user experience and workflow for `catalog-translate` #324

Closed jsheunis closed 1 year ago

jsheunis commented 1 year ago

Relevant existing issues:

As part of https://github.com/datalad/datalad-catalog/pull/309 I am creating a new interface for Translate, with one of the goals being to fix existing issues related to translation functionality. I am writing a summary here of the existing functionality and the proposed changes, to get feedback on it before I waste time on implementing the wrong stuff.

Old

The current approach is:

datalad catalog translate --metadata <metalad-metadata-to-be-translated>

New

The envisioned approach v1:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated>

The envisioned approach v2:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated> --translator <name(s) of translator(s)>

The problem with the v2 approach is that the manner in which the translators are supplied as arguments to datalad catalog-translate would have to include the same information that is used by the matching method of these translators, which could vary across translators since these are implemented on the inherited translator classes themselves (and not on the base class). So a call would have to be something like:

datalad catalog-translate --metadata <metalad-metadata-to-be-translated> --translator metalad_core_v0.2.2

and that would have to be parsed in order to allow complete matching. I'm thinking more and more this is unnecessarily complicated and that we should stick with new v1 approach.

Opinions?

mslw commented 1 year ago

For v1, you'd need the factory to index translators by all properties given to match(). IIRC these are: name, version, and (optional) UUID. This should be doable. And if successful, would not introduce two "modes" of processing.

I am not able to predict what performance gain that would have in practice (I have been wrong once on this topic already), but probably it's worth a try to write a basic factory, and try it on a synthetically generated, large-ish set of metadata (either from one extractor, or from multiple, interleaved).

I don't currently understand the argument why the v2 specification must be complex - why not entrypoint name alone? But if you feel v1 may be better, it won't hurt to try v1 first.

jsheunis commented 1 year ago

I am not able to predict what performance gain that would have in practice (I have been wrong once on this topic already), but probably it's worth a try to write a basic factory, and try it on a synthetically generated, large-ish set of metadata (either from one extractor, or from multiple, interleaved).

good idea

I don't currently understand the argument why the v2 specification must be complex - why not entrypoint name alone? But if you feel v1 may be better, it won't hurt to try v1 first.

An example of when entrypoint alone would not be enough would be if there are two translators compatible with an extractor of the same name, but different versions. One would either have to make some assumptions to e.g. select the latest, or the user would have to specify both name and version, or both translators would have to be deliberately instantiated and made part of the factory store and matched correctly at a later stage. All of these requires some decision trees that will turn into additional code, and I don't think this adds much benefit to the user in terms of improved UX (the v1 alternative actually feels easier because it's fewer arguments to be concerned with) nor in terms of improved performance over the v1 alternative (in both cases the matching and instantiation for a to-be-used translator would only happen once)

jsheunis commented 1 year ago

Pragmatic approach implemented in https://github.com/datalad/datalad-catalog/pull/309. This might not be ideal forever, but closing the issue until such time as this becomes insufficient.