datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

ENH: allow config to specify priority of sources #233

Closed jsheunis closed 1 year ago

jsheunis commented 1 year ago

Should be implemented in parallel with https://github.com/datalad/datalad-catalog/issues/82.

The current decision tree for metadata source specification:

  1. If a single source is specified in config, display that source if it is provided in incoming metadata; if it is not provided, display on a first-come, first-served basis (until the specified source is provided).
  2. If a list of sources is specified, display them all as options (with the first arbitrarily being the default) and let the user select which one to render in the browser.
  3. If merge is specified, merge all incoming sources.
  4. If nothing is specified in config, first come, first served.
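
To make the four rules above concrete, here is a minimal sketch of the decision tree as a single function. It is not the catalog's actual code: the function name `decide`, the action labels, and the use of the string "merge" as a config value are all assumptions for illustration, as is ignoring sources that are not in a configured list.

```python
from typing import List, Optional, Union

# Hypothetical config shape: None, the keyword "merge",
# a single source name, or a list of source names.
SourceConfig = Union[None, str, List[str]]


def decide(config: SourceConfig, incoming: str, stored: Optional[str]) -> str:
    """Return an (illustrative) action for one incoming metadata value.

    `incoming` is the source of the new value, `stored` is the source of
    the value already in the catalog (None if nothing is stored yet).
    """
    if config == "merge":
        return "merge"  # rule 3: merge all incoming sources
    if isinstance(config, list):
        # rule 2: listed sources become selectable options
        # (dropping unlisted sources is an assumption of this sketch)
        return "offer_as_option" if incoming in config else "ignore"
    if isinstance(config, str):
        if incoming == config:
            return "overwrite"  # rule 1: the specified source always wins
        # rule 1 fallback: first come, first served until it arrives
        return "fill" if stored is None else "keep"
    # rule 4: nothing configured, first come, first served
    return "fill" if stored is None else "keep"
```

Writing it this way also shows that rule 1's fallback and rule 4 are the same branch, which could be a starting point for generalising the rules.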

Ideally, the list of sources should also be able to connect a priority number to each element in the list (i.e. switch to dictionary). And the catalog metadata should keep track of which content comes from which source, in order to allow updates to be readjusted accordingly.

@mslw please add anything I'm missing.

jsheunis commented 1 year ago

I'm thinking again about:

If a list of sources is specified, display them all as options (with first arbitrarily being default) and let user select which one to render in browser.

On second thought, I actually don't think it's that smart to leave the display choice up to the user. If we think about the likely average user of a catalog, they will probably have zero context for where a particular metadata item in a dataset comes from, or how it was aggregated from various sources. They just want to see the available metadata. So the UI options might be confusing.

Perhaps there is room for another item in the decision tree though: a list of sources to merge, i.e. merge all incoming sources if they are in the provided list; if not, ignore.

jsheunis commented 1 year ago

Might make sense to find a generalised way of dealing with these rules, so that adding future rules won't require extensive refactoring.
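
One possible shape for such a generalised mechanism is a small registry that maps a rule name to a handler function, so that a new rule is just a new registered function. This is only a sketch: `RULES`, `register`, `apply_rule` and the rule names are hypothetical and not taken from the catalog code.

```python
from typing import Callable, Dict, List, Sequence

# A rule takes the incoming values per source plus the configured
# source names, and returns the values to store.
Rule = Callable[[Dict[str, list], List[str]], list]

RULES: Dict[str, Rule] = {}  # hypothetical registry: rule name -> handler


def register(name: str) -> Callable[[Rule], Rule]:
    """Decorator that adds a handler to the registry under `name`."""
    def decorator(fn: Rule) -> Rule:
        RULES[name] = fn
        return fn
    return decorator


@register("first")
def first_come(incoming: Dict[str, list], sources: List[str]) -> list:
    # dicts keep insertion order, so the first key is the first arrival
    for values in incoming.values():
        return list(values)
    return []


@register("merge")
def merge_all(incoming: Dict[str, list], sources: List[str]) -> list:
    return [v for values in incoming.values() for v in values]


def apply_rule(name: str, incoming: Dict[str, list],
               sources: Sequence[str] = ()) -> list:
    """Look up the rule by name and apply it."""
    return RULES[name](incoming, list(sources))
```

With this layout, a priority or list-to-merge rule would be added with another `@register(...)` function rather than another branch in a central `if` chain.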

mslw commented 1 year ago

Great overview, very little to add from my side apart from the way I saw it in my mind.

To add a specific use case as an example: we may have a data deposition policy which says that we prefer using CFF files to declare authorship, but allow other common annotation formats. With that, we would want to favor explicit user-provided metadata over git history. So for authors we could specify a priority-list as ["cff", "datacite_gin", "studyminimeta", "metalad_core"], and out of what's present, the thing highest on the list would take precedence. I would appreciate such a feature myself.
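
As a rough illustration of "out of what's present, the thing highest on the list takes precedence", a priority-list lookup could look like the sketch below. The function name `pick_by_priority` and the shape of `available` (source name mapped to its values) are assumptions; the author names are made up.

```python
from typing import Dict, List, Optional


def pick_by_priority(available: Dict[str, list],
                     priority: List[str]) -> Optional[str]:
    """Return the highest-priority source that is actually present."""
    for source in priority:
        if source in available:
            return source
    return None  # none of the prioritised sources provided a value


priority = ["cff", "datacite_gin", "studyminimeta", "metalad_core"]
authors = {
    "metalad_core": ["Jane Doe (from git history)"],
    "datacite_gin": ["Jane Doe", "John Roe"],
}
chosen = pick_by_priority(authors, priority)
```

Here `chosen` is "datacite_gin": no CFF file was provided, so the next entry on the list wins over the git-history-based "metalad_core".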

Worth highlighting that currently the configuration is applied on-add. If we add keeping track of the content source, the behavior wouldn't have to change, and the priority-list could still be evaluated on-add. That is, in the example above, if the catalog has values from "datacite_gin", they would be overwritten by "cff" but not by "studyminimeta" incoming later.
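
A minimal sketch of that on-add check, assuming the catalog records the source of each stored value. The function name `should_overwrite` is hypothetical, and two behaviours are assumptions of the sketch rather than anything decided in this thread: an incoming source that is not on the list never overwrites, and a re-add from the same source replaces the stored value.

```python
from typing import List, Optional


def should_overwrite(stored: Optional[str], incoming: str,
                     priority: List[str]) -> bool:
    """Decide on-add whether `incoming` may replace the stored value."""
    rank = {name: i for i, name in enumerate(priority)}  # lower = better
    if incoming not in rank:
        return False  # assumption: unlisted sources never overwrite
    if stored not in rank:
        return True  # nothing stored yet, or stored from an unlisted source
    # assumption: equal rank (same source) counts as an update
    return rank[incoming] <= rank[stored]


priority = ["cff", "datacite_gin", "studyminimeta", "metalad_core"]
cff_wins = should_overwrite("datacite_gin", "cff", priority)
minimeta_wins = should_overwrite("datacite_gin", "studyminimeta", priority)
```

`cff_wins` is True and `minimeta_wins` is False, matching the example: values from "datacite_gin" give way to "cff" but not to "studyminimeta".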

I also thought that an additional option could be a "list-to-merge", that is "merge if it comes from these, discard otherwise". This could be a way to implement exclusions. But as I come up with additional versions of logic, the use cases might become much less common.
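
The "merge if it comes from these, discard otherwise" idea could be sketched as a simple filter before merging. The function name `merge_allowed` and the de-duplication of values are assumptions for illustration only.

```python
from typing import Dict, Iterable, List


def merge_allowed(incoming: Dict[str, list],
                  allowed: Iterable[str]) -> List:
    """Merge values from allowed sources; discard everything else."""
    allowed_set = set(allowed)
    merged: List = []
    for source, values in incoming.items():
        if source not in allowed_set:
            continue  # exclusion: this source is not on the list-to-merge
        for value in values:
            if value not in merged:  # assumption: drop exact duplicates
                merged.append(value)
    return merged


keywords = merge_allowed(
    {"cff": ["mri", "bids"], "metalad_core": ["git"], "studyminimeta": ["bids", "fmri"]},
    allowed=["cff", "studyminimeta"],
)
```

In this example `keywords` ends up as `["mri", "bids", "fmri"]`; the "metalad_core" value is discarded because that source is not on the list.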

Regarding implementation, the priority-list does not need to be a dictionary. If we keep using a list for the list-of-sources (toggled by user), we can use a tuple as the priority-list. Or, if we want to have multiple meanings for a list of options, I think we can safely reserve some keywords and use the first list item to specify the mode. So, e.g., a list with no keyword, ["foo", "bar", "baz"], would be interpreted as before: three extractors to be shown with user selection. But ["priority", "foo", "bar", "baz"] would be a priority-list, and we could have one for merge, and maybe others. If we think "priority" could clash with extractor names, then maybe start with some special character ("#priority") or make it a tuple.
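
To illustrate the keyword idea, here is a sketch of how a source specification could be parsed into a mode plus a list of sources. The function name `parse_source_spec`, the mode labels, and the "#merge" keyword are assumptions; only "#priority" and the tuple variant come from the comment above.

```python
from typing import List, Sequence, Tuple

# Hypothetical reserved keywords; the "#" prefix avoids clashes
# with extractor names.
RESERVED = {"#priority": "priority", "#merge": "merge"}


def parse_source_spec(spec: Sequence[str]) -> Tuple[str, List[str]]:
    """Split a configured list into (mode, sources)."""
    if isinstance(spec, tuple):
        return "priority", list(spec)  # tuple variant: always a priority-list
    if spec and spec[0] in RESERVED:
        return RESERVED[spec[0]], list(spec[1:])  # keyword picks the mode
    return "select", list(spec)  # no keyword: user selection, as before


plain = parse_source_spec(["foo", "bar", "baz"])
prioritised = parse_source_spec(["#priority", "foo", "bar", "baz"])
```

`plain` is `("select", ["foo", "bar", "baz"])` and `prioritised` is `("priority", ["foo", "bar", "baz"])`. One thing to keep in mind with the tuple variant: a config loaded from a JSON file cannot carry tuples, because JSON only has arrays, so the keyword form may be the more portable of the two.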

jsheunis commented 1 year ago

Thanks!

A summary of my current thoughts:

jsheunis commented 1 year ago

Closed by https://github.com/datalad/datalad-catalog/pull/237