ENH: allow config to specify priority of sources

jsheunis commented 1 year ago

Should be implemented in parallel with https://github.com/datalad/datalad-catalog/issues/82.

The current decision tree for metadata source specification:

If a single source is specified in config, display that source if provided in incoming metadata; if not provided, display on a first come first served basis (until single source is provided).
If a list of sources is specified, display them all as options (with first arbitrarily being default) and let user select which one to render in browser.
If merge, merge all incoming sources.
If nothing specified in config, first come first served.
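
For illustration, the four branches above could be sketched roughly as follows. This is not the catalog's actual code: the function name `select_for_display`, the `(source, value)` arrival list, and the fallback to the next listed source when the first one is absent are all assumptions made for the example.

```python
def select_for_display(spec, arrivals):
    """Pick what to display for one metadata field (hypothetical helper).

    spec:     None, "merge", a single source name, or a list of source names
    arrivals: list of (source, value) pairs in the order they were added
    """
    if not arrivals:
        return None
    if spec is None:
        # Nothing specified in config: first come, first served.
        return arrivals[0][1]
    if spec == "merge":
        # Merge all incoming sources.
        return [value for _, value in arrivals]
    if isinstance(spec, str):
        # Single source: use it if present, else first come, first served.
        for source, value in arrivals:
            if source == spec:
                return value
        return arrivals[0][1]
    # List of sources: expose all of them as options, first listed is default.
    options = {source: value for source, value in arrivals if source in spec}
    default = next((source for source in spec if source in options), None)
    return {"options": options, "default": default}


arrivals = [("metalad_core", "A"), ("metalad_studyminimeta", "B")]
picked = select_for_display("metalad_studyminimeta", arrivals)
```

Here `picked` is the value from `metalad_studyminimeta`, even though `metalad_core` arrived first.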

Ideally, the list of sources should also be able to connect a priority number to each element in the list (i.e. switch to dictionary). And the catalog metadata should keep track of which content comes from which source, in order to allow updates to be readjusted accordingly.

@mslw please add anything I'm missing.

jsheunis commented 1 year ago

I'm thinking again about:

If a list of sources is specified, display them all as options (with first arbitrarily being default) and let user select which one to render in browser.

On second thought, I actually don't think it's that smart to leave the display choice up to the user. If we think about the likely average user of a catalog, they will probably have zero context for where a particular metadata item in a dataset comes from, or how it was aggregated from various sources. They just want to see the available metadata. So the UI options might be confusing.

Perhaps there is room for another item in the decision tree though: a list of sources to merge, i.e. merge all incoming sources if they are in the provided list; if not, ignore.

jsheunis commented 1 year ago

Might make sense to find a generalised way of dealing with these rules, so that adding future rules won't require extensive refactoring.

mslw commented 1 year ago

Great overview, very little to add from my side apart from the way I saw it in my mind.

To add a specific use case as an example: we may have a data deposition policy which says that we prefer using CFF files to declare authorship, but allow other common annotation formats. With that, we would want to favor explicit user-provided metadata over git history. So for authors we could specify a priority-list as ["cff", "datacite_gin", "studyminimeta", "metalad_core"], and out of what's present, the thing highest on the list would take precedence. I would appreciate such a feature myself.

Worth highlighting that currently the configuration is applied on-add. If we add keeping track of the content source, the behavior wouldn't have to change, and the priority-list could still be evaluated on-add. That is, in the example above, if the catalog has values from "datacite_gin", they would be overwritten by "cff" but not by "studyminimeta" incoming later.
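
A minimal sketch of how such an on-add evaluation could look, assuming the record stores the source next to each value. The helper names `rank` and `add_with_priority` and the record layout are hypothetical, and treating unlisted sources as lowest priority is an assumption, not something decided in this thread.

```python
PRIORITY = ["cff", "datacite_gin", "studyminimeta", "metalad_core"]


def rank(source, priority):
    """Lower rank wins; unlisted sources rank last (an assumption)."""
    return priority.index(source) if source in priority else len(priority)


def add_with_priority(record, field, source, value, priority=PRIORITY):
    """Store value only if source ranks at least as high as the stored one.

    record maps field -> {"value": ..., "source": ...} (hypothetical layout).
    Returns True if the record was updated.
    """
    current = record.get(field)
    if current is None or rank(source, priority) <= rank(current["source"], priority):
        record[field] = {"value": value, "source": source}
        return True
    return False


record = {}
add_with_priority(record, "authors", "datacite_gin", ["A. Author"])
add_with_priority(record, "authors", "studyminimeta", ["B. Author"])  # ignored
add_with_priority(record, "authors", "cff", ["C. Author"])            # overwrites
```

Using `<=` rather than `<` lets a later update from the same source replace its own earlier value.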

I also thought that an additional option could be a "list-to-merge", that is "merge if it comes from these, discard otherwise". This could be a way to implement exclusions. But as I come up with additional versions of logic, the use cases might become much less common.
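
As a rough sketch of the "list-to-merge" idea, with a hypothetical helper name and a plain list as the stored field value (both assumptions for the example):

```python
def merge_from_allowed(record, field, source, values, allowed):
    """Merge values into record[field] only if source is in the allowed list.

    Returns True if the incoming values were accepted, False if discarded.
    """
    if source not in allowed:
        return False  # discard: source is not on the list-to-merge
    merged = record.setdefault(field, [])
    for value in values:
        if value not in merged:  # avoid duplicating identical entries
            merged.append(value)
    return True


record = {}
allowed = ["cff", "datacite_gin"]
merge_from_allowed(record, "keywords", "cff", ["fmri", "bids"], allowed)
merge_from_allowed(record, "keywords", "metalad_core", ["git"], allowed)  # discarded
merge_from_allowed(record, "keywords", "datacite_gin", ["bids", "mri"], allowed)
```

Excluding an extractor then amounts to leaving it off the list.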

Regarding implementation, the priority-list does not need to be a dictionary. If we keep using a list for the list of sources (toggled by the user), we can use a tuple as the priority-list. Or, if we want to have multiple meanings for a list of options, I think we can safely reserve some keywords and use the first list item to specify the mode. So, e.g., with no keyword, ["foo", "bar", "baz"] would be interpreted as before: three extractors to be shown with user selection. But ["priority", "foo", "bar", "baz"] would be a priority-list, and we could have one for merge, and maybe others. If we think "priority" could clash with extractor names, then maybe start with some special character ("#priority") or make it a tuple.
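
To make the keyword idea concrete, here is a small sketch of how a source spec could be parsed into a mode plus a list of sources. The function name, the "#merge" keyword, and the "select" mode label are assumptions for illustration; only "#priority" and the tuple variant come from the comment above.

```python
# Reserved first items and the mode they select (hypothetical set).
RESERVED_MODES = {"#priority": "priority", "#merge": "merge"}


def parse_source_spec(spec):
    """Return (mode, sources) for a list or tuple of extractor names."""
    if isinstance(spec, tuple):
        # Tuple variant: a tuple is always read as a priority-list.
        return "priority", list(spec)
    if spec and spec[0] in RESERVED_MODES:
        # Keyword variant: first item names the mode, the rest are sources.
        return RESERVED_MODES[spec[0]], list(spec[1:])
    # No keyword: interpreted as before, sources offered for user selection.
    return "select", list(spec)


mode, sources = parse_source_spec(["#priority", "foo", "bar", "baz"])
```

One thing to keep in mind with the tuple variant: a config loaded from JSON has no tuple type, so the keyword variant may be easier to carry through a config file.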

jsheunis commented 1 year ago

Thanks!

A summary of my current thoughts:

a config operates on the dataset level (currently catalog level)
config is applied on metadata add (through CLI or Python API) and not interpreted in real-time in the JS client

introduce more generalised rules for metadata updates; perhaps something like a dict per metadata field, where a 'source' key will provide a list of relevant metadata sources, and a 'rule' key will specify how the list of sources should be interpreted. E.g.:

config = {
    # ...
    "property_sources": {
        "dataset": {
            # ...
            "description": {
                "rule": "single",
                "source": ["metalad_studyminimeta"]
            },
            "authors": {
                "rule": "merge",
                "source": ["metalad_studyminimeta", "bids_dataset", "datacite_gin"]
            },
            "keywords": {
                "rule": "merge",
                "source": ["all"]
            },
            # ...
        },
        "file": {}
    }
    # ...
}

the source that supplied metadata for a specific field will be tracked in the existing extractors_used field on the dataset or file metadata record (field to be renamed to something like metadata_sources to make it more general). The alternative would be to track the source of each field in the metadata record within the field itself, which is basically the same as collecting provenance for each field of each record. This would be a great feature, but it is unnecessary for any current or foreseen use case. It would also complicate the schema and require extensive changes to it.
the rendered catalog will not give the user the ability to choose which content to display based on source (if this is desirable for a specific use case, it could be revisited)
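
A rough sketch of how such rules could be applied on add, with rule handlers kept in a registry so that new rules can be added without touching the dispatch logic. All names here (`RULES`, `rule`, `apply_on_add`) are hypothetical, and the per-field layout of `metadata_sources` is an assumption; this is not the implementation from the linked PR.

```python
RULES = {}  # rule name -> handler (hypothetical registry)


def rule(name):
    """Register a handler under a rule name."""
    def register(func):
        RULES[name] = func
        return func
    return register


@rule("single")
def _single(current, incoming, tracked, source):
    # Replace the value; only the latest source is recorded.
    return incoming, [source]


@rule("merge")
def _merge(current, incoming, tracked, source):
    # Append new items; record every contributing source once.
    merged = list(current or [])
    for item in incoming:
        if item not in merged:
            merged.append(item)
    sources = tracked if source in tracked else tracked + [source]
    return merged, sources


def apply_on_add(record, field, source, value, field_config):
    """Apply the configured rule for one field; return True if accepted."""
    spec = field_config.get(field)
    if spec is None:
        return False
    if "all" not in spec["source"] and source not in spec["source"]:
        return False
    tracking = record.setdefault("metadata_sources", {})
    new_value, new_sources = RULES[spec["rule"]](
        record.get(field), value, tracking.get(field, []), source
    )
    record[field] = new_value
    tracking[field] = new_sources
    return True


dataset_config = {
    "description": {"rule": "single", "source": ["metalad_studyminimeta"]},
    "keywords": {"rule": "merge", "source": ["all"]},
}
record = {}
apply_on_add(record, "keywords", "bids_dataset", ["fmri"], dataset_config)
apply_on_add(record, "description", "datacite_gin", "ignored", dataset_config)
```

A new rule would then only need another decorated handler, which is one way to avoid extensive refactoring later.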

jsheunis commented 1 year ago

Closed by https://github.com/datalad/datalad-catalog/pull/237

ENH: allow config to specify priority of sources #233