Closed jsheunis closed 1 year ago
I'm thinking again about:
If a list of sources is specified, display them all as options (with first arbitrarily being default) and let user select which one to render in browser.
On second thought, I actually don't think it's that smart to leave the display choice up to the user. If we think about the likely average user of a catalog, they will probably have zero context for where a particular metadata item in a dataset comes from, or how it was aggregated from various sources. They just want to see the available metadata. So the UI options might be confusing.
Perhaps there is room for another item in the decision tree though: a list of sources to merge, i.e. merge all incoming sources if they are in the provided list; if not, ignore.
Might make sense to find a generalised way of dealing with these rules, so that adding future rules won't require extensive refactoring.
Great overview, very little to add from my side apart from the way I saw it in my mind.
To add a specific use case as an example: we may have a data deposition policy which says that we prefer using CFF files to declare authorship, but allow other common annotation formats. With that, we would want to favor explicit user-provided metadata over git history. So for authors we could specify a priority-list as ["cff", "datacite_gin", "studyminimeta", "metalad_core"], and out of what's present, the thing highest on the list would take precedence. I would appreciate such a feature myself.
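A minimal sketch of how such a priority-list could be resolved, assuming the candidate values are available as a mapping from source name to value. The function name `pick_by_priority` and the sample author values are hypothetical and not part of datalad-catalog:

```python
# Hypothetical helper: pick the value from the highest-priority source present.
AUTHOR_PRIORITY = ["cff", "datacite_gin", "studyminimeta", "metalad_core"]


def pick_by_priority(available, priority):
    """Return (source, value) for the first source in `priority` that is
    present in `available`, or None if none of them is present."""
    for source in priority:
        if source in available:
            return source, available[source]
    return None


# Illustrative data only: two sources supplied author metadata.
available = {
    "metalad_core": ["author from git history"],
    "datacite_gin": ["author from datacite.yml"],
}
chosen = pick_by_priority(available, AUTHOR_PRIORITY)
print(chosen)
```

Here "datacite_gin" wins because "cff" is absent and "datacite_gin" is listed before "metalad_core".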
Worth highlighting that currently the configuration is applied on-add. If we add keeping track of the content source, the behavior wouldn't have to change, and the priority-list could still be evaluated on-add. That is, in the example above, if the catalog has values from "datacite_gin", they would be overwritten by "cff" but not by "studyminimeta" incoming later.
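The on-add evaluation could look roughly like the sketch below. It assumes the catalog stores the source of the current value alongside it; `should_overwrite` is a hypothetical name, and ranking unlisted sources last is an assumption rather than agreed behavior:

```python
# Hypothetical on-add check: does the incoming source outrank the stored one?
PRIORITY = ["cff", "datacite_gin", "studyminimeta", "metalad_core"]


def should_overwrite(current_source, incoming_source, priority):
    """Return True if the incoming value should replace the stored value."""

    def rank(source):
        # Assumption: sources missing from the list rank below all listed ones.
        return priority.index(source) if source in priority else len(priority)

    if current_source is None:
        # Nothing stored yet, so the incoming value is accepted.
        return True
    return rank(incoming_source) < rank(current_source)


# Catalog currently holds a value from "datacite_gin".
overwritten_by_cff = should_overwrite("datacite_gin", "cff", PRIORITY)
overwritten_by_minimeta = should_overwrite("datacite_gin", "studyminimeta", PRIORITY)
```

With equal ranks the stored value is kept, which preserves the first come first served behavior for sources outside the list.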
I also thought that an additional option could be a "list-to-merge", that is "merge if it comes from these, discard otherwise". This could be a way to implement exclusions. But as I come up with additional versions of logic, the use cases might become much less common.
Regarding implementation, the priority-list does not need to be a dictionary. If we keep using list for list-of-sources-(toggled by user), we can use tuple as priority-list. Or, if we want to have multiple meanings for a list of options, I think we can safely reserve some keywords and use first list item to specify mode. So, e.g. no keyword ["foo", "bar", "baz"] would be interpreted as before, three extractors to be shown with user selection. But ["priority", "foo", "bar", "baz"] would be a priority-list, and we could have one for merge, and maybe others. If we think "priority" could clash with extractor names, then maybe start with some special character ("#priority") or make it a tuple.
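A sketch of the reserved-keyword variant, using the "#"-prefixed form to avoid clashes with extractor names. The function name `parse_source_spec`, the "#merge" keyword, and the mode label "select" for the plain list are assumptions made for illustration:

```python
# Hypothetical parser: the first list item may be a reserved mode keyword.
RESERVED_MODES = {"#priority", "#merge"}


def parse_source_spec(spec):
    """Split a source specification into (mode, sources).

    A list without a reserved keyword keeps its earlier meaning:
    all sources are offered for user selection.
    """
    if spec and spec[0] in RESERVED_MODES:
        return spec[0].lstrip("#"), list(spec[1:])
    return "select", list(spec)


plain = parse_source_spec(["foo", "bar", "baz"])
prioritized = parse_source_spec(["#priority", "foo", "bar", "baz"])
print(plain)
print(prioritized)
```

Because only the first item is inspected, an extractor that happens to be called "priority" (without the "#") is still treated as an ordinary source.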
Thanks!
A summary of my current thoughts:
config = {
    ...
    "property_sources": {
        "dataset": {
            ...
            "description": {
                "rule": "single",
                "source": ["metalad_studyminimeta"]
            },
            "authors": {
                "rule": "merge",
                "source": ["metalad_studyminimeta", "bids_dataset", "datacite_gin"]
            },
            "keywords": {
                "rule": "merge",
                "source": ["all"]
            },
            ...
        },
        "file": {}
    }
    ...
}
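To illustrate how a "merge" rule with a source list (or the "all" wildcard) might be applied, here is a rough sketch. It assumes each property value is a list and that incoming metadata is grouped by source; `merge_allowed`, the trimmed-down `example_config`, and the sample values are hypothetical:

```python
# Trimmed-down copy of the config idea above, for illustration only.
example_config = {
    "property_sources": {
        "dataset": {
            "authors": {"rule": "merge", "source": ["bids_dataset", "datacite_gin"]},
            "keywords": {"rule": "merge", "source": ["all"]},
        }
    }
}


def merge_allowed(rule, incoming):
    """Merge list values from permitted sources, dropping duplicates.

    `incoming` maps source name -> list of values. Sources not named in
    the rule are ignored unless the rule lists "all".
    """
    allowed = rule["source"]
    merged = []
    for source, values in incoming.items():
        if "all" not in allowed and source not in allowed:
            continue  # source is not on the list-to-merge
        for value in values:
            if value not in merged:
                merged.append(value)
    return merged


incoming = {
    "datacite_gin": ["a", "b"],
    "metalad_core": ["c"],
    "bids_dataset": ["b", "d"],
}
rules = example_config["property_sources"]["dataset"]
merged_authors = merge_allowed(rules["authors"], incoming)
merged_keywords = merge_allowed(rules["keywords"], incoming)
```

The same function covers both the restricted list and the "all" case, so exclusions fall out of the list-to-merge without an extra rule.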
Sources would be tracked via the extractors_used field on the dataset or file metadata record (the field to be renamed to something like metadata_sources to make it more general). The alternative would be to track the supplied metadata for each field in the metadata record in the field itself, which is basically the same as collecting provenance for each field of each record. This would be a great feature but is unnecessary for any current or foreseen use case. It would also complicate the schema and require extensive changes to it.
Should be implemented in parallel with https://github.com/datalad/datalad-catalog/issues/82.
The current decision tree for metadata source specification:
- If a single source is specified in config, display that source if provided in incoming metadata; if not provided, display on a first come first served basis (until single source is provided).
- If a list of sources is specified, display them all as options (with first arbitrarily being default) and let user select which one to render in browser.
- If merge is specified, merge all incoming sources.
- If nothing is specified in config, first come first served.

Ideally, the list of sources should also be able to connect a priority number to each element in the list (i.e. switch to dictionary). And the catalog metadata should keep track of which content comes from which source, in order to allow updates to be readjusted accordingly.

@mslw please add anything I'm missing.
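The decision tree above could be sketched as a single on-add function. This is only an illustration under assumptions: the rule labels ("single", "list", "merge", None), the function name `apply_rule`, and the representation of stored content as a mapping from source to value are all hypothetical:

```python
# Hypothetical on-add handling of one property, following the decision tree.
def apply_rule(rule, sources, stored, incoming_source, incoming_value):
    """Return the updated mapping source -> value kept for a property."""
    stored = dict(stored)  # work on a copy, leave the caller's data untouched
    if rule == "merge":
        # Merge all incoming sources.
        stored[incoming_source] = incoming_value
    elif rule == "list":
        # Keep every listed source so the user can choose in the browser.
        if incoming_source in sources:
            stored[incoming_source] = incoming_value
    elif rule == "single":
        wanted = sources[0]
        if incoming_source == wanted:
            # The configured source replaces any first-come placeholder.
            stored = {wanted: incoming_value}
        elif not stored:
            # First come first served until the configured source arrives.
            stored[incoming_source] = incoming_value
    else:
        # Nothing specified: first come first served.
        if not stored:
            stored[incoming_source] = incoming_value
    return stored


step1 = apply_rule("single", ["a"], {}, "b", 1)
step2 = apply_rule("single", ["a"], step1, "a", 2)
step3 = apply_rule("single", ["a"], step2, "c", 3)
```

In the "single" walk-through, the value from "b" is shown first, is replaced once "a" arrives, and later input from "c" is ignored.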