Closed jsheunis closed 1 year ago
All tests now run successfully except for test_workflows.py
, which requires an update to the translation maps between extracted metalad
output and the catalog schema. This is because the catalog schema was updated based on the updated config functionality and metadata source structure. The changes to the mapping could be done in parallel with https://github.com/datalad/datalad-catalog/issues/224, since both relate to the use of jq
. This would probably be sensible to separate into a new PR.
However, this PR is still ready for reviewing all functionality except workflows in the meantime.
I think it's a very nice overhaul. I like the suggested way of handling configuration, and I totally appreciate the explanation above. In addition to the explanation, the code for updating entries is clearly structured. And there are great tests.
For now, I read the descriptions and browsed the changed code. I have yet to try creating/adding entries with the new rules, which would help me better understand the new behavior (need to craft some metadata and configs first).
Here are my general questions and comments:
catalog/config.json
and config/config.json
files?config/config.json
has e.g. "authors": {"rule": "merge", "source": ["metalad_studyminimeta"]}
. Does that mean that by default authors would be displayed only from studyminimeta (so if only entries coming from metalad core were added, nothing would be shown)? I think broader defaults would be better.{"rule": "merge", "source": ["xyz"]}
and {"rule": "single", "source": ["xyz"]}
? I think I have seen both.catalog create
, that partial config would be saved in the catalog. Next, add operations would consider that file and no longer look at defaults. Meaning that default for fields not defined would be essentially none. Have you considered doing a merge (use user provided default if given, package default otherwise)?catalog_action
parameter. Can a catalog instance be reused if the action needs to be specified upfront (e.g. c = WebCatalog(...); c.create(...); c.add(...)
)?I wrote down a few code-specific suggestions for code, I'll add them later today or tomorrow.
Thanks @mslw !
What is the difference between catalog/config.json and config/config.json files?
The catalog
directory is a self-sufficient catalog that is hosted via github pages and renders the repo's demo catalog. It therefore serves as a demo, while simultaneously also containing template files that use copied when generating a new catalog with the package. The catalog/config.json
file, however, is specific to the demo catalog, and not intended to be a general template. That is why config/config.json
exists, to be the template/default catalog-level config for any newly generated catalog.
I see that config/config.json has e.g. "authors": {"rule": "merge", "source": ["metalad_studyminimeta"]}. Does that mean that by default authors would be displayed only from studyminimeta (so if only entries coming from metalad core were added, nothing would be shown)? I think broader defaults would be better.
Your interpretation is correct, and I agree we should take another look at the default config. What do you think is a sensible default for most fields? I was thinking first-come-first-served
for fields that receive single strings/objects, and merge - any
for lists. Haven't tried it yet.
Is there a difference between {"rule": "merge", "source": ["xyz"]} and {"rule": "single", "source": ["xyz"]}? I think I have seen both.
No, they will lead to the same output; although the former allows for adding sources without changing the rule, but this is not necessarily a huge benefit.
It seems that catalog schema needs to be updated considering that extractors_used now became metadata_sources.
Yes! Thanks for checking that. Adding as a TODO.
Which would also mean that existing translators need to be updated, so that they create the metadata_sources field?
Indeed. Another TODO. Mentioned here: https://github.com/datalad/datalad-catalog/pull/237#issuecomment-1381606971
If I understood correctly, if only partial config (i.e. a config file with rules defined for a subset of available fields, e.g. only a rule for authors) is given to catalog create, that partial config would be saved in the catalog. Next, add operations would consider that file and no longer look at defaults. Meaning that default for fields not defined would be essentially none. Have you considered doing a merge (use user provided default if given, package default otherwise)?
I have considered yes, but nothing implemented. I think it would be useful to have some explicit way of dealing with new/updated configs. Might need a separate catalog config
subcommand if it comes to that, or at least another flag for add
that specifies how to handle existing configs if new ones are provided.
The init method for the WebCatalog now has catalog_action parameter. Can a catalog instance be reused if the action needs to be specified upfront (e.g. c = WebCatalog(...); c.create(...); c.add(...))?
Yes and no. As is, I am pretty sure (not 100%) it can be reused without trouble for any operations that do not need the config. it cannot be reused without implications when doing e.g. an add
when the instance was instantiated with action create
. And I think this is a problem. I need to have another look at that code (TODO) to see how else this action could be specified, since it has knock-on effects that might require refactoring code on other levels. Comments welcome.
Thanks for the responses, they helped a lot (about the catalog directory, I should have realised :D).
Your interpretation is correct, and I agree we should take another look at the default config. What do you think is a sensible default for most fields? I was thinking
first-come-first-served
for fields that receive single strings/objects, andmerge - any
for lists. Haven't tried it yet.
I would agree. And for subjects, it used to be merge - any
. I know I didn't like that very much for myself (dealing with inconsistent git identities), but I still think that's the best default now that it can be changed easily. Keeping metalad_core
as source for some fields makes sense to me, otherwise I don't see reasons to be restrictive.
LOL looks like one of the suggestions broke something. I'll fix it in a next commit.
I would agree. And for subjects, it used to be merge - any. I know I didn't like that very much for myself (dealing with inconsistent git identities), but I still think that's the best default now that it can be changed easily.
@mslw with subjects
do you mean authors
?
Yes, authors
:face_with_spiral_eyes: :rofl:
LOL looks like one of the suggestions broke something. I'll fix it in a next commit.
My bad, dae8395 introduced an unnecessary :
at the end of line 150 of webcatalog.py
.
As we discussed today with @jsheunis, I tried to do some user testing (using this PR branch). I tried datalad catalog add
to a catalog generated with default settings. I used metadata extracted with metalad_core
and translated by my own script adjusted to new kwywords (so don't take the translation correctness as granted!). This is the json line from the extracted & translated metadata that I used:
{"type": "dataset", "dataset_id": "e132ac40-30c5-457d-8c7c-0dcbdbb95d9a", "dataset_version": "79cfdaa67c5872a44a79f54754a9ea3e68db5713", "name": "", "url": [], "authors": [{"name": "Michał Szczepanik", "email": "m.szczepanik@fz-juelich.de"}], "subdatasets": [{"dataset_id": "5a11d594-243e-4ca3-819b-54d10c8d6b3f", "dataset_version": "31f151d9b6f40da4d658e4096f011558bb889618", "dataset_path": "A02", "dirs_from_path": []}, {"dataset_id": "972860f9-75b9-4ecc-b546-99e1d6aad5f9", "dataset_version": "7a2afe4b9d956a5d2db5547f2768c64f746bbebe", "dataset_path": "B04", "dirs_from_path": []}, {"dataset_id": "7411dec6-abc4-45a8-890d-0451af8c6297", "dataset_version": "7d7aa226b68da252c199e938fbde3d9923a42bac", "dataset_path": "B05", "dirs_from_path": []}, {"dataset_id": "2dc82a59-82c1-444f-a440-db4cf1aca056", "dataset_version": "a68ffd915208074b3b121a982eabb2d3563fa9a2", "dataset_path": "C01", "dirs_from_path": []}, {"dataset_id": "441ef45f-c525-468e-aeb8-5cf3b7113b16", "dataset_version": "2330c1bdf6b4298f762b13cf4689bdbff3a8684b", "dataset_path": "C05", "dirs_from_path": []}, {"dataset_id": "80cc995e-7d8e-42c7-93d3-9b51ba1d77ed", "dataset_version": "dfc3f9aec72a09b8b17d80d205bd31ad98604456", "dataset_path": "INF", "dirs_from_path": []}, {"dataset_id": "87a777cc-eae0-4446-9178-4440a4eca038", "dataset_version": "ba1f2520c8a76f80238413a4a0eb76fbfbd5c7e6", "dataset_path": "Z03", "dirs_from_path": []}], "metadata_sources": {"key_source_map": {}, "sources": [{"source_name": "metalad_core", "source_version": "1", "extraction_parameter": {}, "extraction_time": 1674227278.5175016, "agent_name": "Michał Szczepanik", "agent_email": "m.szczepanik@fz-juelich.de"}]}}
I got an error which I was able to pinpoint to the next()
statement in add_metadata_source
:
https://github.com/datalad/datalad-catalog/blob/87948d8c1b8752ccfd5741637252e11e775d1051/datalad_catalog/webcatalog.py#L630-L657
This is the debugger output at a breakpoint placed on line 643:
> /home/mszczepanik/Documents/datalad-catalog/datalad_catalog/webcatalog.py(644)add_metadata_source()
-> extractor_found = next(
(Pdb) pp source_dict
{'agent_email': 'm.szczepanik@fz-juelich.de',
'agent_name': 'Michał Szczepanik',
'extraction_parameter': {},
'extraction_time': 1674227278.5175016,
'source_name': 'metalad_core',
'source_version': '1'}
(Pdb) pp self.metadata_sources
{'key_source_map': {'subdatasets': [None]}, 'sources': [{}]}
(Pdb) c
[ERROR ] 'source_name'
Note the contents of the key_source_map
above. I did not figure out what exactly leads to the error.
Random thought just now, not sure if it makes sense... the default config would only use metalad_studyminimeta
for authors - could that be a clue as to why we have None
above?
@mslw I tried to reproduce the error on my side, but failed.
git reset --hard c4636481e269583e2b1189e97a9edaf5c3691a66
I did this because otherwise I received an error when trying to create a catalog: 'WebCatalog' object has no attribute 'config' (AttributeError)
, so it seemed that one of the suggestions also broke something else (other than the extra :
mentioned above).
After the reset, and after a new pip install -e .
just to be sure, the catalog was created without an issue.datalad catalog add
. this worked fine:
catalog_add(ok): /Users/jsheunis/Documents/psyinf/Data/test_catalog [Metadata items successfully added to catalog]
This is the partial content of the metadata file in the catalog after the add operation:
"metadata_sources": {
"key_source_map": {
"type": [
"metalad_core"
],
"dataset_id": [
"metalad_core"
],
"dataset_version": [
"metalad_core"
],
"subdatasets": [
"metalad_core"
]
},
"sources": [
{
"source_name": "metalad_core",
"source_version": "1",
"extraction_parameter": {},
"extraction_time": 1674227278.5175016,
"agent_name": "Michał Szczepanik",
"agent_email": "m.szczepanik@fz-juelich.de"
}
]
}```
So I think we should probably have another session to debug.
Oof, that was a false alarm. Thanks for checking and for persevering with me. Long story short:
self.config
instead of self.catalog_config
main
branch, no wonder it had config issues... Worked well when I repeated everything with this PR.Sorry about that!
ALL GREEN, YAY!!!
This PR aims to address the following issues:
It results in the following changes:
datalad catalog create -c mycatalog -y catalog_config.json
. This created a config file atmycatalog/config.json
.datalad catalog add -c mycatalog -m metadata.jsonl -y dataset_config.json
. This created a config file atmycatalog/metadata/<dataset_id>/<dataset_version>/config.json
.metadata.jsonl
file (i.e. a json line)config.json
will NOT be created for the specific dataset being processed, as this will be unnecessary duplication and disk/memory usage.The structure of
config.json
has changed substantially. Here's an example:dataset
dictionary can be populated from specified sources based on specified rulesrule
can be:single
(only one source),merge
(merge sources together),priority
(select only one source from a list of sources, prioritised based on the order in which they appear in the list of sources).first-come-first-served
source
is generally a list of strings, with a single element (usually related to thesingle
rule) or multiple elements (related to thepriority
ormerge
rules).source
can also beany
, meaning that any sources are allowed.name
has a current value supplied bysource_B
, while the config specifies that thename
field can be populated by a number of sources in order of priority (["source_A", "source_B", "source_C"]
), and then a catalog update is made that supplies a new value forname
fromsource_A
: this should result in the new value forname
being populated fromsource_A
, and in this source information being tracked. This is done as follows in the metadata dictionary for the dataset in the catalog: