datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

ENH: refactoring config, extractors_used #237

Closed: jsheunis closed this 1 year ago

jsheunis commented 1 year ago

This PR addresses a set of related open issues and results in the following changes:

  1. A catalog-level config can be specified with `create`, e.g. `datalad catalog create -c mycatalog -y catalog_config.json`. This creates a config file at `mycatalog/config.json`.
    • The catalog-level config specifies things like the catalog name, the path to a logo file, and the default rules for rendering metadata on the dataset level (e.g. when to merge fields derived from multiple sources, which ones to show or ignore, etc.)
    • The catalog-level config serves as the default for the dataset-level config. If no config is specified on the dataset level, the dataset rendering rules are inherited from the catalog level.
    • If the user does not specify a config on the catalog level, a default built-in config file is used.
  2. A dataset-level config can be specified with `add`, e.g. `datalad catalog add -c mycatalog -m metadata.jsonl -y dataset_config.json`. This creates a config file at `mycatalog/metadata/<dataset_id>/<dataset_version>/config.json`.
    • This overrides the catalog-level default for the specific dataset.
    • This file will be created for all datasets that have metadata items in the `metadata.jsonl` file (one JSON object per line).
    • If no config is specified on the dataset level, a duplicate of the catalog-level `config.json` will NOT be created for the specific dataset being processed, since that would be unnecessary duplication and disk/memory usage.
  3. The structure of `config.json` has changed substantially. Here's an example:

```
config = {
    ...
    "property_sources": {
        "dataset": {
            ...
            "description": {
                "rule": "single",
                "source": ["metalad_studyminimeta"]
            },
            "authors": {
                "rule": "priority",
                "source": ["metalad_studyminimeta", "bids_dataset", "datacite_gin"]
            },
            "keywords": {
                "rule": "merge",
                "source": "any"
            },
            "publications": {
                "rule": "merge",
                "source": ["metalad_studyminimeta", "bids_dataset"]
            },
            ...
        },
        "file": {}
    },
    ...
}
```
    • The value of each key in the dataset dictionary can be populated from the specified sources according to the specified rule.
    • A rule can be: single (use only one source), merge (merge values from all listed sources), or priority (select exactly one source from a list of sources, prioritised by the order in which they appear in the list).
    • If no rule is specified, the default rule is first-come-first-served.
    • A source is generally a list of strings, with a single element (usually paired with the single rule) or multiple elements (paired with the priority or merge rules).
    • The source can also be any, meaning values from any source are accepted.
  4. When metadata from a specific source is added to a catalog, the config is loaded (from the file specified on the dataset level, or inherited from the catalog level). This provides the specification (rules and sources) according to which all key-value pairs of the incoming dictionary are evaluated and populated into the catalog metadata.
  5. The catalog metadata for a dataset keeps track of which sources supplied the values for which keys in the metadata dictionary. This allows metadata to be updated according to the config-specified rules and sources. E.g. suppose the dataset name currently has a value supplied by source_B, while the config specifies that the name field can be populated by a number of sources in order of priority (["source_A", "source_B", "source_C"]). If a catalog update then supplies a new value for name from source_A, the new value for name should be populated from source_A, and this source information should be tracked. This is done as follows in the metadata dictionary for the dataset in the catalog:
    
```
{
  "type": "dataset",
  "dataset_id": "....",
  "name": "value_from_source_B",
  ...
  "metadata_sources": {
    "key_source_map": {
      "type": ["metalad_core"],
      "dataset_id": ["metalad_core"],
      "name": ["source_B"],
      ...
    },
    "sources": [
      {
        "source_name": "metalad_core",
        "source_version": "0.0.1",
        "source_parameter": {},
        "source_time": 1643901350.65269,
        "agent_name": "Stephan Heunis",
        "agent_email": "s.heunis@fz-juelich.de"
      },
      {
        "source_name": "source_B",
        "source_version": "2",
        "source_parameter": {},
        "source_time": 1643901350.65269,
        "agent_name": "Stephan Heunis",
        "agent_email": "s.heunis@fz-juelich.de"
      }
    ]
  }
}
```

  6. As can be seen in the above dictionary, the structure of `extractors_used` (now `metadata_sources`) has also changed substantially:
    • The naming convention is abstracted, since metadata won't necessarily always originate from an extractor in the datalad/metalad sense.
    • `metadata_sources["sources"]` contains a list of metadata sources (with extra info such as version, agent, etc.) that have provided content for this particular metadata record.
    • `metadata_sources["key_source_map"]` provides a mapping of which metadata sources provided content for which specific keys in the metadata record. This tracking is necessary in order to decide (together with the config-based rules and sources) how to update the record in the future.
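The update logic described in points 4–6 can be sketched roughly as follows. This is a minimal illustration of the rules, not the actual datalad-catalog implementation; the function name and signature are hypothetical:

```python
# Hypothetical sketch of the config-driven update rules described above.
# Given a key's rule and allowed sources from the config, plus the sources
# that already populated that key (from key_source_map), decide whether an
# incoming value from new_source may set or extend the key.

def should_update(rule, allowed_sources, current_sources, new_source):
    # Reject sources not permitted for this key (unless "any" is allowed)
    if allowed_sources != "any" and new_source not in allowed_sources:
        return False
    if rule == "merge":
        # Merge values from all allowed sources
        return True
    if rule == "single":
        # Accept only the first allowed source
        return not current_sources
    if rule == "priority":
        if not current_sources:
            return True
        # Replace only if the new source appears earlier in the priority list
        current_best = min(allowed_sources.index(s)
                           for s in current_sources if s in allowed_sources)
        return allowed_sources.index(new_source) < current_best
    # Default rule: first-come-first-served
    return not current_sources

# The example from point 5: "name" currently comes from source_B, and the
# config lists ["source_A", "source_B", "source_C"] with rule "priority";
# an update from source_A should win:
print(should_update("priority", ["source_A", "source_B", "source_C"],
                    ["source_B"], "source_A"))  # True
```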

During the preparation of this PR, the importance of updating the metadata translation functionality came to light again; as a result, PR https://github.com/datalad/datalad-catalog/pull/246 was merged into this one.

To do:
- [x] Update workflow tests
- [x] Update demo catalog metadata and config
- [x] Update catalog schemas
- [x] Update translators, connected to `jq` dependency
- [x] Reassess / refactor `catalog_action` parameter and related code
jsheunis commented 1 year ago

All tests now run successfully except for test_workflows.py, which requires an update to the translation maps between extracted metalad output and the catalog schema. This is because the catalog schema was updated based on the updated config functionality and metadata source structure. The changes to the mapping could be done in parallel with https://github.com/datalad/datalad-catalog/issues/224, since both relate to the use of jq. This would probably be sensible to separate into a new PR.

However, this PR is still ready for reviewing all functionality except workflows in the meantime.

mslw commented 1 year ago

I think it's a very nice overhaul. I like the suggested way of handling configuration, and I totally appreciate the explanation above. In addition to the explanation, the code for updating entries is clearly structured. And there are great tests.

For now, I read the descriptions and browsed the changed code. I have yet to try creating/adding entries with the new rules, which would help me better understand the new behavior (need to craft some metadata and configs first).

Here are my general questions and comments:

  1. What is the difference between catalog/config.json and config/config.json files?
  2. I see that config/config.json has e.g. "authors": {"rule": "merge", "source": ["metalad_studyminimeta"]}. Does that mean that by default authors would be displayed only from studyminimeta (so if only entries coming from metalad core were added, nothing would be shown)? I think broader defaults would be better.
  3. Is there a difference between {"rule": "merge", "source": ["xyz"]} and {"rule": "single", "source": ["xyz"]}? I think I have seen both.
  4. It seems that catalog schema needs to be updated considering that extractors_used now became metadata_sources.
  5. Which would also mean that existing translators need to be updated, so that they create the metadata_sources field?
  6. If I understood correctly, if only partial config (i.e. a config file with rules defined for a subset of available fields, e.g. only a rule for authors) is given to catalog create, that partial config would be saved in the catalog. Next, add operations would consider that file and no longer look at defaults. Meaning that default for fields not defined would be essentially none. Have you considered doing a merge (use user provided default if given, package default otherwise)?
  7. The init method for the WebCatalog now has catalog_action parameter. Can a catalog instance be reused if the action needs to be specified upfront (e.g. c = WebCatalog(...); c.create(...); c.add(...))?

I wrote down a few code-specific suggestions; I'll add them later today or tomorrow.

jsheunis commented 1 year ago

Thanks @mslw !

What is the difference between catalog/config.json and config/config.json files?

The catalog directory is a self-sufficient catalog that is hosted via GitHub Pages and renders the repo's demo catalog. It therefore serves as a demo, while simultaneously containing template files that are copied when generating a new catalog with the package. The catalog/config.json file, however, is specific to the demo catalog and not intended to be a general template. That is why config/config.json exists: to be the template/default catalog-level config for any newly generated catalog.

I see that config/config.json has e.g. "authors": {"rule": "merge", "source": ["metalad_studyminimeta"]}. Does that mean that by default authors would be displayed only from studyminimeta (so if only entries coming from metalad core were added, nothing would be shown)? I think broader defaults would be better.

Your interpretation is correct, and I agree we should take another look at the default config. What do you think is a sensible default for most fields? I was thinking first-come-first-served for fields that receive single strings/objects, and merge - any for lists. Haven't tried it yet.

Is there a difference between {"rule": "merge", "source": ["xyz"]} and {"rule": "single", "source": ["xyz"]}? I think I have seen both.

No, they lead to the same output, although the former allows adding more sources later without changing the rule; not necessarily a huge benefit.

It seems that catalog schema needs to be updated considering that extractors_used now became metadata_sources.

Yes! Thanks for checking that. Adding as a TODO.

Which would also mean that existing translators need to be updated, so that they create the metadata_sources field?

Indeed. Another TODO. Mentioned here: https://github.com/datalad/datalad-catalog/pull/237#issuecomment-1381606971

If I understood correctly, if only partial config (i.e. a config file with rules defined for a subset of available fields, e.g. only a rule for authors) is given to catalog create, that partial config would be saved in the catalog. Next, add operations would consider that file and no longer look at defaults. Meaning that default for fields not defined would be essentially none. Have you considered doing a merge (use user provided default if given, package default otherwise)?

I have considered yes, but nothing implemented. I think it would be useful to have some explicit way of dealing with new/updated configs. Might need a separate catalog config subcommand if it comes to that, or at least another flag for add that specifies how to handle existing configs if new ones are provided.
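The merge mslw suggests (user-provided value if given, package default otherwise) could amount to a recursive dictionary merge of the partial user config over the built-in defaults. A minimal sketch, assuming illustrative config contents and a hypothetical `merge_config` helper (not current package behavior):

```python
# Sketch: layer a partial user config over the package default, recursing
# into nested dicts so that fields the user did not define keep their
# built-in defaults.

def merge_config(default: dict, user: dict) -> dict:
    merged = dict(default)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# Illustrative built-in default (field names follow the config structure
# shown earlier in this PR; actual defaults may differ):
default = {"property_sources": {"dataset": {
    "authors": {"rule": "priority", "source": ["metalad_studyminimeta"]},
    "keywords": {"rule": "merge", "source": "any"},
}}}

# A partial user config that only overrides the rule for authors:
user = {"property_sources": {"dataset": {
    "authors": {"rule": "merge", "source": "any"},
}}}

merged = merge_config(default, user)
# "authors" now follows the user's rule, while "keywords" (undefined by
# the user) keeps the package default.
```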

The init method for the WebCatalog now has catalog_action parameter. Can a catalog instance be reused if the action needs to be specified upfront (e.g. c = WebCatalog(...); c.create(...); c.add(...))?

Yes and no. As is, I am pretty sure (not 100%) it can be reused without trouble for any operations that do not need the config. It cannot be reused without implications when doing e.g. an `add` if the instance was instantiated with action `create`, and I think this is a problem. I need to have another look at that code (TODO) to see how else this action could be specified, since it has knock-on effects that might require refactoring code on other levels. Comments welcome.

mslw commented 1 year ago

Thanks for the responses, they helped a lot (about the catalog directory, I should have realised :D).

Your interpretation is correct, and I agree we should take another look at the default config. What do you think is a sensible default for most fields? I was thinking first-come-first-served for fields that receive single strings/objects, and merge - any for lists. Haven't tried it yet.

I would agree. And for subjects, it used to be merge - any. I know I didn't like that very much for myself (dealing with inconsistent git identities), but I still think that's the best default now that it can be changed easily. Keeping metalad_core as source for some fields makes sense to me, otherwise I don't see reasons to be restrictive.

jsheunis commented 1 year ago

LOL looks like one of the suggestions broke something. I'll fix it in a next commit.

jsheunis commented 1 year ago

I would agree. And for subjects, it used to be merge - any. I know I didn't like that very much for myself (dealing with inconsistent git identities), but I still think that's the best default now that it can be changed easily.

@mslw with subjects do you mean authors?

mslw commented 1 year ago

Yes, authors :face_with_spiral_eyes: :rofl:

mslw commented 1 year ago

LOL looks like one of the suggestions broke something. I'll fix it in a next commit.

My bad, dae8395 introduced an unnecessary : at the end of line 150 of webcatalog.py.

mslw commented 1 year ago

As we discussed today with @jsheunis, I tried to do some user testing (using this PR branch). I tried `datalad catalog add` to a catalog generated with default settings. I used metadata extracted with metalad_core and translated by my own script adjusted to the new keywords (so don't take the translation correctness for granted!). This is the json line from the extracted & translated metadata that I used:

```
{"type": "dataset", "dataset_id": "e132ac40-30c5-457d-8c7c-0dcbdbb95d9a", "dataset_version": "79cfdaa67c5872a44a79f54754a9ea3e68db5713", "name": "", "url": [], "authors": [{"name": "Michał Szczepanik", "email": "m.szczepanik@fz-juelich.de"}], "subdatasets": [{"dataset_id": "5a11d594-243e-4ca3-819b-54d10c8d6b3f", "dataset_version": "31f151d9b6f40da4d658e4096f011558bb889618", "dataset_path": "A02", "dirs_from_path": []}, {"dataset_id": "972860f9-75b9-4ecc-b546-99e1d6aad5f9", "dataset_version": "7a2afe4b9d956a5d2db5547f2768c64f746bbebe", "dataset_path": "B04", "dirs_from_path": []}, {"dataset_id": "7411dec6-abc4-45a8-890d-0451af8c6297", "dataset_version": "7d7aa226b68da252c199e938fbde3d9923a42bac", "dataset_path": "B05", "dirs_from_path": []}, {"dataset_id": "2dc82a59-82c1-444f-a440-db4cf1aca056", "dataset_version": "a68ffd915208074b3b121a982eabb2d3563fa9a2", "dataset_path": "C01", "dirs_from_path": []}, {"dataset_id": "441ef45f-c525-468e-aeb8-5cf3b7113b16", "dataset_version": "2330c1bdf6b4298f762b13cf4689bdbff3a8684b", "dataset_path": "C05", "dirs_from_path": []}, {"dataset_id": "80cc995e-7d8e-42c7-93d3-9b51ba1d77ed", "dataset_version": "dfc3f9aec72a09b8b17d80d205bd31ad98604456", "dataset_path": "INF", "dirs_from_path": []}, {"dataset_id": "87a777cc-eae0-4446-9178-4440a4eca038", "dataset_version": "ba1f2520c8a76f80238413a4a0eb76fbfbd5c7e6", "dataset_path": "Z03", "dirs_from_path": []}], "metadata_sources": {"key_source_map": {}, "sources": [{"source_name": "metalad_core", "source_version": "1", "extraction_parameter": {}, "extraction_time": 1674227278.5175016, "agent_name": "Michał Szczepanik", "agent_email": "m.szczepanik@fz-juelich.de"}]}}
```

I got an error which I was able to pinpoint to the next() statement in add_metadata_source: https://github.com/datalad/datalad-catalog/blob/87948d8c1b8752ccfd5741637252e11e775d1051/datalad_catalog/webcatalog.py#L630-L657

This is the debugger output at a breakpoint placed on line 643:

```
> /home/mszczepanik/Documents/datalad-catalog/datalad_catalog/webcatalog.py(644)add_metadata_source()
-> extractor_found = next(
(Pdb) pp source_dict
{'agent_email': 'm.szczepanik@fz-juelich.de',
 'agent_name': 'Michał Szczepanik',
 'extraction_parameter': {},
 'extraction_time': 1674227278.5175016,
 'source_name': 'metalad_core',
 'source_version': '1'}
(Pdb) pp self.metadata_sources
{'key_source_map': {'subdatasets': [None]}, 'sources': [{}]}
(Pdb) c
[ERROR  ] 'source_name'
```

Note the contents of the key_source_map above. I did not figure out what exactly leads to the error.

Random thought just now, not sure if it makes sense... the default config would only use metalad_studyminimeta for authors - could that be a clue as to why we have None above?

jsheunis commented 1 year ago

@mslw I tried to reproduce the error on my side, but failed.

  1. First, I reset the state of the branch to the commit before the latest code suggestions were committed: `git reset --hard c4636481e269583e2b1189e97a9edaf5c3691a66`. I did this because I otherwise received an error when trying to create a catalog: `'WebCatalog' object has no attribute 'config'` (AttributeError), so it seems one of the suggestions also broke something else (other than the extra `:` mentioned above). After the reset, and after a new `pip install -e .` just to be sure, the catalog was created without issue.
  2. I put your line of translated metadata into a new file and added it to the catalog with a normal `datalad catalog add`. This worked fine: `catalog_add(ok): /Users/jsheunis/Documents/psyinf/Data/test_catalog [Metadata items successfully added to catalog]`
  3. This is the partial content of the metadata file in the catalog after the add operation:

```
"metadata_sources": {
  "key_source_map": {
    "type": ["metalad_core"],
    "dataset_id": ["metalad_core"],
    "dataset_version": ["metalad_core"],
    "subdatasets": ["metalad_core"]
  },
  "sources": [
    {
      "source_name": "metalad_core",
      "source_version": "1",
      "extraction_parameter": {},
      "extraction_time": 1674227278.5175016,
      "agent_name": "Michał Szczepanik",
      "agent_email": "m.szczepanik@fz-juelich.de"
    }
  ]
}
```

So I think we should probably have another session to debug.

mslw commented 1 year ago

Oof, that was a false alarm. Thanks for checking and for persevering with me. Long story short:

Sorry about that!

jsheunis commented 1 year ago

ALL GREEN, YAY!!!