datalad / datalad-catalog

Create a user-friendly data catalog from structured metadata
https://datalad-catalog.netlify.app
MIT License

Handbook tutorial breaks with datalad-catalog v1.1.0 #410

Closed: tmheunis closed this issue 8 months ago

tmheunis commented 8 months ago

I'm following the datacat tutorial and noticed that my locally generated catalog does not look the same as the one in the tutorial screenshots after adding metadata, see screenshot:

Screenshot 2024-01-16 at 13 16 09

I looked at the config of the created catalog, and it includes rules for metadata sources that do not seem applicable to the tutorial. For example, the sources in the tutorial metadata are:

"metadata_sources": {
      "key_source_map": {},
      "sources": [
         {
            "source_name": "stephan_manual",
            "source_version": "1",
            "source_parameter": {},
            "source_time": 1652340647.0,
            "agent_name": "Stephan Heunis",
            "agent_email": ""
         }
     ]
   }

but the config includes metadata source rules for metalad_studyminimeta and other sources. It seems like the config might not be a good match for the toy dataset of the tutorial, and might be preventing the correct fields from rendering?

I tried editing the config to include stephan_manual as a source (priority or single) and then re-added the metadata, but this didn't change anything, neither in the catalog rendering nor in the catalog metadata itself. This is an example from the catalog metadata after adding it:

datalad catalog-add --catalog data-cat --metadata toy_metadata.jsonl
/Users/theunis/virtualenvs/catalog/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
catalog_add(ok): data-cat [Metadata record successfully added to catalog (dataset: dataset_id=5df8eb3a-95c5-11ea-b4b9-a0369f287950, dataset_version=dae38cf901995aace0dde5346515a0134f919523)]
catalog_add(ok): data-cat [Metadata record successfully updated in catalog (filetree of dataset: dataset_id=5df8eb3a-95c5-11ea-b4b9-a0369f287950, dataset_version=dae38cf901995aace0dde5346515a0134f919523)]
catalog_add(ok): data-cat [Metadata record successfully updated in catalog (filetree of dataset: dataset_id=5df8eb3a-95c5-11ea-b4b9-a0369f287950, dataset_version=dae38cf901995aace0dde5346515a0134f919523)]
action summary:
  catalog_add (ok: 3)

which gives:

{
...
    "name": null,
    "short_name": "My toy dataset",
    "description": null,
    "url": "https://github.com/jsheunis/multi-echo-super",
...
}

So somehow the name and description fields are not set?
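To illustrate why a source mismatch could leave these fields null, here is a minimal Python sketch of how "single", "priority" and "merge" rules might select a value. It is a hypothetical model, not datalad-catalog's actual implementation: the function `resolve_property`, the treatment of an empty rule, and the placeholder value "Example dataset name" are all assumptions made for illustration. Only the rule names and source names come from the config discussed here.

```python
# Hypothetical model of property-source rules; NOT datalad-catalog's code.

def resolve_property(rule_spec, values_by_source):
    """Pick a value for one property from per-source candidate values."""
    rule = rule_spec.get("rule")
    source = rule_spec.get("source")
    if rule == "single":
        # Only the one named source is accepted.
        return values_by_source.get(source)
    if rule == "priority":
        # The first listed source that supplied a value wins.
        for name in source:
            if name in values_by_source:
                return values_by_source[name]
        return None
    if rule == "merge":
        # Collect the values from every source.
        return list(values_by_source.values())
    # Empty rule spec (assumption): accept a value from any source.
    return next(iter(values_by_source.values()), None)


# Metadata arrives from one source only, as in the tutorial.
incoming = {"stephan_manual": "Example dataset name"}

default_rule = {"rule": "single", "source": "metalad_studyminimeta"}
edited_rule = {"rule": "single", "source": "stephan_manual"}

name_with_default = resolve_property(default_rule, incoming)
name_with_edit = resolve_property(edited_rule, incoming)
print(name_with_default)  # None
print(name_with_edit)     # Example dataset name
```

Under this model, a "single" rule that names a source absent from the incoming metadata yields no value, which would match the null name above, while a property with an empty rule (like short_name) still gets set.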

Then I called catalog-add while explicitly passing the updated config file. This seems to work and the catalog renders correctly.

datalad catalog-add --catalog data-cat --metadata toy_metadata.jsonl --config-file data-cat/config.json
/Users/theunis/virtualenvs/catalog/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
catalog_add(ok): data-cat [Metadata record successfully added to catalog (dataset: dataset_id=5df8eb3a-95c5-11ea-b4b9-a0369f287950, dataset_version=dae38cf901995aace0dde5346515a0134f919523)]
catalog_add(ok): data-cat [Metadata record successfully updated in catalog (filetree of dataset: dataset_id=5df8eb3a-95c5-11ea-b4b9-a0369f287950, dataset_version=dae38cf901995aace0dde5346515a0134f919523)]
catalog_add(ok): data-cat [Metadata record successfully updated in catalog (filetree of dataset: dataset_id=5df8eb3a-95c5-11ea-b4b9-a0369f287950, dataset_version=dae38cf901995aace0dde5346515a0134f919523)]
action summary:
  catalog_add (ok: 3)

Screenshot 2024-01-16 at 13 26 01

Also, the catalog metadata fields that were previously null now have the actual correct value. I also noticed that there is a new (duplicate) config file inside the metadata/<added-dataset> folder in the catalog.

tmheunis commented 8 months ago

Another data point: when I try to get the config of the created catalog after my edits to the config file, it gives the wrong result.

This is the config file that I edited and that I passed to the catalog-add command above:

cat data-cat/config.json | jq .
{
  "catalog_name": "DataCat",
  "logo_path": "",
  "link_color": "#fba304",
  "link_hover_color": "#af7714",
  "social_links": {
    "about": null,
    "documentation": "https://docs.datalad.org/projects/catalog/en/latest/",
    "github": "https://github.com/datalad/datalad-catalog",
    "mastodon": "https://fosstodon.org/@datalad",
    "x": "https://x.com/datalad"
  },
  "dataset_options": {
    "include_metadata_export": true
  },
  "property_sources": {
    "dataset": {
      "dataset_id": {
        "rule": "single",
        "source": "stephan_manual"
      },
      "dataset_version": {
        "rule": "single",
        "source": "stephan_manual"
      },
      "type": {
        "rule": "single",
        "source": "stephan_manual"
      },
      "children": {
        "rule": "merge",
        "source": "any"
      },
      "name": {
        "rule": "single",
        "source": "stephan_manual"
      },
      "short_name": {},
      "description": {
        "rule": "priority",
        "source": [
          "stephan_manual",
          "datacite_gin",
          "bids_dataset"
        ]
      },
      "doi": {},
      "url": {
        "rule": "merge",
        "source": "any"
      },
      "authors": {
        "rule": "merge",
        "source": "any"
      },
      "keywords": {
        "rule": "merge",
        "source": "any"
      },
      "license": {},
      "funding": {
        "rule": "merge",
        "source": "any"
      },
      "publications": {
        "rule": "merge",
        "source": "any"
      },
      "subdatasets": {
        "rule": "merge",
        "source": "any"
      },
      "additional_display": {
        "rule": "merge",
        "source": "any"
      },
      "top_display": {
        "rule": "merge",
        "source": "any"
      }
    }
  }
}

and this is the result of a catalog-get:

datalad catalog-get -c data-cat config | jq .
/Users/theunis/virtualenvs/catalog/lib/python3.9/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020
  warnings.warn(
{
  "catalog_name": "DataCat",
  "logo_path": "",
  "link_color": "#fba304",
  "link_hover_color": "#af7714",
  "social_links": {
    "about": null,
    "documentation": "https://docs.datalad.org/projects/catalog/en/latest/",
    "github": "https://github.com/datalad/datalad-catalog",
    "mastodon": "https://fosstodon.org/@datalad",
    "x": "https://x.com/datalad"
  },
  "dataset_options": {
    "include_metadata_export": true
  },
  "property_sources": {
    "dataset": {
      "dataset_id": {
        "rule": "single",
        "source": "metalad_core"
      },
      "dataset_version": {
        "rule": "single",
        "source": "metalad_core"
      },
      "type": {
        "rule": "single",
        "source": "metalad_core"
      },
      "children": {
        "rule": "merge",
        "source": "any"
      },
      "name": {
        "rule": "single",
        "source": "metalad_studyminimeta"
      },
      "short_name": {},
      "description": {
        "rule": "priority",
        "source": [
          "catalog_readme",
          "metalad_studyminimeta",
          "datacite_gin",
          "bids_dataset"
        ]
      },
      "doi": {},
      "url": {
        "rule": "merge",
        "source": "any"
      },
      "authors": {
        "rule": "merge",
        "source": "any"
      },
      "keywords": {
        "rule": "merge",
        "source": "any"
      },
      "license": {},
      "funding": {
        "rule": "merge",
        "source": "any"
      },
      "publications": {
        "rule": "merge",
        "source": "any"
      },
      "subdatasets": {
        "rule": "merge",
        "source": "any"
      },
      "additional_display": {
        "rule": "merge",
        "source": "any"
      },
      "top_display": {
        "rule": "merge",
        "source": "any"
      }
    }
  }
}

It looks like catalog-get still retrieves the original default config from the package, not the catalog's own config file that I edited.
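The difference between the two outputs is easy to miss by eye. A small Python sketch such as the following can list the dotted paths at which two nested configs disagree. The helper `diff_rules` is a hypothetical name, and the two dictionaries are trimmed-down excerpts of the configs shown above rather than the full files.

```python
def diff_rules(expected, actual, prefix=""):
    """Return dotted paths whose values differ between two nested dicts."""
    diffs = []
    for key in sorted(set(expected) | set(actual)):
        path = f"{prefix}.{key}" if prefix else key
        left, right = expected.get(key), actual.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Descend into nested sections such as property_sources.
            diffs.extend(diff_rules(left, right, path))
        elif left != right:
            diffs.append(path)
    return diffs


# Excerpt of the edited data-cat/config.json.
edited = {"property_sources": {"dataset": {
    "name": {"rule": "single", "source": "stephan_manual"},
    "children": {"rule": "merge", "source": "any"},
}}}
# Excerpt of what catalog-get reported.
reported = {"property_sources": {"dataset": {
    "name": {"rule": "single", "source": "metalad_studyminimeta"},
    "children": {"rule": "merge", "source": "any"},
}}}

changed = diff_rules(edited, reported)
print(changed)  # ['property_sources.dataset.name.source']
```

With the full files, both dictionaries could be read with `json.load` instead of being written inline.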

tmheunis commented 8 months ago

I found another issue, this time related to the section https://handbook.datalad.org/en/latest/beyond_basics/101-182-catalog.html#catalog-configuration.

The step involves creating a new catalog with a custom configuration via the command:

datalad catalog-create --catalog custom-cat --metadata toy_metadata.jsonl --config-file cat_config.yml

and then setting the relevant home page.

However, when serving the catalog, the browser displays nothing, just a blank screen. I inspected the page and saw several errors in the JavaScript console:

[Vue warn]: Error in render: "TypeError: Cannot read properties of undefined (reading 'about')"

(found in <Root>)
warn @ vue.js:634
vue.js:1906 TypeError: Cannot read properties of undefined (reading 'about')
    at Proxy.eval (eval at createFunction (vue.js:11698:14), <anonymous>:3:840)
    at Vue._render (vue.js:3572:24)
    at Vue.updateComponent (vue.js:4082:23)
    at Watcher.get (vue.js:4494:27)
    at Watcher.run (vue.js:4569:24)
    at flushSchedulerQueue (vue.js:4327:15)
    at Array.<anonymous> (vue.js:1998:14)
    at flushCallbacks (vue.js:1924:16)
logError @ vue.js:1906
app_component_dataset.js:47 subdatasets fetched!
app_component_dataset.js:49 from watcher
app_component_dataset.js:50 undefined
app_component_dataset.js:72 Object__ob__: Observer {value: {…}, dep: Dep, vmCount: 0}[[Prototype]]: Object
vue.js:634 [Vue warn]: Error in callback for watcher "dataset_ready": "TypeError: Cannot read properties of undefined (reading 'sources')"

found in

---> <Anonymous>
       <Root>
warn @ vue.js:634
vue.js:1906 TypeError: Cannot read properties of undefined (reading 'sources')
    at VueComponent.dataset_ready (app_component_dataset.js:109:66)
    at invokeWithErrorHandling (vue.js:1872:28)
    at Watcher.run (vue.js:4583:11)
    at flushSchedulerQueue (vue.js:4327:15)
    at Array.<anonymous> (vue.js:1998:14)
    at flushCallbacks (vue.js:1924:16)
logError @ vue.js:1906
:8000/metadata/5df8eb3a-95c5-11ea-b4b9-a0369f287950/dae38cf901995aace0dde5346515a0134f919523/config.json:1 

       Failed to load resource: the server responded with a status of 404 (File not found)

The last error looks like it's trying to find a config file within the dataset-level metadata directory, which fails. I checked this directory and can confirm that there is indeed no config.json file. On the catalog level, however, there is a config file:

cat custom-cat/config.json | jq .

{
  "catalog_name": "Toy Catalog",
  "logo_path": "artwork/datalad_logo_funky.svg",
  "link_color": "#32A287",
  "link_hover_color": "#A9FDAC",
  "property_sources": {
    "dataset": {}
  }
}
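One thing that stands out is that this custom config has no social_links entry, while the default config shown earlier does, and the first console error is about reading 'about' from something undefined. Whether that is the actual cause is not established here. As a minimal sketch, assuming the front end expects every default key to be present, a custom config could be overlaid on the defaults before being written out. The helper `with_defaults` is a hypothetical name, not part of datalad-catalog, and the default dictionary is a trimmed excerpt.

```python
import copy


def with_defaults(user_config, default_config):
    """Return user_config with any missing keys filled from default_config."""
    merged = copy.deepcopy(default_config)
    for key, value in user_config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections key by key instead of replacing them.
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Trimmed excerpt of the default config (assumption: only two keys shown).
default_config = {
    "catalog_name": "DataCat",
    "social_links": {"about": None},
}
# Trimmed excerpt of the custom config above.
custom_config = {
    "catalog_name": "Toy Catalog",
    "property_sources": {"dataset": {}},
}

merged = with_defaults(custom_config, default_config)
print(sorted(merged))  # ['catalog_name', 'property_sources', 'social_links']
```

The custom values win wherever they are given, and keys that the custom file omits keep their default value.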

jsheunis commented 8 months ago

Thanks for the detailed issue @tmheunis! Nice analysis!

A few things that I can derive from this:

I will follow up with a reproducer and hopefully a fix for all issues.

Lastly, I think it would be a good idea to rerun the tutorial once the fixes are applied and capture screenshots of the output and update the handbook chapter with those. There have been catalog UI changes since that are not reflected in the current handbook chapter version.

jsheunis commented 8 months ago

Update:

However, I don't think the linked issue is the problem, because it seems the catalog-get command (or rather its Python internals) gets the incorrect configuration. This is evident from your descriptions of both the catalog-get command and the catalog-add command (after editing the config), which does a get on the config internally. TODO: I need to inspect the Python code behind the command line calls as well as the WebCatalog methods that work with the configuration (at both catalog level and dataset level), because there is likely a bug in there.

The WebCatalog.get_config() method is indeed wrong: it uses the default config path from the package rather than the config path of the relevant catalog.
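The pattern of this bug can be sketched in a few lines of Python. This is an illustration only: `ToyCatalog` and its two methods are hypothetical stand-ins, not the real WebCatalog class, and the fallback to the default file in the second method is just one possible correction.

```python
import json
import tempfile
from pathlib import Path


class ToyCatalog:
    """Hypothetical stand-in for a catalog object; not the real WebCatalog."""

    def __init__(self, location, default_config_path):
        self.location = Path(location)
        self.default_config_path = Path(default_config_path)

    def get_config_buggy(self):
        # Always reads the packaged default and ignores the catalog's own file.
        return json.loads(self.default_config_path.read_text())

    def get_config_fixed(self):
        # Prefer the catalog's own config.json; fall back to the default.
        own = self.location / "config.json"
        path = own if own.exists() else self.default_config_path
        return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    default_path = tmp / "default_config.json"
    default_path.write_text(json.dumps({"catalog_name": "DataCat"}))

    cat_dir = tmp / "data-cat"
    cat_dir.mkdir()
    (cat_dir / "config.json").write_text(json.dumps({"catalog_name": "Edited"}))

    cat = ToyCatalog(cat_dir, default_path)
    buggy_name = cat.get_config_buggy()["catalog_name"]
    fixed_name = cat.get_config_fixed()["catalog_name"]

print(buggy_name, fixed_name)  # DataCat Edited
```

The first method returns the packaged default no matter what the catalog directory contains, which matches the catalog-get output reported above.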

I think your second problem relates to the first, in that the configuration isn't handled properly upon catalog creation and metadata addition. The web application is looking for a file metadata/5df8eb3a-95c5-11ea-b4b9-a0369f287950/dae38cf901995aace0dde5346515a0134f919523/config.json that isn't there. This is a dataset-level config. TODO: I need to inspect whether the correct logic is executed (should the commands be outputting a dataset-level config?) and I need to improve the JavaScript code so that it doesn't fail in this way if the expected file isn't available.

This assessment was incorrect. The actual problem is described here: https://github.com/datalad/datalad-catalog/issues/412