cidgoh / DataHarmonizer

A standardized browser-based spreadsheet editor and validator that can be run offline and locally, and which includes templates for SARS-CoV-2 and Monkeypox sampling data. This project, created by the Centre for Infectious Disease Genomics and One Health (CIDGOH), at Simon Fraser University, is now an open-source collaboration with contributions from the National Microbiome Data Collaborative (NMDC), the LinkML development team, and others.
MIT License

Document the fact that DH JSON is a bare list and not compatible with LinkML tools as is #390

Closed by turbomam 1 year ago

turbomam commented 1 year ago

The DH team is welcome to adopt the DH JSON -> LinkML JSON (and vice versa) converters that I wrote.

see

ddooley commented 1 year ago

Looking back on this, I think DH should input/output LinkML-native (JSON-LD) JSON directly in the browser, so we need to understand the JavaScript required to do so. The existing "File > Save as > .json" option could be renamed to "File > Save as > flat .json", and we could add a "File > Save as > LinkML .json" option for the pure version. This would avoid having to use command-line Python tools as an intermediary step.

@pkalita-lbl for comment.

(The LinkML data inlining options will come into play here later when we add 1-many data relations.)

pkalita-lbl commented 1 year ago

Let me see if I understand Mark's concern correctly. Suppose I have a schema that implements the typical LinkML container object pattern:

```yaml
id: http://example.org/test
name: test
imports:
  - linkml:types
prefixes:
  linkml: https://w3id.org/linkml/

slots:
  s1:
    range: string
  s2:
    range: string
  entries:
    range: Entry
    multivalued: true

classes:
  Entry:
    slots:
      - s1
      - s2
  EntrySet:
    tree_root: true
    slots:
      - entries
```
I could point DataHarmonizer to the Entry class and it would show me an interface with two columns (for s1 and s2). I could enter some data and then export that data to JSON through the interface. It would look something like:

```json
[
  {
    "s1": "row 1 col 1",
    "s2": "row 1 col 2"
  },
  {
    "s1": "row 2 col 1",
    "s2": "row 2 col 2"
  }
]
```

The issue is that I can't validate that file as-is using linkml-validate or using a generic JSON Schema validator and the JSON Schema derived from the LinkML schema. That's because LinkML doesn't really have a concept of an array at the root level -- hence the container object pattern.
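To make the mismatch concrete, here is a minimal, dependency-free sketch (not LinkML code; `root_type_ok` is an illustrative helper) of the root-level check that trips up the bare-list export. The JSON Schema that LinkML's generator derives for a container class like `EntrySet` has `"type": "object"` at the root, so a bare list is rejected before any row is even examined:

```python
# Minimal illustration of the root-type mismatch. This is NOT the real
# jsonschema library or LinkML validator -- it checks only the root-level
# "type" keyword, which is where the bare-list export fails.

def root_type_ok(data, schema):
    """Check only the root-level "type" keyword of a JSON Schema fragment."""
    expected = schema.get("type")
    if expected == "object":
        return isinstance(data, dict)
    if expected == "array":
        return isinstance(data, list)
    return True

# Shape (simplified) of the root that a generator would emit for EntrySet:
entryset_root = {"type": "object"}

bare = [{"s1": "row 1 col 1", "s2": "row 1 col 2"}]
print(root_type_ok(bare, entryset_root))                # False: the DH export fails
print(root_type_ok({"entries": bare}, entryset_root))   # True once wrapped
```

A full validator would of course go on to check each row against the `Entry` class, but it never gets that far with a list at the root.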

So what Mark is saying is that if DataHarmonizer could somehow produce JSON that instead looks like:

```json
{
  "entries": [
    {
      "s1": "row 1 col 1",
      "s2": "row 1 col 2"
    },
    {
      "s1": "row 2 col 1",
      "s2": "row 2 col 2"
    }
  ]
}
```

Now we have an object at the root level. That object corresponds to the EntrySet class in the schema and could be validated as such.

I don't have an exact proposal for how to resolve the situation, but it will probably involve a combination of logic to guess at the so-called container class and index slot (presumably via teaching DataHarmonizer to understand the tree_root metaslot), as well as ways to specify them manually (see also: https://linkml.io/linkml/data/csvs.html).
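As a rough sketch of what "teaching DataHarmonizer to understand the `tree_root` metaslot" could look like, the guessing logic might scan the schema for the class marked `tree_root: true` and pick its multivalued slot as the index slot under which to wrap the rows. The function names below (`find_container`, `wrap_rows`) are hypothetical, not part of DataHarmonizer or linkml-runtime, and the schema is represented as a plain dict (e.g. from `yaml.safe_load`) to keep the sketch self-contained:

```python
# Hypothetical sketch of bare-list -> container conversion. Operates on a
# schema parsed into a plain dict; a real implementation would likely use
# linkml-runtime's SchemaView instead.

def find_container(schema: dict):
    """Return (container class name, index slot name): the class marked
    tree_root: true and its (first) multivalued slot."""
    slot_defs = schema.get("slots", {})
    for cls_name, cls in schema.get("classes", {}).items():
        if cls.get("tree_root"):
            for slot_name in cls.get("slots", []):
                if slot_defs.get(slot_name, {}).get("multivalued"):
                    return cls_name, slot_name
    raise ValueError("no tree_root class with a multivalued index slot found")

def wrap_rows(rows: list, index_slot: str) -> dict:
    """Wrap DataHarmonizer's bare list of rows under the container's index slot."""
    return {index_slot: rows}

# The schema from the comment above, as a dict:
schema = {
    "classes": {
        "Entry": {"slots": ["s1", "s2"]},
        "EntrySet": {"tree_root": True, "slots": ["entries"]},
    },
    "slots": {
        "s1": {"range": "string"},
        "s2": {"range": "string"},
        "entries": {"range": "Entry", "multivalued": True},
    },
}

cls_name, index_slot = find_container(schema)
rows = [{"s1": "row 1 col 1", "s2": "row 1 col 2"}]
print(cls_name, index_slot)   # EntrySet entries
print(wrap_rows(rows, index_slot))
```

The manual-override path mentioned above would simply let the user supply `index_slot` directly instead of calling `find_container`.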